Precipitation Regionalization Using Self-Organizing Maps for Mumbai City, India

The detailed analysis of individual rain event characteristics is an essential step for improving our understanding of the variation in precipitation over different topographies. In this study, the homogeneity among rain gauges was investigated using the concept of “rain event properties,” linking them to the main atmospheric system that affects the rainfall in the region. For this, eight properties of more than 23,000 rain events recorded at 47 meteorological stations in Mumbai, India, were analyzed using seasonal (June-September) rainfall records over 2006-2016. High similarity among the properties indicated similarity among the rain gauges. Furthermore, similar rain gauges were identified, investigated and characterized by cluster analysis using self-organizing maps (SOMs). The cluster analysis yields six clusters of similarly behaving rain gauges, where each cluster is characterized by a distinct combination of rain event properties. Additionally, the clusters confirm the spatial variation of rainfall caused by the complex topography of Mumbai, comprising the flatland near the Arabian Sea, high-rise buildings (urban area) and mountain and hill areas (Sanjay Gandhi National Park, located in the northern part of Mumbai).


Introduction
Rainfall is an essential boundary condition for the design and operation of urban drainage systems. Compared with natural hydrology, detailed knowledge of the distribution of precipitation characteristics, such as duration, depth, and intensity, is more vital for operating such systems because of the small spatial scales involved and the short reaction times from rainfall to runoff. In Mumbai, the occur-represent the time series of rainfall amounts in the form of entities (rain events) that can be employed in specific applications [2]. Such individual rain events could also overcome the problem of a lack of long-term precipitation data at a specific temporal and spatial resolution [3] [4].
Based on a comprehensive review, the time series of rain gauge records is broken down into separate rain events by a defined dry period between rains, referred to as the minimum inter-event time (MIT), which is often combined with a minimum event depth [2] [5]. Dunkerley [6] noted that MIT values ranged from 3 min to 24 h, with minimum event depths ranging from 0.1 mm to 13 mm. Additionally, statistical techniques, such as autocorrelation analysis [7] [8] and multifractal and self-organized criticality theories [9], have been used. Many researchers [4] [10] [11] have assumed that the probability density of inter-event times (IETs) can be adequately represented by an exponential distribution and have applied the coefficient of variation (COV) method.
Therefore, the objective of this research was to study the spatial characteristics of individual events. To this end, individual events were identified at each station using MIT values calculated by the COV method. In addition to MIT values, numerous properties of individual events, such as total depth, total duration, mean rain rate, maximum rain rate, peak timing, previous inter-event time, fraction of event depth until event maximum and fraction of intra-event rainless periods, can be computed. The interdependence between these properties and IETs can be studied to understand precipitation characteristics in various regions, and this interdependence has been analyzed in various countries by researchers (e.g., Blue Nile River, Ethiopia, [4]; Italy, [12]; Illinois, [13]; the Czech Republic, [10]; Malaysia, [14] [15]; Australian drylands, [16]). In this study, we used eight properties (used by many researchers) to characterize rainfall.
The identified properties used for rainfall characterization can further be employed to cluster the rain gauges. Cluster analysis techniques have been used by researchers in various parts of the world, including India, e.g., correlation analysis [17] [18], principal component analysis [19], cluster analysis [20] [21], neural networks [22] and shared nearest neighbor (SNN) clustering [23]; however, most of these studies have been conducted on a national and not a regional scale.
Apart from the abovementioned techniques, self-organizing maps (SOMs) are a promising technique for cluster analysis [24] [25]. In this study, cluster analysis involved the use of SOMs to cluster rain gauges into groups (regions) such that the similarity of rain gauges within a region is maximized while the similarity between regions is minimized.
The research presented in this paper is motivated by the fact that most studies on this region used data from two India Meteorological Department (IMD) stations (Colaba and Santacruz) to investigate the formation and prediction of extreme events and design rainfall estimation, and that, until now, there have been no reports on MIT, rain event property analysis or rain gauge clustering using sub-hourly data. For example, Sherly et al. [26] proposed a framework for design precipitation estimation using a multivariate semiparametric approach, while Sen et al. [27] used a 43-year IMD rainfall dataset and performed on-site design rainfall estimation to quantify spatiotemporal variability. Additionally, Nayak and Ghosh [28] used a support vector machine and machine-learning-based statistical techniques to predict extreme rainfall. Singh et al. [29] noted that the unusual pattern of spatiotemporal behavior of extreme rainfall might be related to various physical variables, such as the Indian Ocean Dipole, the El Nino-Southern Oscillation along with the East Atlantic Pattern, and the high-speed wind blowing from the Arabian Sea carrying abundant moisture to the surrounding regions. Their results were based on a study of the spatiotemporal characteristics of extreme rainfall using two years of data from 26 rain gauges. Therefore, considering that previous studies have not focused on rain event properties, the present study investigates rain event properties and their interdependence with IETs, with Mumbai, India, as the study region. This study also seeks to answer the following questions:
1) What is the appropriate MIT for the entire Mumbai region using the COV method?
2) How can we analyze and understand the spatial characteristics of rainfall properties using SOMs?
3) Which rain gauges have similar properties, i.e., lie in the same cluster?
To answer the above questions, the sub-hourly data from a dense gauge network in Mumbai were analyzed. The paper is structured as follows. Section 2 provides a brief description of the study area, data source, and rain gauge characteristics. Section 3 describes the methodology used for the analysis of MIT, rain event properties and clustering using SOM. Section 4 presents the primary findings and their significance. Finally, Section 5 concludes the paper.

Study Area and Data Description
Mumbai is situated on the western coast of India and extends between 18.00˚ -

Methodology
This section describes the methodology used to achieve the objectives of this study.

Data Preparation and Determination of MIT
In this study, continuous rainfall data recorded by the rain gauges at the meteorological stations were analyzed on a sub-hourly basis. Furthermore, to ensure the reliability of the data and to identify suspect or incorrect values, a validation method was used. In the validation process, the data were not modified but were supplemented with an appropriate flag, namely "Suspect" or "Missing", followed by the application of a range test [30] and a double mass curve. Only stations with more than three years of data were considered in the process. The entire procedure resulted in the inclusion of 47 stations, with record lengths varying from 4 to 11 years. To ensure a near-uniform spatial distribution across regions, marginally deficient records with minor gaps in the sub-hourly precipitation records (gaps of up to a few days) were utilized.
Precipitation is recorded using a rain gauge, which measures the amount of water precipitated by clouds that reaches the ground. Based on homogeneity, the precipitation records can then be clustered into independent events. Separating and interpreting independent events from rain gauge time series is subjective. For example, in Figure 2, groups A, B, and C can each be considered an independent event, or all three groups may be part of one event. Another possibility is to combine A and B into one event and to consider group C another event. Hence, to determine the start and end times of such independent events, an MIT estimate is required. Accordingly, two storms separated by a rainless period shorter than the specified MIT are considered a single event, as shown on the left-hand side of Figure 2, and separate events otherwise, as shown on the right-hand side of Figure 2. In the present study, the optimal MIT was estimated using the COV method proposed by Restrepo-Posada and Eagleson [31] and applied by several researchers. This method uses a simple check based on the COV of the IETs. Assuming that the distribution of IETs is roughly exponential, which implies equal mean and standard deviation, the COV ought to be 1. Therefore, MIT values are systematically varied, and the MIT yielding COV = 1 is identified as optimal.
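The COV check described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the use of the population standard deviation, and the convention that dry spells at least as long as the trial MIT count as IETs are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def dry_spells(rain, dt_min):
    """Lengths (minutes) of rainless runs lying between two wet time steps."""
    spells, run, seen_wet = [], 0, False
    for depth in rain:
        if depth > 0:
            if seen_wet and run > 0:
                spells.append(run * dt_min)
            run, seen_wet = 0, True
        else:
            run += 1
    return spells

def cov_of_iets(spells, mit_min):
    """COV of the dry spells that separate events for a trial MIT (assumed: >= MIT)."""
    iets = [s for s in spells if s >= mit_min]
    if len(iets) < 2:
        return None  # too few inter-event times to compute a COV
    return pstdev(iets) / mean(iets)

def estimate_mit(rain, dt_min, candidates_min):
    """Return (mit, cov) for the candidate MIT whose COV is closest to 1."""
    spells = dry_spells(rain, dt_min)
    best = None
    for mit in candidates_min:
        cov = cov_of_iets(spells, mit)
        if cov is None:
            continue
        if best is None or abs(cov - 1.0) < abs(best[1] - 1.0):
            best = (mit, cov)
    return best
```

With real records, `rain` would be the sub-hourly depth series of one station and `candidates_min` a range of trial values such as 15 min to 24 h.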

Self-Organized Maps (SOM)
Journal of Water Resource and Protection
The vital information contained in each rain event should be extracted using a limited set of well-chosen properties. However, there is no standard or generally acknowledged set of properties that can be used to precisely describe and summarize an event. Based on literature reviews, the properties often used in various studies are duration, depth, mean rain rate, maximum rainfall intensity and intra-event dry periods, and hydrology studies have used event peak intensity for an aggregation scale at different time steps [10] [14] [24] [32]. We also include properties describing the fraction of intra-event rainless periods, the previous inter-event time and the position of the peak in the event (time of peak and fraction of event depth until event maximum), which are relevant to overland flow generation, runoff, and infiltration [33] [34] [35] [36]. In this study, eight properties were obtained for each of the rain events derived from the 47 rain gauges, and these are listed in Table 2.
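As an illustration of how the eight properties could be derived from a single event, the following Python sketch computes them from a list of depths per time step. The exact definitions in Table 2 are not reproduced here, so the formulas (for example, time of peak as the relative position of the first maximum) and all names are assumptions, not the authors' definitions.

```python
def event_properties(depths_mm, dt_min, prev_iet_h):
    """Eight illustrative properties of one rain event.

    depths_mm  : rainfall depth per time step (mm), first and last step wet
    dt_min     : time step length in minutes
    prev_iet_h : dry time since the previous event in hours
    """
    if not depths_mm or sum(depths_mm) <= 0:
        raise ValueError("an event needs at least one wet time step")
    n = len(depths_mm)
    dt_h = dt_min / 60.0
    total = sum(depths_mm)
    duration_h = n * dt_h
    # Index of the first time step with the maximum depth
    peak_idx = max(range(n), key=lambda i: depths_mm[i])
    return {
        "depth_mm": total,
        "duration_h": duration_h,
        "mean_rate_mm_h": total / duration_h,
        "max_rate_mm_h": depths_mm[peak_idx] / dt_h,
        "time_of_peak": (peak_idx + 1) / n,  # relative position in the event
        "frac_depth_until_max": sum(depths_mm[:peak_idx + 1]) / total,
        "frac_rainless": sum(1 for d in depths_mm if d == 0) / n,
        "prev_iet_h": prev_iet_h,
    }
```

Each event then becomes one vector of property values, which is the form of input assumed for the clustering step.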
The event characteristics at each station described by the properties (Table 2) can be seen as a vector in a high-dimensional data space. We used the SOM method [37] [38], a flexible data-mining technique, for the exploratory analysis of such high-dimensional data spaces of the studied rain gauges. A SOM is an unsupervised learning algorithm based on artificial neural networks that produces a low-dimensional representation of a high-dimensional input dataset. SOMs can be used for a variety of operations in exploratory data analysis, such as clustering, data compression, non-linear projection and pattern recognition.
A SOM comprises cells that are arranged on a regular grid. Each cell is represented by a d-dimensional weight vector and is associated with nearby cells by a neighborhood relation, which determines the structure, i.e., the topology, of the resulting SOM. The SOM is then created through iterative training: in each step, an input vector corresponding to a data sample in the given data matrix is picked at random, and the distances between it and all weight vectors of the SOM are computed. The cell whose weight vector is nearest to the input vector is that input vector's best-matching unit (BMU). After obtaining the BMU, the weight vectors are updated so that the BMU and its neighbors are moved towards the input vector.
The SOM is then trained with the net effect of the whole dataset by the batch algorithm, which computes an average of the data samples weighted by the neighborhood function of each data sample at its BMU.
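The BMU search and neighborhood update can be sketched in a few lines of Python. This is a simplified illustration, not the MATLAB toolbox used in this study: it applies the sequential update rule rather than the batch variant, uses a rectangular rather than hexagonal grid, and assumes a Gaussian neighborhood with linearly decaying learning rate and radius. All function and parameter names are hypothetical.

```python
import math
import random

def bmu_index(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)))

def train_som(data, rows, cols, epochs=20, lr0=0.5, radius0=3.0, seed=0):
    """Sequential SOM training on a rectangular rows x cols grid (assumed layout)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial weights in [0, 1), matching data normalized to [0, 1]
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    steps = epochs * len(data)
    for t in range(steps):
        frac = t / steps
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # decaying neighborhood radius
        x = rng.choice(data)                        # randomly picked input vector
        br, bc = grid[bmu_index(weights, x)]
        for k, (r, c) in enumerate(grid):
            d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-d2 / (2.0 * radius ** 2))  # Gaussian neighborhood
            # Move the unit towards x; the BMU (h = 1) moves the most
            weights[k] = [w + lr * h * (v - w) for w, v in zip(weights[k], x)]
    return grid, weights
```

The default learning rate and radius mirror the rough-tuning values quoted in the text; a second call with smaller values would correspond to a fine-tuning phase.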
In this paper, we ran the SOM tool in MATLAB to create a SOM using the SOM algorithm described in Vesanto et al. [39]. The training dataset considered in this analysis comprises eight properties, along with three geographic coordinates (x, y and z positions) of each rain gauge. The final dataset has an overall dimension of 11 attributes and a size of about 23,000 data samples. Prior to training, the data were normalized to the [0, 1] interval by a linear transformation to avoid a disproportionate influence of high index values on the training. On the other hand, geospatial data have particular features, such as variation at all scales and gradual, fuzzy or vague changes, which raise some issues in the use of the SOM algorithm. However, these issues can be resolved using recommended approaches such as data pre-processing (normalization and attribute weighting) and geoinitialization [40] [41]. Therefore, before starting the analysis, the data were normalized, and to preserve the geospatial aspects, the x variable was scaled down using the ratio max(x)/max(y) and was weighted with a value of eight.
SOM training is usually performed in two phases, viz. rough tuning and fine-tuning. A relatively large initial learning rate (0.5) and neighborhood radius (3) are used in the rough-tuning phase, while in the fine-tuning phase both values are smaller from the beginning (0.05 and 1). Before the algorithm is applied, a few parameters need to be defined; together they play an important role in the SOM algorithm because they can influence the result obtained. The parameters used for training the SOM are summarized in Table 3. The MATLAB package suggested a map of size 49 × 15 for the data, based on the hexagonal lattice. However, for an easier presentation of the map, we built a smaller map of size 16 × 5 in this study. This increased the quantization error from 0.420 to 0.559 but reduced the topographic error from 0.126 to 0.052.
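The normalization and the two map-quality measures mentioned above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the toolbox implementation: units are laid out on a rectangular grid, two units count as neighbors when their Manhattan grid distance is 1, and the function names are hypothetical.

```python
import math

def minmax_normalize(column):
    """Linearly rescale one attribute to the [0, 1] interval."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

def _two_best(weights, x):
    """Indices of the best and second-best matching units for sample x."""
    order = sorted(range(len(weights)), key=lambda k: math.dist(weights[k], x))
    return order[0], order[1]

def quantization_error(data, weights):
    """Mean distance between each sample and the weight vector of its BMU."""
    return sum(math.dist(x, weights[_two_best(weights, x)[0]])
               for x in data) / len(data)

def topographic_error(data, weights, grid):
    """Share of samples whose two best units are not adjacent on the grid."""
    bad = 0
    for x in data:
        a, b = _two_best(weights, x)
        (ra, ca), (rb, cb) = grid[a], grid[b]
        if abs(ra - rb) + abs(ca - cb) > 1:  # assumed adjacency rule
            bad += 1
    return bad / len(data)
```

A smaller map has fewer units to represent the data, so a higher quantization error alongside a lower topographic error, as reported above, is a plausible trade-off.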
An ideal SOM analysis produces outcomes so clear that the visualized maps can be reliably interpreted simply by looking at them, even though additional partitioning that uses the SOM as an intermediate step is often recommended to obtain more precise outcomes [39] [42]. In this study, popular visualizations of SOMs, such as the U-matrix, component planes, the assignment of rain gauges to neurons and the distribution of index properties represented by bar charts, are used. Furthermore, the study uses hierarchical clustering, an unsupervised method, for clustering the SOM [43]. The approach begins with single data points as individual clusters, and at each step, the nearest pair of clusters is merged until one cluster remains. Hence, the approach is also known as the agglomerative approach and calls for a definition of cluster proximity. In this investigation, cluster proximity is defined as the average pairwise proximity among all pairs of points in different clusters and is represented by the average group distance. The outcome is a dendrogram, a tree-like diagram that shows both the cluster and sub-cluster relationships and the order in which the clusters were merged. The closeness of the clusters is depicted by the lengths of the branches, and the data items can be clustered by cutting the dendrogram.
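The agglomerative procedure with average group distance can be written compactly in Python. The sketch below is a naive illustration, not the routine used in this study: it recomputes all pairwise averages at every step, and stopping at a requested number of clusters stands in for cutting the dendrogram. The function name and return format are assumptions.

```python
import math

def average_linkage(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, average_distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def avg_dist(a, b):
        # Average pairwise distance between points of two different clusters
        return sum(math.dist(points[i], points[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters], merges
```

Applied to SOM weight vectors rather than raw events, `points` would hold one vector per neuron, and the merge history carries the information a dendrogram displays.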

Minimum Inter-Event Time (MIT)
The MIT values were estimated using the COV method. In this approach, for all stations, trial MIT values were varied from 15 min to 24 h. For each trial MIT, events were identified, and the COV of the IETs was obtained. The outcomes are shown in Figure 3, where the COV is noted to decrease with increasing MIT.
The approximate MIT value for which COV = 1 was assessed for each station; Table 1 shows the appropriate MIT values for all stations. The average MIT for the study area was 5 h, with a minimum of 2 h at F North station and a maximum of 8 h at Workshop Kandivali station. This average appears reasonable, as an MIT of less than 6 h has been suggested for most urban applications [4] [14] [16]. The average annual number of storm events that occurred in Mumbai during the southwest monsoon season in the studied period at each station is given in Table 1; it varies from 25 to 121.

Self-Organized Maps (SOM)
The resulting dataset considered in this analysis comprises eight essential properties, along with three geographic coordinates (x, y and z positions) of each rain gauge. The U-matrix (Figure 4(a)) shows two distinctive parts of the map: blue-colored areas in the southwest part indicate units with a high level of similarity, which can be viewed as discrete clusters. A column of red and orange colors at the center of the south side separates these neurons from the rest of the map and forms a cluster border. Further, at least 4-5 clusters can be discerned in the data. The SOM grid with the assignment of the 47 available rain gauges to neurons (Figure 4(c)) indicates that the gauges marked together on one neuron can be considered as having comparable behavior and form the smallest unit of a cluster. The SOM reduced the variability of the 47 rain gauges to 40 neurons. Additionally, dendrogram cluster analysis was applied to the SOM to reduce the number of clusters further. By assigning the rain gauges to neurons (Figure 4(c)), we can group the rain gauges into six sensible clusters, C1 to C6. The spatial distribution of the clustered rain gauges (Figure 5) shows that rain gauges with similar response behavior are grouped together.
A marginal rain gauge is identified as a rain gauge that is labeled to neighboring

Cluster Analysis
In this study, we assumed that two or more rain gauges behave similarly and are grouped by training a SOM and implementing hierarchical clustering. With the information from the component planes (Figure 4(b)) and the distribution of properties for each neuron, we can characterize each cluster by a unique combination of aspects of the rain event characteristics, listed in
4(b)) of the fraction of event depth until event maximum and time of peak highlighted that the events in this area received the maximum amount of rain before reaching high intensity, with peaks in the second or third quartile. This is a result of the southwestern trade winds that carry significant moisture inland from the Arabian Sea in the south. Due to the high moisture, this region experiences heavy monsoonal rainfall and hence has a typical tropical monsoon climate.
Cluster 2 (C2) comprises nine sites, covers the southwest and extends further inland. This cluster represents a lowland urban area with high-rise buildings and is characterized by a long duration, very high intensity and a high amount of rainfall. This region experiences peaks of rainfall in the first quartile of the duration (low time of peak value), with most of the rainfall occurring after the peak. This may be due to the high-rise buildings in this region.
Cluster 3 (C3) consists of four sites located between the Sanjay Gandhi National Park and Chembur hills. Although distinct from cluster 2, this cluster exhibits comparable characteristics: average duration, very high intensity and a high amount of rainfall. This cluster does not present a clear picture of the rainfall peaks within an event, which may be due to the funneling action of the hills.
Cluster 4 (C4) is located predominantly in the western part of the city between the Arabian Sea and Sanjay Gandhi National Park. It is characterized by low previous IET, high precipitation intensity, long duration and average depth events, with maximum intensity during the first quartile of the event. Furthermore, the fraction of event depth until event maximum varies in this region between 0.4 and 0.7, indicating that most storm events may receive roughly equal amounts of rainfall before and after the high-intensity peak, which may be attributed to the high wind gusts caused by the funneling action of the hills.
Cluster 5 (C5) consists of 10 sites located in the flatland that lies between the Arabian Sea and the windward side of the Sanjay Gandhi National Park, predominantly in the northern part of the city. The characteristics of this cluster are similar to those of cluster 3, except that this cluster has a very low IET and a high time of peak value.
Cluster 6 (C6) comprises four sites, located predominantly in the northeastern part of the Sanjay Gandhi National Park on the leeward side of the hills. Results demonstrate that this region experiences minimal rainfall in terms of frequency, duration, and intensity (see Table 4). Additionally, the low values of the properties (Rmax, fraction of event depth until event maximum, and depth) highlight that the events received by this region are smooth, with few sharp peaks. This may be due to the shadowing effect of the hills, resulting in a decrease in rainfall amount.
The results confirm the effect of complex topography, namely the flatland near the Arabian Sea, high-rise buildings (urban area), and mountain and hill areas (Sanjay Gandhi National Park, located in the northern part), on the spatial variation of rainfall. The results also highlight that the rain gauges within cluster 2 (C2), located in the urban pocket, received intense precipitation, which supports the findings of Paul et al. [44], who showed that the urban signature of extreme precipitation is reflected in the rainfall recorded by stations only when the stations are located within urban pockets affected by intense precipitation. Considering the following factors, we did not regroup the sites further:
1) To avoid losing valuable information on precipitation during the analysis of dissimilar sites (M-10, M-27, and M-35), we did not exclude these sites.
2) Avoiding extraordinary events is unlikely, since we consider it to be an intolerable hindrance to the process of evaluating them independently.

Conclusions
The primary objective of this research was to study the spatial characteristics of rain events in Mumbai. This was achieved by computing a SOM with 11 attributes representing the characteristics of rain events, derived from sub-hourly rain gauge data from 2006-2016. The following conclusions can be drawn:
1) This study emphasizes the need for event analysis over the traditional use of sample analysis and brings out the advantages of event analysis in the case of limited data availability. Sample analysis involving excessively long integration times (hours or days) usually includes rainy and clear-sky periods in the same sample. Furthermore, it combines distinct physical processes of the rainfall event. Such long integration times can lead to the mixing of rainy and clear-sky observations. On the other hand, data acquired over a short time (minutes) are sensitive to the sensor's characteristics (detection threshold, sensor area, and

Figure 1. Unique codes and locations of rain gauges at meteorological stations throughout Mumbai with SRTM.

Figure 2. Diagram illustrating the classification of rain events using inter-event times.

Figure 4. Representations of a SOM: (a) U-matrix: neurons of the SOM are labeled by numbers, indicating structures formed by visualizing distances between neighboring neurons and, on additional hexagons between neurons, medium distances between two neurons; (b) Component planes for each index display mean values of each vital property on the neurons of the SOM; (c) Assignment of rain gauges using neuron labels; gauge IDs correspond to their BMU.

Figure 5. Location and clustering of each rain gauge using SOM.

Table 1. Basic characteristics of rain gauges at meteorological stations throughout Mumbai.

Table 2. Variables characterizing rain events in this study.

Table 3. Parameters for SOM analysis.

Table 4. Characteristics of clusters C1 to C6 based on rain event properties.