Statistical Analysis of Precipitation Extremes in São Francisco River Basin, Brazil

This study presents a climatology of the spatial variability of precipitation over the São Francisco River Basin (SFRB), a region characterized by its geographic heterogeneity. The different rainfall regimes in the region were analyzed through statistical and spectral analyses. Measured precipitation data, Pacific Decennial climate indexes, ENSO, the Atlantic Multidecadal Oscillation, the North Atlantic Oscillation, the Atlantic dipole, and the sunspot cycle over 65 years were used. The rainfall data were filtered and gaps were filled using the regional weighting method. The spatial and temporal variability of precipitation along the SFRB is remarkable. A pattern was observed along the time series of precipitation over the SFRB. The cluster analysis identified four homogeneous regions in the SFRB and explained 87.4% of the total variance of the average monthly rainfall of the 199 rain gauges. The cross-wavelet analysis identified the relationship between the precipitation data series and the climatic indexes analyzed in this work.


Introduction
The hydrological cycle describes the natural flow of water in its liquid, solid, and gaseous states through the atmosphere, hydrosphere, cryosphere, lithosphere, and biosphere. Water volumes vary in quantity and quality throughout the Earth system, from virtually unlimited in the oceans to null over large desert areas of the lithosphere [1] [2] [3].
The terrestrial branch of the hydrological cycle is of great interest at the watershed scale [4]. The watershed is a region where the precipitation converges to São Francisco (Pernambuco, Alagoas and Sergipe) [14] [15].
Rainfall in this region is produced by the South Atlantic Convergence Zone (SACZ), Frontal Systems (FS), wave disturbances in the trade winds, the South Atlantic Subtropical High, and sea and land breezes [13] [16]. The SFRB is heterogeneous, with four climatic types: tropical altitude at the headwaters; tropical in the central region between Minas Gerais and Bahia; semi-arid between the north of Bahia and the east of Sergipe and Alagoas; and humid coastal at the mouth [17] [18].

Database
The precipitation data series of the National Water Agency (ANA) for SFRB was

Precipitation Data Quality Control
ANA's precipitation database comprises 10,637 rain gauges across Brazil.
Precipitation data quality is a challenge, since many precipitation time series lack consistency, have missing data, and are short. Furthermore, the rain gauge network is not evenly distributed, as shown in Figure 2(a) and in [18]. The network is sparse in Northern Brazil, especially in the Amazon, and denser in parts of the Northeast (Ceará State), the South (Paraná State), and the Southeast (São Paulo and Rio de Janeiro States). Thus, the whole ANA precipitation database passed through a statistical data control procedure that reduced it to 3427 time series, with the spatial distribution shown in Figure 2(b). The selected time series begin in 1951.
Missing precipitation data is a cumbersome problem [19]. [20] used data filling methods such as regression equations with least-squares adjustment for all-season available information, and the regionally weighted method, which is based on weighted averages of three or more neighboring rain gauge time series. The latter method yielded better results, in agreement with [21], and was used here to fill missing precipitation data using three neighbors (n = 3):

$$P_x = \frac{1}{n} \sum_{i=1}^{n} \frac{Pm_x}{Pm_i} P_i$$

where $P_x$ is the precipitation estimate for the missing monthly data; $P_i$ is the precipitation of the i-th neighboring rain gauge; and $Pm_{x(i)}$ is the long-term time average of the x (i) precipitation time series.
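
The gap-filling step can be illustrated with a minimal sketch. It assumes the usual form of the regional weighting method, in which each neighbor's monthly value is scaled by the ratio of the target gauge's long-term mean to the neighbor's long-term mean and the scaled values are averaged. The function name, argument names, and sample numbers are hypothetical and are not taken from the study.

```python
def regional_weighting(neighbor_values, neighbor_means, target_mean):
    """Estimate a missing monthly total at the target gauge (hypothetical helper).

    neighbor_values: precipitation P_i at each neighboring gauge for the missing month
    neighbor_means:  long-term mean Pm_i of each neighboring series
    target_mean:     long-term mean Pm_x of the target series
    """
    if len(neighbor_values) != len(neighbor_means):
        raise ValueError("one long-term mean is needed per neighbor")
    if not neighbor_values:
        raise ValueError("at least one neighbor is required")
    if any(m <= 0 for m in neighbor_means):
        raise ValueError("neighbor long-term means must be positive")
    n = len(neighbor_values)
    # Average of the neighbor values, each rescaled to the target gauge's climatology.
    scaled = (target_mean / m * p for p, m in zip(neighbor_values, neighbor_means))
    return sum(scaled) / n


# Three neighbors (n = 3), as used in the text; all values in mm and invented.
estimate = regional_weighting(
    neighbor_values=[95.0, 110.0, 70.0],
    neighbor_means=[100.0, 120.0, 80.0],
    target_mean=90.0,
)
print(f"estimated monthly precipitation: {estimate:.1f} mm")
```

The ratio of long-term means keeps a wet neighbor from inflating the estimate at a drier target gauge, which is the reason for weighting rather than taking a plain average.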

Cluster Analysis
Cluster analysis groups individuals according to their similar characteristics, or homogeneity. Individuals that resemble each other are classified as similar, and those that differ are classified as dissimilar. The resemblance between individuals is quantified by a proximity measure, which may express either similarity or dissimilarity [22]. For a similarity measure, the larger the value, the more alike the individuals are, and values close to zero indicate little resemblance. For a dissimilarity measure the reverse holds: the larger the value, the less alike the individuals, and the smaller the value, the more alike they are [22]. Cluster analysis thus separates data into groups whose identities are not known in advance. This more limited state of knowledge contrasts with discrimination methods, which require a training dataset in which group membership is known. Cluster analysis is primarily an exploratory data analysis tool. Given a sample of data vectors x defining the rows of an (n × K) data matrix [X], the procedure defines groups and assigns group memberships at different levels of aggregation [23].
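
A minimal sketch of the idea follows. It assumes Euclidean distance as the dissimilarity measure and average linkage as the merging rule; the text states neither, so both are illustrative choices. The gauge vectors are invented, with K = 3 values per gauge for brevity.

```python
import math


def euclidean(a, b):
    """Dissimilarity between two data vectors: larger means less alike."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, n_groups):
    """Merge the two closest clusters repeatedly until n_groups remain.

    Cluster-to-cluster distance is the mean pairwise distance between
    members (average linkage, an assumption of this sketch).
    """
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per row of [X]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    euclidean(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Rows of a hypothetical (n x K) matrix: 6 gauges, 3 monthly means each (mm).
gauges = [
    [10, 12, 11], [11, 13, 10],   # dry regime
    [80, 90, 85], [82, 88, 86],   # wet regime
    [200, 5, 3], [198, 6, 4],     # strongly seasonal regime
]
print(agglomerate(gauges, n_groups=3))
```

Stopping the merging at different values of `n_groups` corresponds to reading the grouping at different aggregation levels.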

Principal Component Analysis (PCA)
PCA was introduced by Pearson in 1901 and further developed by Hotelling in 1933. It is a multivariate technique used to highlight and identify strong patterns of variation in a dataset, so as to facilitate its interpretation and visualization [22] [24].
According to [23], this multivariate statistical technique is widely used in Meteorology. It became popular through [25], who analyzed atmospheric data and

Precipitation Anomaly
Precipitation anomaly is defined as the deviation from the long-term precipitation average:

$$A = P_i - P_m$$

where $A$ is the precipitation anomaly; $P_i$ is the monthly precipitation; and $P_m$ is the long-term precipitation average for the month i.
The anomaly indicates the variability and extreme fluctuations in a given time series, with marked deviations beyond the meteorological fluctuations of the observed sample [26]. It highlights the variability of precipitation regimes at different time scales and supports the analysis of the cycles that modulate them [27].
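
As a minimal sketch, the anomaly can be computed by subtracting from each monthly total the long-term mean of the same calendar month. The function name and the sample values are hypothetical; the climatology here is taken over the whole input record, which is an assumption of this sketch.

```python
from collections import defaultdict


def monthly_anomalies(months, precip):
    """Return P_i - P_m for each record.

    months: calendar month (1-12) of each record
    precip: monthly precipitation total of each record (mm)
    P_m is the mean of all records that share the same calendar month.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, value in zip(months, precip):
        totals[month] += value
        counts[month] += 1
    climatology = {month: totals[month] / counts[month] for month in totals}
    return [value - climatology[month] for month, value in zip(months, precip)]


# Two invented years containing only January and July, in mm.
months = [1, 7, 1, 7]
precip = [100.0, 20.0, 140.0, 40.0]
print(monthly_anomalies(months, precip))
```

Positive values mark months wetter than their long-term mean and negative values mark drier months.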

Cross-Wavelet Analysis
The cross-wavelet transform (OCD) analyzes similarities and correlations between time series of variables to identify possible incoherence (out-of-phase behavior) [28] [29] [30] [31]. The OCD analyzes two periodic, non-stationary time series that might be related. It is based on the wavelet transform [32].
Phase difference and its interpretation. From [34].
The relationship between phases is given by the arrows in the OCD diagram. A phase difference of zero means that the cycles move together at certain periods [35] [36] [37]. A phase difference of ±π indicates that the time series cycles are shifted by 180°, i.e., a perfectly negative correlation (y or x leading). Arrows pointing up mean that the y (or x) time series leads the other by 90°, while arrows pointing down indicate that the x (or y) time series leads by 90°.
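
For reference, the commonly used definition of the cross-wavelet transform and its phase angle can be written as follows. The notation is an assumption, since the text does not give the formula: $W^{X}$ and $W^{Y}$ denote the wavelet transforms of the two series, $s$ the scale, $t$ the time, and the overbar the complex conjugate.

```latex
W^{XY}(s,t) = W^{X}(s,t)\,\overline{W^{Y}(s,t)}, \qquad
\phi(s,t) = \operatorname{atan2}\!\big(\Im\,W^{XY}(s,t),\ \Re\,W^{XY}(s,t)\big)
```

With this definition, $\phi = 0$ corresponds to series moving in phase and $\phi = \pm\pi$ to series in anti-phase, matching the interpretation of the arrows described above.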

Boxplot
The boxplot diagram in Figure 4 is a quantitative exploratory data analysis (EDA) method, also known as a schematic plot, defined by the first and third quartiles and by the median.
For interpretation, there are two limits, one lower and one upper; values below or above these limits are called outliers, and the median is the central value [38].
The lower edge of the box, the median, and the upper edge are the 1st, 2nd, and 3rd quartiles, below which lie 25%, 50%, and 75% of the dataset, respectively. The outliers are the extremes [39].
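
The quantities behind a boxplot can be sketched with the Python standard library. The lower and upper limits are placed at 1.5 times the interquartile range beyond the quartiles, which is Tukey's common convention and an assumption here, since the text does not state how its limits are defined. The sample values are invented.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Quartiles, limits, and outliers for a boxplot (whisker factor is assumed)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - whisker * iqr
    upper_limit = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }


# Hypothetical monthly totals (mm) with one extreme month.
rain = [24, 30, 35, 40, 45, 50, 55, 60, 70, 400]
summary = boxplot_summary(rain)
print(summary["median"], summary["outliers"])
```

Different quartile interpolation rules give slightly different box edges for small samples, so the `method` argument should match whatever plotting software is used.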

Standardized Precipitation Index
The Standardized Precipitation Index (SPI) is a widely used methodology to quantify droughts and rainfall extremes in a given study area. [40] analyzed the normalized monthly precipitation data using the probability distribution function that describes the time series. Table 1 shows SPI values for wet (positive) and dry (negative) conditions. Wet conditions begin at SPI = 1 and droughts at SPI = −1. An SPI with absolute value lower than 1 indicates normal conditions [31] [41] [42].
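
The standardization idea can be sketched as below. This is a simplified, rank-based stand-in and not the procedure of [40]: instead of fitting a probability distribution to the series, it assigns each value an empirical cumulative probability (Gringorten plotting position, an assumption of this sketch) and maps it to a standard normal quantile. The classification uses only the ±1 thresholds stated in the text; function names and data are hypothetical.

```python
from statistics import NormalDist


def empirical_spi(precip):
    """Rank-based standardized index: empirical probability -> normal quantile.

    Tied values receive different ranks in input order, a simplification.
    """
    n = len(precip)
    order = sorted(range(n), key=lambda k: precip[k])
    standard_normal = NormalDist()
    index = [0.0] * n
    for rank, k in enumerate(order, start=1):
        probability = (rank - 0.44) / (n + 0.12)  # Gringorten plotting position
        index[k] = standard_normal.inv_cdf(probability)
    return index


def classify(spi_value):
    """Wet at SPI >= 1, dry at SPI <= -1, normal in between."""
    if spi_value >= 1.0:
        return "wet"
    if spi_value <= -1.0:
        return "dry"
    return "normal"


# Ten invented totals (mm) for the same calendar month in different years.
totals = [10, 50, 200, 55, 60, 45, 52, 58, 48, 300]
spi = empirical_spi(totals)
for value, z in zip(totals, spi):
    print(f"{value:4d} mm  SPI = {z:+.2f}  {classify(z)}")
```

A distribution-based SPI would replace the plotting-position step with the cumulative probability from the fitted distribution; the final mapping to the standard normal and the classification would stay the same.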
According to the European Drought Observatory [43], the SPI is used for detecting and distinguishing drought conditions. The precipitation anomaly for a

Results and Discussion
Precipitation Analysis in the São Francisco River Basin
Figure 5 shows the location of the 199 rain gauge time series selected within the SFRB to perform a statistical diagnosis. Figure 6 shows the spatial-temporal [...] km² with a river length of 208 km [44]. The monthly average precipitation in RH1 (Figure 12(a)) is higher in March, April, July, and August and lower in September, October, January, and February, ranging from 24 mm to 99 mm. The RH2 region covers an area of 155,637 km², with a river length of 42 km [44]. The average monthly precipitation (Figure 12(b)) is distinct from RH1, with the highest precipitation between December and April and the lowest between May and November. The RH3 region covers an area of 337,763 32,013 km² with a river length of 1300 km [44]. The average monthly precipitation is highest between November and March and lowest from April to October (Figure 12(c)). Finally, the RH4 region covers an area of 111,804 km² [...] km [44]. The average monthly precipitation is similar to that of the RH3 region (Figure 12(d)).
[...] 1957, 1960, 1963, 1966, 1975, 1981, 1985, 1992, 1994, 1996, 2002, 2004 [...] 1955, 1970, 1980, 1983, 1992, 1993, 1999, 2000, 2001, 2002, 2003 [...] (Figure 13(b)) the negative anomalies are less variable, while larger positive anomalies occurred in 1957, 1960, 1980, 1981, 1985, 1989, 1992, 2002 [...] 1964, 1969, 1974, 1978, 1979, 1980, 1981, 1985 [...] 1953, 1954, 1956, 1963, 1963, 1970, 1976, 1977, 2012, 2014 [...] mm and 95 mm, respectively. All these negative rain anomalies occurred in the rainy season.
The cluster analysis used to identify the four homogeneous regions (Figure 11(b)) yielded eight common factors, shown in Table 2, that explained 91.2% of the total variance of the data, although the first four factors alone explained 87.4% of the variance of the monthly average precipitation of the 199 rain gauge time series. They were used to determine the homogeneous regions in the SFRB.
The spatial patterns associated with the first four factors of Table 2

Spectral Analysis of the SFRB
Spectral analyses between the monthly precipitation average anomalies of the entire SFRB and oceanic indices were obtained. Figure [...] years, and >16 years. The 1.5-year core at the end of the 1950s shows two signals: one in which the NOI leads the precipitation in opposite phase (negative NOI), and another in which the series move in opposite directions with a perfectly negative correlation (Figure 15(d)). The core centered between 3 and 7 years indicates that the precipitation anomaly leads the NOI, i.e., the precipitation anomaly occurs 3 months before the NOI. But at the beginning of the 1960s, the NOI led over the [...]
The remarkable core between the intraseasonal and four-year cycles in the early 1950s, in the annual cycle in Figure 19, shows the vector pointing to the 4th quadrant, which means that the AD (y) led the precipitation anomaly (x) with a negative correlation, and in phase up to the 3-year periodicity. Between 1955 and 1965, a high-power core lies between the annual and 4-year cycles, from the late 1950s to 1962. The series moved together between the 1-year and 2-year cycles, with sign changes around 1965 and the AD leading. As it approached the 2-year cycle, the precipitation anomaly was ahead of the AD. Between the 2-year and 4-year cycles, the series moved opposite to each other. Another significant core is observed around the 7-year cycle from 1970 to 1983, with a perfectly negative correlation, and the annual cycle tending to move together with the precipitation anomaly leading. The cores between 1990 and 2015, between the intraseasonal and 2-year cycles, show the precipitation leading, except between 2014 and 2015, when the AD led the precipitation anomaly. In the 2-year and 4-year cycles, the precipitation anomaly led from the 1990s to the 2000s. Figure 20 shows another remarkable dominance of the 11-year cycle as well

Conclusion
The statistical analyses of the SFRB were challenging due to its latitudinal extent and its geographic location under the influence of weather systems ranging from the convective scale (e.g., sea breeze) to the large scale (e.g., ITCZ), which result in different precipitation regimes at different spatial-temporal scales.