Precipitation Extremes Analysis over the Brazilian Northeast via Logistic Regression

This work diagnoses precipitation extremes over the Brazilian Northeast (NEB) using logistic regression, quantifying the association between precipitation extremes and meteorological variables through the odds ratio (OR). Daily data for ten meteorological variables were used for four NEB sub-regions: North (NNEB), East (ENEB), South (SNEB) and Semiarid (SANEB). The OR results show that outgoing longwave radiation (OLR) was the key variable for detecting precipitation extremes in three sub-regions: ENEB, with an OR of 2.91 (95% confidence interval (CI): 2.11, 4.02); NNEB, with 3.63 (95% CI: 1.93, 6.83); and SANEB, with 5.40 (95% CI: 3.04, 9.61). In SNEB the key variable was relative humidity (RH), with odds of a precipitation extreme 3.88 times higher (95% CI: 2.89, 5.20). Maximum temperature, the zonal wind component, evaporation, specific humidity and RH also influenced these extremes. Goodness-of-fit and ROC analyses indicated that all models fit well and had good predictive capability.


Introduction
The increase in extreme events over a short period of time has made society more vulnerable to the variability of weather and climate extremes, resulting in great socioeconomic losses [1]. These extremes are related to several environmental factors that favor an increase in their frequency and intensity: 1) relationships among ocean-atmosphere variables, such as air temperature [2], precipitation [3], wind speed [4] and sea surface temperature (SST) [5]; 2) regional micro-climate changes due to rapid urbanization without proper urban planning [6]; and 3) orographic effects [7]. When combined with atmospheric circulation or meteorological systems at several spatiotemporal scales [8], these factors can favor the occurrence of extremes. The aim of this paper is to diagnose precipitation extremes.
These extremes have motivated researchers to look for associations between precipitation extremes and environmental factors. For example, [9] investigated precipitation extremes under future scenarios and detected an interaction between temperature and water vapor that drives precipitation extremes, whereas in tropical regions these extremes are driven by specific humidity saturation at low levels. The relationship between temperature and specific humidity was also found by [10].
In the Brazilian Northeast, these extremes take the form of precipitation deficit (semiarid region) or excess (capitals or coastal regions), as found by several researchers [11][12][13]. These extreme events cause, directly or indirectly, great socioeconomic losses through flash floods or prolonged droughts [14].
Several statistical methods have been used to extract patterns from these extremes, among them generalized linear models (GLM) [15,16]. The GLM is a flexible generalization of ordinary linear regression that allows for response variables with distributions other than the normal. Furthermore, these models are effective and robust, which facilitates the analysis of precipitation extremes through logistic regression models. In climate science, logistic regression has been applied to precipitation occurrence or amount models [17,18] and to the verification of the forecast skill of climate models [19]. Thus, this approach via the odds ratio (OR) aims to answer the following questions:
1) Does SST determine the occurrence of precipitation extremes?
2) Which variable(s) favor(s) the occurrence of precipitation extremes?
3) What is the magnitude of these associations, via the OR, in the intensification of extremes?
4) Is the behavior of the OR similar in all NEB sub-regions?
To answer these questions, the goal of this article is to characterize precipitation extremes over the NEB with a logistic regression model, using ten meteorological variables for 1979-2011.
Cluster analysis using Euclidean distance and the Ward method was performed to characterize the new NEB precipitation pattern, resulting in four sub-regions: ENEB, NNEB, SNEB and semiarid (SANEB), the last being the new sub-region, as shown in Figure 1.

PCA for Atlantic and Pacific Regions
We also use daily SST data for two tropical regions: the Atlantic Ocean (ATL) (21˚S - 21˚N, 57˚W - 15˚E) and the Pacific Ocean (NINO), covering the El Niño 1+2, 3, 3.4 and 4 regions (5˚S - 5˚N, 90˚W - 160˚E, and 10˚S - 0˚, 80˚W - 90˚W), provided by the ERA-Interim reanalysis. SST at the daily timescale is little used because of its low variability, but it will be important for building the Poisson regression.
The SST data were prepared as explanatory variables in the following stages: 1) inclusion of lags for both basins, 30 days for the Atlantic and 90 days for the Pacific, owing to the ocean response time; 2) calculation of the anomalies for each SST region; and 3) Principal Component Analysis (PCA) to extract the main pattern of behavior.
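The three stages above can be illustrated with a minimal sketch, given here only under stated assumptions and not as the procedure actually used in this work: anomalies are taken as departures from the series mean (a daily climatology could be used instead), the leading principal component is obtained by power iteration on the covariance matrix, and the function names and SST values are hypothetical.

```python
import math

def lag_series(series, lag):
    """Return a copy where position t holds the value observed `lag` days earlier.

    Assumes 0 <= lag < len(series); the first `lag` positions have no value.
    """
    if lag == 0:
        return list(series)
    return [None] * lag + list(series[:-lag])

def anomalies(series):
    """Departures from the series mean (assumed definition of anomaly)."""
    mean = sum(series) / len(series)
    return [v - mean for v in series]

def leading_pc(columns, iterations=200):
    """Weights and scores of the first principal component of anomaly columns."""
    n, k = len(columns[0]), len(columns)
    # Sample covariance matrix between the k zero-mean columns.
    cov = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n - 1)
            for j in range(k)] for i in range(k)]
    # Power iteration; the start vector must not be orthogonal to the leading pattern.
    vec = [1.0 / (j + 1) for j in range(k)]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Project every day onto the leading pattern to obtain the PC time series.
    scores = [sum(vec[j] * columns[j][t] for j in range(k)) for t in range(n)]
    return vec, scores

# Hypothetical SST values (degrees Celsius) for two regions.
atl = [26.0, 26.5, 27.0, 27.5, 28.0]
pac = [24.0, 25.0, 26.0, 27.0, 28.0]
weights, scores = leading_pc([anomalies(atl), anomalies(pac)])
shifted = lag_series(atl, 2)
```

In practice the lagged series would be aligned with the precipitation record before the anomalies and components are computed, and the days without a lagged value would be discarded.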
The variables were categorized according to three criteria: 1) precipitation values above the 95th percentile (>95 p) were considered extreme; 2) OLR values below 240 W m−2 (OLR < 240) were considered to indicate convective clouds; and 3) for the other variables, the quartile with the highest number of occurrences was considered abnormal, as shown in Table 1.
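A minimal sketch of the first two criteria follows. The linear-interpolation percentile and the function names are assumptions made for illustration; the percentile definition used in this work is not specified.

```python
def percentile(values, q):
    """Percentile (q between 0 and 100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

def flag_extremes(precip, q=95.0):
    """Code each day as 1 when precipitation exceeds the q-th percentile, else 0."""
    threshold = percentile(precip, q)
    return [1 if p > threshold else 0 for p in precip], threshold

def flag_convection(olr, limit=240.0):
    """Code each day as 1 when OLR (W m-2) is below the convective limit, else 0."""
    return [1 if v < limit else 0 for v in olr]

# Hypothetical daily values: 100 precipitation amounts and four OLR readings.
extreme_flags, p95 = flag_extremes(list(range(1, 101)))
convective_flags = flag_convection([250.0, 239.9, 240.0, 180.0])
```

The resulting 0/1 series are the dichotomous variables that enter the logistic regression.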

Logistic Regression Models
After the PCA composition based on the SST regions, the cross-correlation function (CCF) was applied to identify the lags at which precipitation correlates with the other variables. The logistic regression model was then applied with two purposes: 1) given a set of independent variables, to estimate the probability of occurrence of a precipitation extreme; and 2) to assess the magnitude of the influence of each meteorological variable on precipitation extremes through the odds ratio (OR). In its standard form, the logistic regression is expressed as:

$$g(x) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

where g(x) is the logit associated with the precipitation extreme in dichotomous form (coded as 0 or 1), x_1, ..., x_k are the explanatory variables with coefficients β_0, ..., β_k, and p is the probability of occurrence of a precipitation extreme, given by:

$$p = \frac{e^{g(x)}}{1 + e^{g(x)}}$$
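As an illustration of how such a model can be estimated, the sketch below fits a logistic regression with an intercept and a single explanatory variable by Newton-Raphson maximum likelihood. It is a simplified stand-in, not the R fitting routine used in this work; the function name and the data are hypothetical.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson; x and y are equal-length lists."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood and entries of the information matrix.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: 100 days with the factor present (30 extremes)
# and 100 days with the factor absent (10 extremes).
x = [1] * 100 + [0] * 100
y = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90
b0, b1 = fit_logistic(x, y)
odds_ratio_from_model = math.exp(b1)
```

For a dichotomous explanatory variable, the exponential of the fitted slope equals the odds ratio of the corresponding 2 x 2 table, which is why the coefficients of the model can be read directly as ORs.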

Odds Ratio
The odds ratio (OR) was calculated for each variable to obtain the magnitude of the association between precipitation extremes and the meteorological variables. To calculate the OR, one first obtains the odds, the most important natural measure in logistic regression, which is the ratio of the probability that a precipitation extreme occurs to the probability that it does not occur. The OR is the ratio between two such odds, both of which are dimensionless and non-negative. If OR < 1, the observed variable is described as an exposure factor and does not favor precipitation extremes, whereas if OR > 1 it is described as a risk factor and favors precipitation extremes. Thus, the OR depends on four probabilities, as follows:

$$OR = \frac{\Pr(P = 1 \mid F = 1) \, / \, \Pr(P = 0 \mid F = 1)}{\Pr(P = 1 \mid F = 0) \, / \, \Pr(P = 0 \mid F = 0)}$$

in which F = 1 when the observed variable favors the occurrence of precipitation extremes and F = 0 otherwise, and P = 1 when a precipitation extreme occurs and P = 0 otherwise.
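A minimal sketch of the OR computed from a 2 x 2 table of counts is given below, together with a Wald-type 95% interval built on the logarithmic scale. The interval method is an assumption, since the method used for the intervals in this work is not stated, although the reported intervals are approximately symmetric around the OR on the log scale, as this type of interval is. The counts are hypothetical.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """OR and Wald interval from counts.

    a: F = 1 and P = 1    b: F = 1 and P = 0
    c: F = 0 and P = 1    d: F = 0 and P = 0
    """
    or_value = (a / b) / (c / d)
    # Standard error of log(OR) for a 2x2 table.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log)
    upper = math.exp(math.log(or_value) + z * se_log)
    return or_value, lower, upper

# Hypothetical counts of days in each cell of the table.
or_value, ci_lower, ci_upper = odds_ratio(30, 70, 10, 90)
```

An interval that excludes 1 indicates that the association between the factor and the precipitation extreme is statistically distinguishable from no association at the chosen level.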

Goodness-of-Fit and ROC Curve
For the goodness-of-fit (GOF) analysis [30], three measures were used: deviance residual, AIC and p-value. The deviance residual is a quality-of-fit statistic based on maximum likelihood that plays the role of the sum of squared residuals in ordinary least squares. The Akaike Information Criterion (AIC) is a quality-of-fit measure used to select, from a set of candidate variables, those that optimize the performance of the model, the preferred model being the one with the minimum AIC value. The p-value is another measure, which verifies whether the variables contained in the model are statistically significant; generally, p < 0.005 is used to reject the null hypothesis. To assess the accuracy of the model, the receiver operating characteristic (ROC) graph was used.
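The two likelihood-based measures can be sketched as follows for a model with a binary response, assuming that the deviance is the residual deviance (minus twice the log-likelihood for ungrouped 0/1 data) and that AIC is the deviance plus twice the number of estimated parameters. The function names and the data are hypothetical.

```python
import math

def binomial_deviance(y, p):
    """Residual deviance for 0/1 outcomes y and fitted probabilities p."""
    log_lik = sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
                  for yi, pi in zip(y, p))
    return -2.0 * log_lik

def aic(deviance, n_params):
    """AIC for ungrouped binary data: deviance plus a penalty of 2 per parameter."""
    return deviance + 2 * n_params

# Hypothetical outcomes and fitted probabilities from a two-parameter model.
y_obs = [1, 0, 1, 0]
p_fit = [0.8, 0.2, 0.6, 0.4]
dev = binomial_deviance(y_obs, p_fit)
model_aic = aic(dev, n_params=2)
```

Under these assumptions, an AIC of 1386.1 with a deviance of 1366.1, the values reported for SANEB in the results, would correspond to ten estimated parameters.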
The ROC curve is a technique for visualizing, organizing and selecting classifiers; it thus evaluates the quality or performance of diagnostic tests [31,32].
In general, ROC analysis assesses the quality of the model by counting the occurrence or non-occurrence of precipitation extremes against the presence or absence of the exposure factor. The measure commonly used is the area under the curve (AUC), interpreted as the average sensitivity over all values of specificity. It evaluates the overall performance of a diagnostic test and ranges between 0 and 1, with larger values indicating better overall performance [31]. All computations were performed in R [33] with the packages MASS [34], ROCR [35] and Epi [36].
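The AUC can also be read as the probability that a randomly chosen extreme day receives a higher fitted probability than a randomly chosen non-extreme day. The sketch below computes it by direct pairwise comparison. This is an illustrative equivalent, not the ROCR computation used in this work; the function name and the data are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as one half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical observed extremes (1) and fitted probabilities for four days.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to a model with no discriminating ability and a value of 1 to perfect separation of extreme and non-extreme days.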

Results and Discussions
The logistic regression results, ORs and goodness-of-fit measures (deviance residual, AIC, p-value) are shown in Tables 2-5 for the four NEB sub-regions (NNEB, ENEB, SANEB and SNEB), which are shown in Figure 2. From the 5800 daily precipitation values, those above the 95th percentile (>95 p) were considered, corresponding to 290 - 295 precipitation extreme cases.
For goodness-of-fit, the SANEB sub-region (Table 3) obtained the best values, with an AIC of 1386.1 and a deviance residual of 1366.1.
The ROC curve analysis in Figure 3 shows that all models had AUC values above 0.80, which indicates that all models have good predictive ability; the SANEB sub-region (Figure 3) again stands out, with the best AUC value of 0.935. Regarding the associations measured by the odds ratio, in NNEB (northern Maranhão) there is evidence that the variables contributing to precipitation extremes are EV, OLR, TX.lag1 and OLR.lag1; in ENEB (northeastern Alagoas) they are RH, OLR, TX and TX.lag1; in SANEB (southwestern Bahia) they are OLR, TX.lag1 and OLR.lag1; and in SNEB (southeastern Bahia) they are RH, OLR, SHUM and CompU.lag1.
The OR results corroborate [37], whose authors detected the relationship between OLR and extreme precipitation in the tropics using climate indices (rain > 10 mm and OLR < 180 W m−2); these indices indicated an association that favors the formation of convection through low-level moisture convergence, causing more intense precipitation, as suggested by [38]. Precipitation extreme events are quite distinct in each NEB sub-region and are favored by several meteorological variables associated with meteorological systems, corroborating [39], which describes that precipitation intensity is influenced not by temperature alone but by a combination of different meteorological systems.
Analyzing the sub-regions separately, precipitation extremes in NNEB (Table 2) are linked to the north-south displacement of the ITCZ, which transports heat and moisture into the region, supported by TX and EV combined. In SANEB (Table 3), the region of scarce precipitation in the NEB, TX.lag1, CompU, OLR and OLR.lag1 favor precipitation extremes, boosted by temperature and moisture advection associated with easterly flow and by the formation of deep convection caused by frontal systems [20] or by the northward displacement of the SACZ axis [25] penetrating into the southern NEB.
In ENEB (Table 4), these extremes are strengthened by RH, TX and TX.lag1 combined with CompU.lag1. According to [13], intense precipitation is favored by the transport of heat and moisture by easterly wave disturbances, which boost the formation of MCS and MCC over the region; these systems are largely responsible for maintaining the precipitation regime, contributing 50% - 70% of the annual total. In SNEB (Table 5), TX.lag1, U.lag1 and OLR influenced precipitation extremes arising from different meteorological systems: from the east, breeze systems and easterly wave disturbances; and from the south, the SACZ and frontal systems.

Conclusion
These initial conclusions show that Atlantic and Pacific SSTs at the daily timescale do not significantly favor precipitation extremes. OLR is a key variable in the detection of extreme precipitation. Future work will evaluate the occurrence of extreme precipitation in Northeast Brazil through relative risk via Poisson regression, seeking to detect its behavior as a count process.

Figure 2. Meteorological variables that influenced NEB precipitation extremes according to the logistic regression models.