Calculation of Compound Hazard Probability Using Convolution of Distributions ()
1. Introduction
The hazard produced by natural phenomena on infrastructure and urban populations has been widely studied for the last 50 years. Researchers have recognised that the real danger posed by these phenomena depends on their extreme values [1]-[6]. The study of extremes of natural phenomena was boosted with the development of the Statistics of Extreme Values in the 1950s. This branch of mathematics allows the analyst to study extremes of natural phenomena in a systematic way [7]-[9].
Most researchers focus on the extremes of natural phenomena considered in isolation, one variable at a time. However, what is relevant in hazard studies is the coincidence of extremes of several variables, i.e., the presence of compound extremes. In general, compound extremes of climatic variables are defined as concurrent or coincident extremes of multiple variables in a given region at a given time. Compound extremes may exacerbate an already dangerous situation, leading to a more significant impact on humans and the built environment than extremes of single variables [10]-[13]. Often, the combination of extremes increases the hazard; [14] and [15] show that if severe winds destroy the roofs of buildings, the concurrent heavy rain causes more damage. In some situations, the value of a single variable is not dangerous in itself, but the combination of several climatic events could give rise to an extreme event. [16] show that traditional univariate risk assessment substantially underestimates the risk of extreme events such as the 2014 California drought because of ignoring the effect of temperature. [17] show that the drought that occurred in Europe in 2003 was not the most severe in the continent; however, in combination with extended heatwaves and wildfires, it is considered the most fatal, with more than 70,000 people dying as a result of these extreme conditions.
In other cases, the problem itself depends on several extremes of climatic variables, such as the study of coastal erosion due to sea storms, which depend on wave height, wave period, storm duration, wave direction, water level and storm-interarrival time [18]. Wildfires, a severe problem in south-eastern Australian states, depend on high-temperature values, wind speed and drought; the latter is defined as extremes of low precipitation. If, together with these climatic variables, there is also a lot of fuel in a given region (in the form of dry grass), there is a high probability of a wildfire developing in the region [19]-[21].
Failure to model compound events correctly results in an inaccurate assessment of the hazard, with severe consequences for the infrastructure. [22] show that considering the compound effect of sea-level rise, groundwater inundation, and precipitation flooding, the drainage capacity of a coastal watershed in California (USA) will be breached, and a large region will be flooded, whilst the univariate model fails to recognise the danger. Similar results are reported by [23].
One limitation of most compound hazard models is that they do not assess the probability of concurrent extremes occurring. In some cases, this probability is too low, and the possible impact of these phenomena can be neglected with savings for the planning authorities. In other cases, the likelihood of concurrent events happening is significant, especially considering future climate conditions and must be included in the model [24]-[26].
What is essential in all these cases is to calculate the probability that two or more extreme values of climatic variables, not necessarily their peak values, occur concurrently, [27] studied the impact of typhoons, rainfall, and storm surges, three correlated variables produced by the landing of a cyclone. Floods quickly occur in coastal cities when cyclones land in these regions, often bringing heavy rainfall and storm surges. They model the phenomenon using a trivariate joint probability distribution of the three elements: typhoon landing, rain and storm surge. The joint probability distribution can quantify the probability that extremes of multiple variables arise simultaneously.
Modelling multivariate extremes is more complicated than the single-variate case. In particular, in the former, it is necessary to consider the correlation between variables; in most cases, the assumption of independence that underpins the single variate case is no longer applicable [15] [28]-[30].
In general, there are two methodologies to model multivariate extremes; one is by using a multivariate extreme value distribution [1] [31]-[34], and the other is by performing the convolution of the individual random variables to obtain the distribution of the combined variables. The latter case can be greatly simplified because the combined distribution can be analysed using the tools of the univariate case [35].
This study uses the latter methodology: it convolves the individual random variables to obtain the combined random variable. The convolution is performed using the method of cumulants [36] [37]. The tail of the distribution of the combined random variable determines the extreme behaviour of the combined distributions, and this tail can be modelled using the Statistics of Extreme Values. In particular, we present a flexible, efficient methodology to calculate the probability that compound extremes of “n” random variables of natural phenomena occur, considering the possible correlation between these random variables.
This probability can be transformed into an Average Recurrence Interval (ARI), the indicator of importance in hazard and risk studies [1] [28] [38]. The ARI termed Return Period (RP) in engineering studies allows the planner to calculate the magnitude of extreme values of natural phenomena and the frequency with which these values repeat.
The paper’s layout is as follows: Section 2 presents the mathematical basis of convolution. Section 3, discusses the method of cumulants to calculate the curves of RP of compound events using the convolution of distributions. Section 4 presents an example problem to illustrate the methodology, and Section 5 analyses the results.
2. Mathematical Convolution
The combination of the distributions of several random variables (rvs) is termed a mathematical convolution. In the general case, the operation gives rise to complex integrals that are difficult to solve analytically. In practice, the solution is achieved using transformations [39] [40] or numerical methods [37] [41]. To illustrate the process, consider two simple examples: In the first one, we will convolve the discrete distributions of two independent random variables X1 and X2. In the second example, we convolve two independent Normal distributions
and
. The convolution of two independent Normal distributions is an easy operation because Normal distributions have the reproductive property, i.e. the combination of two Normals produces another Normal with parameters (m1 + m2) and (
).
2.1. Convolution of Discrete Distributions
For the first case, consider the distributions of wind speed in m/s recorded at two independent weather stations. The discrete distributions of the rvs, called X1 and X2, are shown in Figure 1. The distribution of X1 has only two impulses (0.3, 0.7) located at (3, 5). The distribution of X2 has 3 impulses (0.2, 0.6, 0.2) located at (3, 6, 8). The convolution of both rvs, X1 and X2, is defined as
(1)
and is performed numerically by adding together the x-axis and multiplying the corresponding probabilities, as shown in Figure 1 (bottom). The convolution, in this case, implies shifting the distribution to the right and reducing the size of the impulses. The realisation of events X1 = 5 m/s and X2 = 6 m/s occurring concurrently is given by
(2)
to stress the point, Figure 1 shows the x-axis of the individual rvs below the main x-axis. Discrete distributions have been used to model landslide hazard [42].
Figure 1. Convolution of two discrete distributions.
2.2. Convolution of Continuous Distributions
The independent Normal distributions are shown in Figure 2. The parameter of Normal 1, N1 (blue curve), are (2,1), i.e. mean (m1) = 2 m/s, and Variance (Var) = 1 (m/s)2. The parameters of Normal 2, N2 (red curve), are (3.5, 0.25). The corresponding convolution is a Normal with parameters (5.5, 1.25). Again, observe that the net effect is a shifting of the resulting distribution (black curve) to the right and a reduction of the body of the distribution (because of the combination of probabilities). In this case, showing the combination of the individual x-axis is more challenging because the distributions are continuous. Still, as in Figure 1, we show a few points of the original distributions below the main x-axis.
The equivalent to Equation (2) in continuous distributions requires the cumulative distribution function (CDF) shown in Figure 3. In this case, the probability of events N1 < 3 m/s and N2 < 4 m/s occurring simultaneously is given by
(3)
[41] and [37] present more examples of convolutions.
In risk analysis, we are more interested in the complementary event
.
Figure 2. Convolution of continuous distributions.
Figure 3. Distribution Function of rv Z.
If z1 is large, we are selecting the extreme values of Z, our compound variable. The value
represents the probability that the concurrent extremes of N1 and N2 are exceeded. The hazard and risk studies of natural phenomena aim to calculate the frequency with which extreme values repeat. As mentioned before, the main indicator in these types of studies is the Return Period (RP). A curve of RP shows the recurrence of extreme values, usually in years, at a given region. These extreme values are termed Return Levels. Thus, if the 50-year RP of wind speed at a given region is 30 m/s, what we are saying is that 30 m/s is expected to be exceeded, on average, once every 50 years.
The curve of RP can be built from Figure 3 ([1] [2]), using the expression,
(4)
where RP is the Return Period corresponding to the z1 Return Level. The constant “nopy” sets up the right scale for the curve and depends on the physical characteristics of the problem.
Figure 4. Fitting the GPD to the tail of the Compound distribution.
The major limitation of this technique is that it allows us to calculate RPs for the same period of the given data, usually 20 or 30 years. In practical problems we are interested in calculating RP for values well beyond the available data. The Australian/NZ regulations [43], for instance, prescribe wind loads for design of structures corresponding to return periods of 500 years. In this case we should use extreme value distributions to extrapolate RPs to 50, 100, 500 years and beyond. The technique consists in fitting an extreme value distribution to the tail of the distribution of the compound variable (black curve in Figure 2). There are a number of families of extreme value distributions available to do this, in our example we will use the Generalised Pareto distribution (GPD), and we will fit the GPD using the method of moments [44] [45]. Figure 4 shows the fitting of the GPD to the tail of the compound extremes (green curve). From the parameters of the GPD we can calculate the parameters of the RP curve, as will be explained in Section 3.2.
More information has to be provided to calculate the curve of RP. Suppose that the Normal distributions of Figure 2 refer to the maximum daily wind speed (m/s) recorded at two independent stations in Australia. The records cover a period of 20 years for each station; hence the convolution of the two distributions refers to a period of 40 years. Furthermore, suppose that the maximum value found in the records for each Normal distribution is 5.5 and 6.5 m/s, respectively. The maximum of the combined distributions will be 12 m/s. Hence, we can say that the 40-year RP of max daily wind speed at the region is 12 m/s, the correct scale to use in Equation (4) is nopy = 9.44. Figure 5 shows the corresponding curve of RP. From this curve, we can say that the combined events, wind speed at station 1 = 5.5 m/s and wind speed at station 2 = 6.5 m/s occur, on average, once every 50 years, as shown by the scales to the left of the y-axis.
Note that this is just an example; in practice, it is inappropriate to fit Normally-distributed data with extreme value distributions [2].
3. Methodology to Calculate Curves of RP of Compound Events
This paper aims to present an efficient methodology for calculating curves of RP of compound events. This methodology is based on the method of cumulants to convolve any number of independent rvs. The technique can easily be extended to the case of non-independent rvs, as we will see later.
3.1. Convolution Using the Method of Cumulants
In this method, the rvs are replaced by a set of parameters called cumulants. These cumulants are linear combinations of the moments of the distribution. The convolution consists on adding together the cumulants of the components to obtain the cumulants of the resulting distribution. From these cumulants the distribution of the resulting rv can be calculated [37]. Consider the linear equation,
(5)
where,
a, b, c = constants.
X, Y = Independent random variables.
Z = Resulting random variable.
Let mx and my be the mean of X and Y; and let κr(x) and κr(y) be the cumulants of X and Y respectively. The subscript “r” indicates the order of the cumulant.
In terms of moments and cumulants Equation (5) can be expressed as,
(6)
for cumulants of order r > 1 (7)
Note that the operation does not involve any approximation. Equation (6) and (7) can be collapsed into one [37].
One of the strengths of the method of cumulants for convolution of rvs is that the process can easily be extended to the case of non-independent rvs, i.e. for the case in which there is some degree of covariance between the rvs. The covariance indicates whether both variables follow the same pattern of change. If X tends to increase or decrease along with Y, the covariance will be large and positive. If, on the other hand, X tends to increase as Y decreases their covariance would be a large negative number.
Let us suppose that the covariance between the rvs X and Y is σxy, then the second cumulant or Variance of rv Z has to be modified to take this covariance into account. Equation (7), then, becomes,
(8)
3.2. Curves of RP
There are basically two techniques to calculate the RP. One is the so-called “block maxima” in which an extreme value distribution is fitted to the annual maxima. The distribution used in this case is the Generalised Extreme Value Distribution (GEV). If more observations are available, for instance, maximum daily observations, “block maxima” is not recommended because it uses only one observation per year; the other observations are not used, which may result in a poor fitting of the data. The second technique is the “peaks-over-threshold”. This technique utilises all values over a given threshold to fit the distribution. In this case, the Generalised Pareto Distribution (GPD) is used. This methodology has several advantages over the “block maxima” method; firstly, it uses a lot more data to fit the distribution, and secondly, by setting the threshold high enough, the data will be better distributed in time, improving the likelihood that the data samples are independent of each other (one of the conditions of EV applications).
The latter is the method recommended when maximum daily observations are available [1] [46]. One of the problems in fitting the GPD is the selection of the appropriate threshold. Return period calculations using GPD distributions are very sensitive to the threshold selection.
In this study, we use the latter technique and calculate the appropriate threshold u using the automatic procedure developed by Sanabria and Cechet [47].
The parameters of the GPD distribution, location μz, scale σ and shape ξ can be calculated from the moment mz and the second cumulant of Z (κ2) using the expressions [44] [45],
(9)
(10)
(11)
From these parameters the curve of RP follows ([1]), see Figure 5,
(12)
where,
RLj is the Return Level corresponding to a given Return Period, RPj, u is the threshold, nopy is the scale factor and crate is the cross rate or proportion of excesses over the threshold. The y-axis in Figure 5 is wind speed in m/s.
Figure 5. RP of the Compound distribution (Resulting rv Z).
4. Example
As an illustrative example, consider the well-known problem of a landing cyclone near a low-gradient coastal city located on the estuary of a river. In this case, the city is vulnerable to flooding hazards from intense rainfall and the cyclone’s coastal storm surge. From a meteorological point of view, at least five drivers impact the hazard: Precipitation, wind speed, sea-level rise, astronomic tide and central pressure [22] [27] [48]-[50].
Wind speed can produce a storm surge, which, combined with sea level, can substantially flood the region with ocean waters, especially at high tide, more so if sea level rise due to climate change is considered. Furthermore, the river discharge in the sea, which depends on precipitation, can exacerbate the flood hazard in the region. The five drivers referred to above can be collapsed into three climatic variables: Wind speed, which drives the storm surge. Precipitation may flood the area, depending on its magnitude and topography. However, the flooding will likely be due to the heavy river discharge into the sea. The compound effect of the river discharge and storm surge increases the hazard. Finally, the sea level; if the cyclone landing occurs at high tide or the region is already affected by sea level rise due to climate change, the likelihood of the flood increases. The cyclone’s central pressure can aggravate the situation. We will use sea level as a proxy for all these complex interactions to simplify the example.
What is essential in this case is to assess the probability that the extremes of the three phenomena, wind, precipitation and sea level, occur concurrently. Very seldom do the peak values of these events coincide; however, the combination of several off-peak extremes can build up to produce a hazardous situation [17]. What is necessary is, then, to calculate the probabilistic distribution of the combined extremes.
Data Description
We acquired data from the Australian Bureau of Meteorology [51] for this illustrative example. For wind, we have maximum daily wind gust in m/s. For precipitation, we have cumulative daily values in mm. In both cases, the data covers the period 2006 to 2019 (14 years). For sea level, we have 40 years of maximum monthly sea level in m (1981-2020). The recorded data stations are located in the same region (greater Sydney). Referring to the three phenomena to a common axis for the cumulant-based convolution process is convenient. This can be achieved by expressing the variables per unit (pu), and dividing each one by the maximum value found in the respective period. Table 1 shows the maximum value of the three phenomena. Figure 6 and Figure 7 show the histogram and density function of the wind speed and precipitation distributions, respectively. For the latter, we have selected values > 4 mm. The histogram and density function of sea level is presented in Figure 8. The distributions are assumed to remain static for the study period.
Table 1. Maximum value of the variables used in the example.
Wind speed (m/s) |
Precipitation (mm) |
Sea level (m) |
33.4 |
106.8 |
2.3 |
5. Results
The first step in the calculation of the distribution of the convolution of the three rvs shown in Figures 6-8, is to calculate the first moment and the second cumulant of each distribution as explained in Section 3.1. These parameters are presented in Table 2. The next step is the calculation of the parameters of the extreme value distribution that best fits the tail of convolution distribution as explained in Section 3.2 see Figure 4. These parameters are calculated using expressions 8 - 10 and are presented in Table 3. The final step is the calculation of the corresponding curve of RP using Equation (12) as shown in Figure 9.
Figure 6. Histogram and density function of wind speed in pu.
Figure 7. Histogram and density function of precipitation in pu.
Figure 8. Histogram and density function of sea level in pu.
Figure 9. RP of the Compound distribution in pu (all 3 rvs).
Table 2. Parameters of each variable in pu.
Variable |
mean (m1) |
2nd cumulant (κ2) |
threshold (u) |
wind speed |
0.394 |
0.0139 |
0.475 |
precipitation |
0.270 |
0.0277 |
0.300 |
sea level |
0.845 |
0.0028 |
0.903 |
variable Z (convolution) |
1.509 |
0.0445 |
1.678 |
Table 3. Parameters of the GPD which best fit the convolution distribution in pu.
mean |
location |
scale |
shape |
3.02 |
1.51 |
39.43 |
25.12 |
Table 4. Covariance between random variables (in pu).
Variable |
Wind speed |
Precipitation |
Sea level |
wind speed |
- |
0.000368 |
−0.000165 |
precipitation |
0.000368 |
- |
0.000468 |
sea level |
−0.000165 |
0.000468 |
- |
Figure 9 shows that the compound event wind = 1.05, precipitation = 1.1 and sea level = 1 (first tick mark in the left axes) or in actual units: wind speed = 35.7 m/s, precipitation = 117.5 mm and sea level = 2.32 m can occur, on average, every 9 years. Now, suppose that we know that the compound event wind speed = 36.4 m/s, precipitation = 145.25 mm and sea level = 2.71 m (or in pu: wind = 1.09, precipitation = 1.36 and sea level = 1.17, last tick mark in the left axes) produces a heavy flooding of the city. From the RP curve we can see that the event has a 500-year RP, that is, the city is subjected to these types of flooding, on average, every 500 years; in other words, the annual probability of flooding is very low: 0.002.
Non-Independent rvs
As explained in Section 3.1, the method of cumulants for convolution of rvs can be extended to non-independent rvs by modifying the calculation of the second cumulant. Table 4 shows the covariance of the three rvs used in this example. Compared with the second cumulant (in pu) shown in Table 2, we can see that the covariance is relatively large, and hence, it must be considered for a proper modelling of the problem. The table shows that the covariance is strongest between sea level and precipitation. The positive value shows that the sea level tends to rise with precipitation while decreasing with wind speed.
To compare the curves of RP considering independent and non-independent rvs, both curves have been plotted in a single graph shown in Figure 10. This figure shows that the curve of RP considering non-independence between rvs (red curve) is lower than the one considering independence. For the 100-year RP, the return level for the former is about 4.75% less than that of the latter. For larger RPs the difference tends to increase. Consider again the compound event wind speed = 36.4 m/s, precipitation = 145.25 mm and sea level = 2.71 m. It used to have a 500-year RP; now it has a 1000-year RP. In other words, considering the covariance between rvs, the event’s recurrence happens over a larger interval. This shows the importance of modelling the problem correctly. For mitigation purposes, the overestimation of the RP could result in a substantial increase in the cost.
Figure 10. Curve of RP of both independent and non-independent rvs.
6. Conclusions
In this paper, we present a new methodology to assess the probability that extremes of natural phenomena occur concurrently. In the last few years, researchers have recognised that the real danger natural phenomena pose to people and the infrastructure is the extremes of several climatic variables coinciding. In most cases, the peak values of these variables are not coincident; still, other values located in the tail of the distributions can co-occur, leading to a compound extreme, which can result in a catastrophic event. The problem is complicated because the correlation between the climatic variables strongly influences the probability. From the probability of compound events occurring, it is possible to calculate the Return Period, the indicator used to assess hazard and risk in natural phenomena.
Our methodology allows the analyst to deal with any number of correlated climatic variables in a computationally efficient and mathematically robust way. This facilitates the study of the impact of climate change on natural disasters.
Data and Software
The data used in the example of Section 4 was provided by the Australian Bureau of Meteorology [51]. (Dataset “DC02D_Data_066037_48502949795856.txt”). The tide data was downloaded from “bom.gov.au/australia/tides/”. The data processing and figure generation were carried out using the R environment for statistical computing [52].