Non-Stationary Random Process for Large-Scale Failure and Recovery of Power Distributions

A key objective of the smart grid is to improve the reliability of utility services to end users. This requires strengthening the resilience of distribution networks that lie at the edge of the grid. However, distribution networks are exposed to external disturbances such as hurricanes and snow storms, where electricity service to customers is disrupted repeatedly. External disturbances cause large-scale power failures that are neither well understood, nor formulated rigorously, nor studied systematically. This work studies the resilience of power distribution networks to large-scale disturbances in three aspects. First, a non-stationary random process is derived to characterize an entire life cycle of large-scale failure and recovery. Second, resilience is defined based on the non-stationary random process. Closed-form analytical expressions are derived under specific large-scale failure scenarios. Third, the non-stationary model and the resilience metric are applied to a real-life example of large-scale disruptions due to Hurricane Ike. Real data on large-scale failures from an operational network is used to learn time-varying model parameters and resilience metrics.


I. INTRODUCTION
Our power grid is a vast interconnected network that delivers electricity from suppliers to consumers. At the edge of the grid lie power distribution networks [1]. Power distribution networks provide medium or low voltages to residences and organizations, and thus serve the unique role of connecting the grid to end users.
Distribution networks consist of the "leaf nodes" of the energy infrastructure and are thus susceptible to external disturbances. For example, natural disasters repeatedly cause devastating destruction and service disruptions to distribution networks [2] [3]. Six major hurricanes and more than ten snow and ice storms occurred in North America in the past five years [4]. Each natural disaster disrupted power services to more than 500,000 customers for days [4]. Large-scale power outages are even more prevalent and damaging in developing countries, whose energy demands are rapidly increasing while their power infrastructures are still in development [5].
A fundamental research problem underlying this real-world issue is the resilience of power distribution to large-scale external disruptions. Resilience corresponds to the ability of distribution networks to withstand external disturbances and to recover rapidly from failures. To date, tremendous attention has been directed to the resilience of the core, which consists of major power generation and transmission systems of high voltages [6] [7] [8]. While an external disturbance often affects power distribution networks across a wide geographical span, the service disruptions usually remain local at the edge [9]. The resilience of the edge, however, is understudied [10]. As the demand for energy grows, the edge of the grid is becoming thicker. For example, a utility provider often serves millions of customers in America. Damage from external disturbances on power distribution can thus have a profound impact on a large number of users. Hence, the resilience of power distribution networks urgently needs to be investigated.
Empirical studies have been conducted on how to assess damage from large-scale power outages (see [11] and references therein). Monitoring systems have been developed and used in the power industry to respond to failures due to natural disasters (see [12] as an example). These empirical methods predict the degree of damage, e.g., the maximum number of outages upon a natural disaster [2], often through observations and experience. Quantifiable approaches are lacking and needed to elevate practical experience to general principles on the resilience of power distribution [5] [13] [10]. In other words, resilience, as a property of an overall power distribution network, needs to be learned systematically from real data.
Two research challenges emerge. The first is what to learn from real data. Real data consists of samples of a distribution network's responses to external disturbances. External disturbances such as hurricanes often occur suddenly and unpredictably. The resulting large-scale failure and recovery at distribution networks thus exhibit random and dynamic behavior. For example, recovery depends on multiple random factors such as environmental conditions, available resources, and preparation. Prior work models network failures and recoveries using finite-state Markov processes [14]. These models belong to the general class of birth-and-death processes [15], which include randomness but assume stationary failure and recovery. Non-stationary failure, such as a non-homogeneous Poisson point process, is considered in the context of M_t/G/∞ queues [16] [17] with time-varying Poisson parameters. General time-dependent infinite-server arrival/service processes are studied in [18] [19]. However, non-stationary recovery is rarely included in network resilience. In fact, it is an open issue how to characterize large-scale non-stationary failure and non-stationary recovery for power distribution networks.
The second challenge is how to define and characterize resilience over an entire non-stationary life cycle of failure and recovery. Prior work offers general discussions that resilience should include multiple attributes, such as both failure and recovery [20] [21] [22]. However, most prior works mainly study resilience in terms of maintaining services, and thus consider failures only [23]. In fact, recovery is rarely included in defining the resilience of power distribution networks.
It is an open issue how to derive resilience metrics for characterizing non-stationarity in both failure and recovery.
Hence it is necessary to formulate, from the ground up, an entire life cycle of large-scale failure and recovery. This prevents us from choosing a model and/or a learning approach subjectively. We first develop a formulation of large-scale failures at the finest level of network nodes, based on temporal-spatial stochastic processes. However, as an individual external disturbance results in only one "snapshot" of network responses, information (data) from one snapshot is insufficient to completely specify such a temporal-spatial model [24]. We thus derive temporal models of an entire distribution network by aggregating the spatial variables. The resulting temporal process models an entire non-stationary life cycle of large-scale failures. The model applies to general failure processes that can be dependent and have arbitrary distributions. Two distinct recovery characteristics emerge from our model. One is infant recovery, which reflects the ability to recover rapidly from failures. The other is aging recovery, which corresponds to prolonged failures. We define the resilience of a power distribution network as the probability of infant recovery. We then derive analytical expressions for special cases of failure and recovery.
The model and the resilience metric are studied in a real-life example of large-scale service disruptions of power distribution. The power failures occurred during a major natural disaster, Hurricane Ike, in 2008. Pertinent resilience parameters are learned using real data from an operational distribution network.
The rest of the paper is organized as follows. Section II provides background knowledge and an example of large-scale failures at power distribution networks. Section III develops a problem formulation of failure-and-recovery processes. Section IV characterizes non-stationary failure and recovery individually and jointly, then derives analytical expressions for special cases of failure and recovery. Section V defines network resilience based on the non-stationary model. Section VI learns the pertinent resilience parameters using large-scale real data. Section VII discusses our findings and concludes the paper.

II. BACKGROUND AND EXAMPLE
A power distribution network lies at the edge of the grid, from substations to users. It consists of components including substations, feeders, transformers, poles, transmission lines, and meters.
A large number of such devices in a distribution network are in the open, and are thus susceptible to natural disasters such as hurricanes and ice and snow storms. For example, a fallen pole can cause a short circuit, which can in turn cause other devices to fail. An external disturbance such as a hurricane can therefore cause a large number of failures. The failures interrupt electricity service to end users. A large-scale external disruption can affect one or many distribution networks in a wide geographical area. Failure recovery is often performed by dispatching crews to the field. The promptness of failure recovery thus depends on environmental constraints, preparedness, and resources.
To gain intuition about an entire life cycle of failure and recovery, we consider an example of large-scale power failures that occurred during Hurricane Ike. Hurricane Ike made landfall in 2008 and affected densely populated areas in Texas and Louisiana. Figure 1 shows a histogram of failure occurrence times and durations at an operational distribution network before, during, and after the hurricane. The example provides the following observations: (a) Both failure occurrence and recovery are time-varying, i.e., non-stationary. (b) Samples of failure occurrence time and duration are not identically distributed. Instead, the recovery time, characterized by the failure duration, differs for failures occurring at different times. This shows that failure occurrence and recovery are statistically dependent and non-stationary.

III. STOCHASTIC MODEL
We formulate large-scale failure and recovery based on non-stationary random processes. We begin with detailed information on nodal statuses in a distribution network. We then aggregate the spatial variables of the nodes to obtain the temporal evolution of failure and recovery of an entire network.

A. Failure and Recovery Probability
A temporal-spatial random process provides a theoretical basis for modeling large-scale failures at the finest scale of nodes. The temporal variable is time t. The spatial variable is the location of a node in the grid. Here, nodes can be any components in a distribution network such as substations, feeders, hubs, transformers, transmission lines, and power circuits. A shorthand notation i is used to specify both the index of node i and its corresponding location, where i ∈ S = {1, 2, ..., n} for a power distribution network with n nodes.
Let X_i(t) be the status of the i-th node at time t > 0 for 1 ≤ i ≤ n. X_i(t) = 1 if the i-th node is in a failure mode; X_i(t) = 0 if the node is in normal operation. For a distribution network, examples of a node failure include a failed circuit, a fallen pole, a broken link, and a non-operational substation.
Failures caused by external disturbances exhibit randomness. Whether and when a node fails is random. Whether and when a failed node recovers is also random. Hence, random processes can be used to characterize failure and recovery for all nodes in a network.
Given time t > 0, P{X_i(t + τ) = 1} is the probability that node i is in failure in the near future t + τ, where τ > 0 is a small time increment. Assume a node can change state, i.e., from failure to normal operation and vice versa. Then for the i-th node, 1 ≤ i ≤ n, the conditional probability P{X_i(t + τ) = 1 | X_1(t), ..., X_n(t)} (Eq. 1) provides a model for an individual node in a network: the conditioning on the current network state captures Markov temporal dependence as well as spatial dependence among nodal statuses. Such a model can be applied to a heterogeneous grid where nodes in general experience different failure and recovery processes. There are n such temporal-spatial equations for the n nodes in a distribution network; together they form a temporal-spatial model for the network.

B. Temporal Process
When large-scale outages are caused by individual external disturbances, the available information comes from "snapshots" of temporal-spatial network statuses. A snapshot corresponds to the spatial-temporal nodal statuses for one external disturbance, e.g., one hurricane. As there are usually only a few such snapshots available, the information is insufficient to specify a complete temporal-spatial model at the node level. However, the spatial variables can be aggregated out of Eq. 1, leaving a quantity for which the information is sufficient, namely the aggregated failure probability Σ_{i=1}^n P{X_i(t) = 1} (Eq. 2). Furthermore, each probability can be related to an indicator function, P{X_i(t) = 1} = E{I(X_i(t) = 1)} (Eq. 3). Then we can define a temporal process as follows.
Definition: A temporal process {N(t) ∈ N, t > 0} is a special case of the temporal-spatial process where the spatial variables (the i's) are aggregated over all nodes in a network: N(t) = Σ_{i=1}^n I(X_i(t) = 1) (Eq. 4) is the number of nodes in the failure state at time t, where I(A) is an indicator function, i.e., I(A) = 1 if event A occurs and I(A) = 0 otherwise.
Combining Equations 2, 3, and 4, we have E{N(t)} = Σ_{i=1}^n P{X_i(t) = 1}, and hence E{∆N(t)} = Σ_{i=1}^n ∆P{X_i(t) = 1}, where ∆N(t) = N(t + τ) − N(t) and ∆P{X_i(t) = 1} = P{X_i(t + τ) = 1} − P{X_i(t) = 1}. Hence, the expected increment of the number of failed nodes in a network equals the total change of the aggregated probabilities of the nodal statuses. An increment ∆N(t) in the total number of nodes in the failure state can result from either newly failed or newly recovered nodes. To further characterize the temporal process N(t), we define a failure process and a recovery process, respectively.
Definition: Failure and recovery processes: The failure process {N_f(t), t > 0} counts the cumulative number of failures up to time t, and the recovery process {N_r(t), t > 0} counts the cumulative number of recoveries up to time t. Assume τ > 0 is sufficiently small that at most one failure occurs during (t, t + τ); then the increment in the number of failures satisfies ∆N_f(t) ∈ {0, 1}, where ∆N_f(t) = N_f(t + τ) − N_f(t). Similarly, for a sufficiently small τ, it can be assumed that at most one recovery occurs during (t, t + τ), so that ∆N_r(t) ∈ {0, 1}. Furthermore, we assume that at time t_0 = 0, N(t) = 0, N_f(t) = 0, and N_r(t) = 0. Then ∆N(t) = ∆N_f(t) − ∆N_r(t) (Eq. 8). Aggregating the increments in Eq. 8 from 0 to t, we have E{N(t)} = E{N_f(t)} − E{N_r(t)} (Eq. 9). Hence, the expected number of nodes in the failure state equals the difference between the expected number of failures and the expected number of recoveries. This characterizes how the status of a distribution network changes from the present to the near future.
For practicality, the empirical processes given by the sample means N̂(t), N̂_f(t), and N̂_r(t) can be used to estimate the true expectations E{N(t)}, E{N_f(t)}, and E{N_r(t)}, respectively. Eq. 9 can then be written in terms of the empirical processes as N̂(t) = N̂_f(t) − N̂_r(t). The empirical processes and the failure-and-recovery equation allow learning from field data, which shall be elaborated in Section VI.
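As an illustrative sketch (not part of the original analysis; data and function name are hypothetical), the empirical processes above can be computed directly from a list of (failure time, duration) records:

```python
def empirical_processes(records, t):
    """Empirical counting processes at time t from (failure_time, duration) records.

    Returns (N_f, N_r, N) where N_f counts failures occurring by time t,
    N_r counts recoveries completed by time t, and N = N_f - N_r is the
    number of nodes still in the failure state (the empirical form of Eq. 9).
    """
    n_f = sum(1 for (tf, d) in records if tf <= t)
    n_r = sum(1 for (tf, d) in records if tf + d <= t)
    return n_f, n_r, n_f - n_r

# Hypothetical outage records: (failure time, duration) in hours.
records = [(1.0, 2.0), (1.5, 0.5), (3.0, 4.0)]
```

For example, at t = 2.5 two failures have occurred and one has recovered, so one node remains in the failure state.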

IV. NON-STATIONARY FAILURE AND RECOVERY
We now focus on the temporal processes to derive non-stationary characteristics of failure and recovery, individually and jointly. We derive closed-form expressions for special cases of failure and recovery processes. Our study reveals the pertinent parameters which shall be used to quantify resilience in Section V.

A. Failure Process
Let λ_f(t) be the intensity function of the failure process: λ_f(t) is the expected number of new failures per unit time at epoch t, i.e., λ_f(t) = lim_{τ→0} E{∆N_f(t)}/τ. λ_f(t) is also referred to as the rate function of the failure process N_f(t).
The larger λ_f(t) is, the more failures occur per unit time. Hence, the failure rate quantifies the intensity of failure occurrence. A non-stationary failure process has a time-varying intensity function λ_f(t). Assuming the failure process begins at t = 0, we have E{N_f(t)} = ∫_0^t λ_f(s) ds. The probability density function f(t) of the failure occurrence time can be obtained from λ_f(t) as f(t) = λ_f(t)/∫_0^∞ λ_f(s) ds (Eq. 13). For a special case of a general failure process, λ_f(t), as the first moment, can completely determine the failure process N_f(t). Consider an independent failure process where the number of failures N_f(t) is a counting process with independent increments ∆N_f(t). Among such random processes, the non-homogeneous Poisson process (NHPP) [15] captures the non-stationary nature of failures in a parametric form; its parameter is the intensity function λ_f(t). The non-stationarity refers to a common characteristic of large-scale external disruptions, where power failures occur at different intensities at different times. The definition of an NHPP, applied to a failure process, is provided below.
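An NHPP with a given time-varying intensity can be simulated by Lewis-Shedler thinning, which is useful for checking fitted rate functions against data. The sketch below is our own illustration (the intensity `lam` is a made-up surge-like example, not the paper's fitted rate):

```python
import random

def sample_nhpp(rate, t_end, rate_max, rng=random.Random(7)):
    """Sample failure times on [0, t_end) from an NHPP with intensity `rate`
    by Lewis-Shedler thinning: draw candidate arrivals from a homogeneous
    Poisson process at rate_max, keep each with probability rate(t)/rate_max.
    rate_max must upper-bound rate(t) on [0, t_end)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)          # next candidate arrival
        if t >= t_end:
            return times
        if rng.random() < rate(t) / rate_max:   # thinning step
            times.append(t)

# Illustrative intensity: low day-to-day rate plus a short failure surge.
lam = lambda t: 0.5 + (4.0 if 2.0 <= t < 3.0 else 0.0)
failures = sample_nhpp(lam, 10.0, 4.5)
```

The sampled times are sorted by construction and cluster inside the surge window, mimicking the non-stationary behavior described above.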
Definition: {N_f(t), t ≥ 0} is a non-homogeneous Poisson process with time-varying rate function λ_f(t), t ≥ 0, if N_f(t) satisfies the following conditions: (a) N_f(0) = 0; (b) N_f(t) has independent increments; (c) P{∆N_f(t) = 1} = λ_f(t)τ + o(τ); and (d) P{∆N_f(t) ≥ 2} = o(τ). In Section VI, we shall show using real data that a failure process {N_f(t)} from a hurricane follows an NHPP with λ_f(t) as the failure rate function.

B. Recovery Process
We now define the intensity function λ_r(t) of the recovery process: λ_r(t) is the expected number of new recoveries per unit time at epoch t, i.e., λ_r(t) = lim_{τ→0} E{∆N_r(t)}/τ. λ_r(t) is also referred to as the rate function of the recovery process N_r(t).
A non-stationary recovery process has a time-varying intensity function λ_r(t). Assuming the temporal failure process begins at t_0 = 0, we have E{N_r(t)} = ∫_0^t λ_r(s) ds. The recovery rate characterizes how rapidly recovery occurs, which is measured by the failure duration D. For a non-stationary recovery process, a failure duration depends on when the failure occurs, as illustrated in Figure 1. Such non-stationarity of recovery is characterized by g(d|t), the conditional probability density function of the failure duration D given the failure time T. For a given threshold d_0 > 0, the conditional probability that the duration is bounded by d_0 for failures occurring around time t is P{D < d_0|t} = ∫_0^{d_0} g(v|t) dv, where D is a random failure duration. When d_0 is sufficiently small, this probability characterizes rapid recovery that occurs shortly after failure. For a given d_0, the larger P{D < d_0|t} is, the more dominant rapid recovery is. For a given desired value of the probability P{D < d_0|t}, the smaller d_0 is, the more dominant rapid recovery is.
Rapid recovery is referred to as infant recovery. The terminology is borrowed from infant mortality in survivability analysis [25]. Infant recovery is a desirable characteristic of the smart grid. In contrast, slow recovery is referred to as aging recovery, analogous to aging mortality [26]. Infant and aging recovery can be formally defined as follows.
Definition: Infant and aging recovery: Let d_0 > 0 be a threshold value. If a node remains in failure for a duration less than d_0, its recovery is an infant recovery; otherwise, it is an aging recovery. Infant recovery is characterized by P{D < d_0|t}. Note that P{D < d_0|t} is a function of the failure occurrence time. As we shall show through a real-life example in Section VI, failures occurring at different times may experience infant and/or aging recovery to different degrees, showing the non-stationarity of a recovery process.
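To make the time dependence of P{D < d_0|t} concrete, a small sketch (hypothetical data; the function name is our own) estimates it empirically as the fraction of infant recoveries among failures occurring in a given time window:

```python
def infant_recovery_prob(records, d0, t_lo, t_hi):
    """Empirical P{D < d0 | t in [t_lo, t_hi)}: among failures occurring in
    the window, the fraction that recover within d0 (infant recoveries)."""
    durs = [d for (tf, d) in records if t_lo <= tf < t_hi]
    if not durs:
        return None  # no failures observed in this window
    return sum(1 for d in durs if d < d0) / len(durs)

# Hypothetical (failure time, duration) records in hours: failures during
# a surge (t in [2, 3)) tend to last much longer than day-to-day failures.
records = [(0.5, 1.0), (0.8, 2.0), (2.2, 30.0), (2.4, 40.0), (2.9, 1.5)]
```

With d_0 = 24 hours, the early window is all infant recovery while the surge window is mostly aging recovery, illustrating how the probability varies with the failure time.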

C. Joint Failure-and-Recovery Process
A joint failure-and-recovery process characterizes an entire life cycle of failure and recovery through the total number of nodes N(t) in the failure state at time t (Eq. 4). The expected number of nodes in failure can be written in terms of the rate functions as E{N(t)} = ∫_0^t λ_f(s) ds − ∫_0^t λ_r(s) ds (Eq. 17). A failure-and-recovery process (FRP) can be viewed as an alternative form of the birth-and-death process [15]. However, commonly-used birth-and-death processes have a stationary distribution of failure duration and assume independence between the failure occurrence time t and the failure duration d. Here, neither assumption is needed. This means that failures occurring at different times can last for different durations. For example, under strong and sustained hurricane winds, failures that do not happen in day-to-day operation can occur due to, e.g., falling debris and power lines. We shall further elaborate on this through the real-life example in Section VI.
The failure process and the recovery process are related through the failure durations. In the special case where failure follows a non-homogeneous Poisson process, the following theorem shows that the recovery process is also a non-homogeneous Poisson process, parameterized by the recovery rate λ_r(t). In addition, λ_r(t) can be expressed in terms of the failure rate λ_f(t) and the distribution g(d|t) of the failure duration.
Theorem: Assume that (a) the independent failure occurrences {T} obey a non-homogeneous Poisson process {N_f(t)} with intensity function λ_f(t), and (b) the failure duration {D} follows a conditional probability density function g(d|t) for d ≥ 0, t ≥ 0. Then the recovery times {T + D} are drawn from a non-homogeneous Poisson process {N_r(t)} with recovery intensity function λ_r(t) = ∫_0^t λ_f(s) g(t − s|s) ds (Eq. 18). The proof of the theorem is given in Appendix A. Hence, if the failure process is non-homogeneous Poisson, our general formulation of the temporal random process reduces to an M_t/G_t/∞ queue. M_t indicates time-dependent Poisson failures. G_t indicates a non-stationary distribution of failure duration. ∞ stands for an infinite number of repair servers, so that recovery can begin right after failure. Furthermore, when g(d|t) = g(d) is stationary, λ_r(t) reduces to the convolution of the failure rate and the service-time distribution g(d), and the M_t/G_t/∞ model reduces to the M_t/G/∞ queue developed in [17].
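The theorem's recovery intensity λ_r(t) = ∫_0^t λ_f(s) g(t − s|s) ds can be evaluated numerically. As a sanity-check sketch (our own illustration, not from the paper), the quadrature below is compared against the known closed form for a constant failure rate λ_0 with stationary exponential durations, for which λ_r(t) = λ_0(1 − e^{−μt}):

```python
import math

def recovery_rate(lam_f, g, t, n=2000):
    """Midpoint-rule evaluation of lambda_r(t) = integral_0^t lam_f(s) g(t-s|s) ds,
    where g(d, s) is the conditional duration density given failure time s."""
    if t <= 0:
        return 0.0
    h = t / n
    return sum(lam_f((i + 0.5) * h) * g(t - (i + 0.5) * h, (i + 0.5) * h)
               for i in range(n)) * h

# Stationary exponential durations with constant failure rate (assumed example).
lam0, mu = 2.0, 1.5
g_exp = lambda d, s: mu * math.exp(-mu * d)
```

In the stationary case the integral is exactly the convolution mentioned above, so the numerical value should approach λ_0(1 − e^{−μt}).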

D. Special Cases
We consider two special cases where the failure, recovery, and joint processes exhibit simple analytical expressions. These expressions provide insight into an entire non-stationary life cycle of a failure-and-recovery process under different failure scenarios.
Case 1: Failure and recovery in day-to-day operation: When there are no significant external disruptions, power outages are assumed to occur randomly and sporadically. The failure intensity remains at a constant level, i.e., λ_f(t) = λ_0, where λ_0 ≥ 0 does not vary with time. The number of failures N_f(t) thus follows a homogeneous Poisson process in day-to-day operation. The recovery rate in day-to-day operation, λ_{r,0}(t), from Eq. 18 reduces to λ_{r,0}(t) = λ_0 ∫_0^t g(t − s|s) ds. As t → ∞, λ_{r,0}(t) → λ_0. This suggests that the recovery process in day-to-day operation is also a homogeneous Poisson process over a long time horizon. The expectation E{N_0(t)} of the joint failure-and-recovery process in day-to-day operation is then E{N_0(t)} = λ_0 t − ∫_0^t λ_{r,0}(s) ds.
Case 2: Surging failure during a natural disaster: Now consider a disaster scenario where failures occur suddenly and intensely: λ_f(t) increases from a small value λ_0 to a large value within a short time duration 0 ≤ t < t_1. The failure rate can then be written as λ_f(t) = λ_0 + λ_m(t)[u(t) − u(t − t_1)] (Eq. 21), where u(t) is the unit step function and max_t λ_m(t) ≫ λ_0. This corresponds to an extreme case where a disaster causes sudden failures at time 0 and then weakens right after.
λ_r(t) in Equation 18 then becomes λ_r(t) = ∫_0^{min{t, t_1}} λ_m(s) g(t − s|s) ds + λ_{r,0}(t) (Eq. 23). When t ≫ t_1, λ_r(t) ≈ λ_0. Furthermore, when t_1 is sufficiently small, i.e., when the surge of failures upon the disaster lasts only a short time, λ_r(t) ≈ λ_m(0) g(t|0) min{t, t_1} + λ_{r,0}(t). When the peak of the surging failures satisfies λ_m(0) ≫ λ_0, the recovery rate in Eq. 23 is approximately proportional to the surging failure rate and the distribution of the recovery time immediately after the disaster. In the long run, λ_r(t) reduces to λ_0.
Substituting Eqs. 21 and 23 into Eq. 17, we obtain the expectation of the failure-and-recovery process under surging failures, E{N(t)} = λ_0 t − ∫_0^t λ_{r,0}(s) ds + ∫_0^{min{t, t_1}} λ_m(s)[1 − G(t − s|s)] ds (Eq. 24), where G(t|s) = ∫_0^t g(v|s) dv is the conditional cumulative distribution function (cdf) of the failure duration given failure time s.
Additional characteristics of recovery can further reduce the above expressions. When infant recovery completely dominates a recovery process, g(d|0) = 0 for d > d_0. Hence, for an impulse-like surge of failures at t = 0, the recovery resulting from the surge lasts a duration d_0, i.e., λ_r(t) ≈ λ_m(0) g(t|0) min{t, t_1}[u(t) − u(t − d_0)] + λ_{r,0}(t) (Eq. 25). Eq. 24 under dominating infant recovery becomes E{N(t)} ≈ E{N_0(t)} + λ_m(0) min{t, t_1}[1 − G(t|0)][u(t) − u(t − d_0)] (Eq. 26). Here the failure duration d_0 is assumed to be larger than the end t_1 of the failure surge, to simplify the expression. Eqs. 25 and 26 show that when a recovery process consists of only infant recovery, all failures due to the disaster recover within d_0 of the failure eruption. After d_0, the distribution network resumes day-to-day operation.
When aging recovery dominates the recovery process, g(d|0) ≈ 0 for d < d_0. This implies that recovery begins with a delay of d_0 after the surge of failures. The corresponding recovery rate from Equation 23 reduces to λ_r(t) ≈ λ_m(0) g(t|0) min{t, t_1} u(t − d_0) + λ_{r,0}(t) (Eq. 27), and Eq. 24 under dominating aging recovery becomes E{N(t)} ≈ E{N_0(t)} + λ_m(0) min{t, t_1}[1 − G(t|0)] (Eq. 28), where G(t|0) ≈ 0 for t < d_0, so that the surge-induced failures persist until d_0. Here we also assume d_0 > t_1. Eq. 28 shows another extreme case where recovery does not begin until a delay of d_0 after the failure eruption; the failures then start to recover slowly. Hence, aging recovery reflects the difficulty of resuming service to customers, and is thus an undesirable characteristic for the smart grid.
In a general failure-and-recovery process, recovery begins as soon as failures occur. During the disaster, the failure process dominates and E{N(t)} increases rapidly. Afterwards, the recovery process dominates. On a large time scale, and when the disaster lasts for a short period of time, the overlap between the failure process and the recovery process can be neglected. Thus the failure-and-recovery process splits into the individual failure and recovery processes.

V. RESILIENCE
We now derive network resilience using the pertinent parameters for an entire life cycle of non-stationary failure and recovery.

A. Definition
An intuition for how to define resilience follows from the previous section. Resilience should be characterized by the virulence of failures λ_f(t), the speed of recovery g(d|t), and the threshold d_0. These three parameters determine infant recovery. In general, a distribution network experiences a combination of infant and aging recovery. Intuitively, a network is more resilient when it exhibits more infant recovery. Hence we define resilience as the probability of infant recovery.
Definition: Resilience: Given a threshold value d_0 on the failure duration, the resilience of a power distribution network is defined as s(d_0) = P{D < d_0}. At a high level, s(d_0) measures the resilience of the grid to large-scale disruptions, and reflects the ability of a distribution network as a whole to survive large-scale failures. Here we use P{D < d_0} rather than the time-dependent conditional probability P{D < d_0|t}. This is because resilience, viewed as a network-wide quantity, should encompass an entire life cycle of failure and recovery and not depend on a specific time epoch t.

B. Resilience Parameters
Resilience can be expressed explicitly in terms of the three pertinent parameters as s(d_0) = ∫_0^∞ G(d_0|t) f(t) dt (Eq. 30), where G(d_0|t) = P{D < d_0|t} and f(t) is the probability density function of the failure time given in Equation 13. Eq. 30 provides an alternative view of resilience: s(d_0) is the expected value of the probability of infant recovery, G(d_0|t), averaged over the non-stationary failure process. This expression shows explicitly how resilience is determined by the three pertinent parameters.
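Because f(t) weights failure times by their occurrence rate, averaging G(d_0|t) over f(t) coincides with the unconditional fraction of failures that recover within d_0. A minimal empirical sketch (hypothetical durations; the function name is our own):

```python
def resilience(durations, d0):
    """Empirical resilience s(d0) = P{D < d0}: the fraction of all observed
    failure durations shorter than d0, i.e., the fraction of infant recoveries."""
    return sum(1 for d in durations if d < d0) / len(durations)

# Hypothetical failure durations in hours (three short, two prolonged).
durations = [0.5, 1.0, 2.0, 30.0, 48.0]
```

With d_0 = 24 hours, three of the five failures are infant recoveries, giving s(24) = 0.6.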

C. Threshold
d_0 is another pertinent parameter that determines resilience. For a given value of d_0, if infant recovery dominates the recovery process over the entire failure duration, s(d_0) → 1. If aging recovery dominates the recovery process over the entire failure process, s(d_0) → 0. Thus, a larger s(d_0) represents a more resilient power distribution network. When s(d_0) = 0.5, infant and aging recovery carry equal weight.
How should the value of d_0 be determined? d_0 can result from practical considerations or customer requirements. For example, if recovering within 24 hours is regarded as acceptable for a disaster scenario, then d_0 = 24 hours defines infant recovery.
A value for d_0 can also be determined more objectively. The intuition comes from the fact that if a network is dominated by infant recovery, most failure durations are less than d_0. Hence, the slope of s(x) is relatively large for x < d_0 and relatively small for x > d_0. This implies that d_0 corresponds to a pertinent change point in the slope of s(x), i.e., a deep valley in the second derivative d²s(x)/dx². This is illustrated in Fig. 2(a). In contrast, when a network is dominated by aging recovery, most of the failure durations are larger than d_0. Thus the slope of s(x) is small for x < d_0 and large for x > d_0; d_0 then corresponds to a positive peak in d²s(x)/dx², as illustrated in Fig. 2(b). Hence, d_0 is determined by the largest magnitude of the second derivative of s(x). We shall provide an example using real data in Section VI.
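The change-point rule above (the x with the largest-magnitude second derivative of s(x)) can be sketched with a discrete second difference. This is our own illustration; the grid and curve values are hypothetical:

```python
def estimate_d0(xs, s_vals):
    """Pick the threshold d0 as the grid point where |d^2 s / dx^2| is largest,
    using central second differences on a uniform grid xs."""
    h = xs[1] - xs[0]
    best_i, best_mag = 1, -1.0
    for i in range(1, len(xs) - 1):
        second = (s_vals[i + 1] - 2 * s_vals[i] + s_vals[i - 1]) / h ** 2
        if abs(second) > best_mag:
            best_i, best_mag = i, abs(second)
    return xs[best_i]

# Hypothetical s(x) dominated by infant recovery: steep rise, then a knee.
xs = list(range(11))
s_curve = [0.0, 0.4, 0.8, 0.95, 0.97, 0.98, 0.985, 0.99, 0.992, 0.994, 0.995]
```

The deepest valley of the second difference sits at the knee of the curve, which is the estimated d_0.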

VI. LARGE-SCALE OUTAGES DUE TO HURRICANE IKE
We now apply the above framework of non-stationary failure-and-recovery processes to a real-life example of large-scale utility-service disruptions caused by a hurricane. Our focus is on using real data to learn the three resilience parameters λ_f(t), g(d|t), and d_0, and then to estimate the resilience of an operational power distribution network.

A. Real Data and Processing
Hurricane Ike was one of the strongest hurricanes of 2008. Ike caused large-scale power failures, leaving more than 2 million customers without electricity, and was the second-costliest Atlantic hurricane of all time [27] [28].
As reported by the National Hurricane Center [29], the storm started to cause power outages across the onshore areas. Widespread power outages were reported across Louisiana and Texas starting September 12 [28]. A major utility provider collected data on power outages from more than ten counties. The data set contains groups of failures that occurred within a minute of one another. As a minute is the smallest time scale for each sample, such groups are considered as dependent failures. Dependent failures are grouped into one failed entity (i), with a unique failure occurrence time t_i and duration d_i. After this preprocessing, the resulting data set has 465 failed entities. Two outliers with negative failure durations are further removed. The remaining 463 failed entities, from 7 a.m. September 12 to 4 a.m. September 14, are referred to as nodes. This is the data set we use for the rest of the paper.

B. Empirical Failure Process
We now study the empirical failure process using the data set.
1) Learning the failure rate: First, we use the data set to learn the failure rate function λ_f(t). The empirical rate function is estimated using a simple moving average [30], λ̂_f(t) = [N̂_f(t + τ) − N̂_f(t − τ)]/(2τ), where τ > 0. The resulting rate function is overlaid with the samples of the number of failures N̂_f(t) in Figure 4, where each bin is of duration 1 hour and τ = 5 hours. The failure rate function shows a time-varying, i.e., non-stationary, intensity of new failure occurrences, and can be described as follows.
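The moving-average estimator can be sketched directly from the failure timestamps. The sketch below is our own illustration with made-up times; on the real data the window half-width would be the text's τ = 5 hours:

```python
def rate_estimate(failure_times, t, tau):
    """Moving-average estimate of the failure rate at time t: the number of
    failures in the window (t - tau, t + tau] divided by its length 2*tau."""
    count = sum(1 for tf in failure_times if t - tau < tf <= t + tau)
    return count / (2 * tau)

# Hypothetical failure times in hours, clustered near t = 1.
times = [0.2, 0.9, 1.1, 1.3, 5.0]
```

For example, with τ = 1 the estimate at t = 1 counts the four failures in (0, 2], giving a rate of 2 failures per hour.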
• Prior to 7 p.m. September 12, the intensity was low, i.e., fewer than 5 new failures occurred per hour. Hence λ_0 = 5, where λ_0 is taken as the failure rate in day-to-day operation.
• At 7 p.m. September 12, the intensity increased sharply, first to 25 new failures per hour. Over the next 6 hours, the intensity reached its peak value of nearly 50 new occurrences per hour. This is consistent with the weather report [29] that strong winds of about 145 mph and flooding impacted the onshore areas even prior to landfall. The time of the peak coincides with the landfall at 2:10 a.m. September 13 (CDT).
• After staying at this high level for about 12 hours (from 7 p.m. September 12 to 7 a.m. September 13), the intensity decreased rapidly back to a low level of fewer than 5 new failures per hour.
2) Non-Homogeneous Poisson Model: We now consider the hypothesis H_0 that these 463 failure occurrences are governed by a non-homogeneous Poisson process (NHPP). We perform Pearson's test [31] of the hypothesis H_0 that the empirical failure process N̂_f(t) follows an NHPP with intensity function λ̂_f(t). The test concerns two aspects: (a) the independence of the outages occurring at the large time scale of minutes, and (b) whether the sample mean, i.e., the empirical rate function λ̂_f(t), is sufficient for characterizing the failure process.
We divide the time duration from 7 a.m. September 12 to 4 p.m. September 14 into 400 intervals. In each interval, the number of new failures is compared with its expectation. The sum of the squared errors over all intervals results in a chi-square statistic. The chi-square statistic is χ² = 0.79, with 2 degrees of freedom. For a confidence level of 95%, the threshold value is χ²_{0.05,2} = 5.99, where Pr(χ² < χ²_{0.05,2}) = 0.95. The chi-square statistic obtained from the data is below the threshold, χ² < χ²_{0.05,2}. Hence hypothesis H_0 is not rejected. The detailed procedure of Pearson's test is given in Appendix B.
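Pearson's statistic itself is simple to compute. The sketch below uses made-up binned counts (not the paper's data) to form Σ(O − E)²/E and compare it with the 95% critical value χ²_{0.05,2} = 5.99 quoted in the text:

```python
def pearson_chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over intervals."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical binned failure counts vs. NHPP-expected counts per interval.
obs = [4, 26, 48, 30, 6]
exp = [5, 25, 50, 28, 5]

stat = pearson_chi_square(obs, exp)
# Compare against the 95% critical value used in the text (2 degrees of freedom).
reject = stat > 5.99
```

When the statistic falls below the critical value, as here, the NHPP hypothesis is not rejected.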
However, not rejecting H_0 is insufficient for accepting the hypothesis. The goodness of fit of the NHPP to the data is further validated through the Quantile-Quantile (QQ) plot in Figure 5, where the samples are compared with a non-homogeneous Poisson process with intensity function λ̂_f(t). The figure shows that the non-homogeneous Poisson process with the learned intensity function indeed fits the data well. Based on the result of Pearson's hypothesis test and the QQ plot, we conclude that these 463 power failures occurred independently and obey a non-homogeneous Poisson distribution.
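As a sanity check on this modeling assumption, an NHPP sample path can be simulated and compared against the empirical counts. The sketch below generates arrivals by thinning (Lewis-Shedler); the piecewise intensity `lam` is a hypothetical stand-in that only mimics the reported profile (roughly 5 failures/hour in day-to-day operation, near 50/hour around landfall), not the learned λ̂_f(t).

```python
import numpy as np

def sample_nhpp(intensity, t_end, lam_max, rng):
    """Draw arrival times of a non-homogeneous Poisson process on [0, t_end]
    by thinning (Lewis-Shedler): propose homogeneous arrivals at rate lam_max
    and accept each proposal at time t with probability intensity(t) / lam_max."""
    t, arrivals = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > t_end:
            return np.array(arrivals)
        if rng.random() < intensity(t) / lam_max:
            arrivals.append(t)

def lam(t):
    """Hypothetical piecewise intensity (failures/hour): low before the
    storm, peaked around landfall, then low again."""
    return 50.0 if 12.0 <= t < 24.0 else 5.0

rng = np.random.default_rng(0)
arrivals = sample_nhpp(lam, 36.0, 50.0, rng)  # about 660 arrivals expected
```

Binning such simulated arrivals per interval and comparing them with the observed counts is the basis of both the chi-square statistic and the QQ plot.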

C. Empirical Recovery Process
We now learn the empirical recovery process characterized by g(d|t), the conditional probability density function of failure duration d given failure occurrence time t. Our objective here is to identify infant and aging recovery.

2) Mixture Model: Given failure time t, we select a mixture model as the probability density function of duration d > 0,

g(d|t) = Σ_{j=1}^{l(t)} ρ_j(t) g_j(d|t),

where l(t) is the number of mixture components at time t, ρ_j(t) (1 ≤ j ≤ l(t)) is the weighting factor of the jth component g_j(d|t), and Σ_j ρ_j(t) = 1. All of these quantities vary with failure time t for a non-stationary recovery process.
A mixture model is chosen because its parameters have an interpretable physical meaning. A parametric family of Weibull mixtures is particularly appealing, as its parameters correspond directly to infant and aging recovery [15]. Weibull distributions have been widely used in survival analysis [26] [25] and reliability theory [15], but not in characterizing recovery from large-scale disturbances and resilience parameters. Specifically, a Weibull density is

g_j(d|t) = (k_j(t)/γ_j(t)) (d/γ_j(t))^{k_j(t)−1} exp{−(d/γ_j(t))^{k_j(t)}},

where d > 0, and k_j(t) and γ_j(t) are the shape and scale parameters, respectively. The shape parameter k_j(t) is especially important, as it determines the type of recovery signified by a mixture component: the smaller k_j(t), the faster g_j(d|t) decays, the shorter the failure duration, and thus the faster the recovery. Hence k_j(t) < 1 corresponds to infant recovery whereas k_j(t) > 1 corresponds to aging recovery. The weighting factor ρ_j(t) signifies the importance of component g_j(d|t). For a non-stationary recovery process, these parameters vary with failure time t.
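As a minimal illustration of this parametric family, the following sketch evaluates a Weibull mixture density and CDF. The two components and all parameter values are hypothetical, chosen only so that one component has k < 1 (infant recovery) and the other k > 1 (aging recovery); they are not the fitted values of Table I.

```python
import math

def weibull_pdf(d, k, gamma):
    """g_j(d) = (k/gamma) * (d/gamma)^(k-1) * exp(-(d/gamma)^k), d > 0."""
    return (k / gamma) * (d / gamma) ** (k - 1) * math.exp(-((d / gamma) ** k))

def weibull_cdf(d, k, gamma):
    """G_j(d) = 1 - exp(-(d/gamma)^k)."""
    return 1.0 - math.exp(-((d / gamma) ** k))

def mixture_pdf(d, params):
    """g(d) = sum_j rho_j * g_j(d); params is a list of (rho_j, k_j, gamma_j)."""
    return sum(rho * weibull_pdf(d, k, g) for rho, k, g in params)

def mixture_cdf(d, params):
    """G(d) = sum_j rho_j * G_j(d)."""
    return sum(rho * weibull_cdf(d, k, g) for rho, k, g in params)

# Hypothetical two-component mixture: k < 1 (infant recovery, hours-scale
# durations) and k > 1 (aging recovery, days-scale durations).
params = [(0.6, 0.8, 4.0), (0.4, 2.5, 48.0)]
```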

3) Learning Mixture Parameters:
The parameters of the Weibull mixtures are learned from the data. For simplicity, we use a piecewise homogeneous function to approximate g(d|t): the failure time is divided into m intervals {ψ_i, 1 ≤ i ≤ m}. Within an interval ψ_i, g(d|t ∈ ψ_i) = g_i(d) is assumed to be time homogeneous, i.e., it does not vary with failure time t. Across intervals, the g(d|t ∈ ψ_i)'s have different parameters, capturing the non-stationarity. Referring to Fig.1, we divide the time into 5 intervals {ψ_i, i = 1, ..., 5}, whose boundaries are depicted by dashed lines in Fig.1. Within each interval, the parameters of the Weibull mixture distribution are learned through maximum likelihood estimation [32], and are shown in Table I. Each mixture component represents one cluster of failure durations. For example, the distribution of durations in ψ_1 is depicted in Fig.6, where we observe 3 clusters.
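The estimation step can be sketched under a strong simplification: the code below fits a single Weibull component by maximum likelihood using the standard fixed-point equations for the shape parameter, rather than the full mixture MLE (the EM-style allocation of durations to components is omitted). The synthetic sample stands in for the durations of one interval ψ_i.

```python
import numpy as np

def weibull_mle(durations, iters=200):
    """Maximum-likelihood fit of a single Weibull(k, gamma) via damped
    fixed-point iteration on the shape parameter k: at the MLE,
    1/k = sum(d^k * ln d) / sum(d^k) - mean(ln d), and then
    gamma = (mean(d^k))^(1/k)."""
    d = np.asarray(durations, dtype=float)
    logd = np.log(d)
    k = 1.0
    for _ in range(iters):
        dk = d ** k
        k_new = 1.0 / (np.sum(dk * logd) / np.sum(dk) - logd.mean())
        k = 0.5 * k + 0.5 * k_new  # damping for numerical stability
    gamma = np.mean(d ** k) ** (1.0 / k)
    return k, gamma

# Synthetic durations with known parameters (k = 1.5, gamma = 10 hours).
rng = np.random.default_rng(1)
sample = rng.weibull(1.5, size=5000) * 10.0
k_hat, gamma_hat = weibull_mle(sample)
```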
The failure durations exhibit distinct distributions across intervals. For example, ψ_1 (7 a.m. September 12 to 7 p.m. September 12) corresponds to the time before the hurricane. In this interval, the network was not yet impacted by Hurricane Ike and was under day-to-day operation; most failure durations were as short as a few hours. ψ_2 (7 p.m. September 12 to 3 a.m. September 13) corresponds to the time right before the landfall, when the hurricane began to cause large-scale failures. In this interval, more prolonged failures occurred and recovery became more difficult than in day-to-day operation.

D. Overall Failure-and-Recovery Process
We now compare the empirical failure-and-recovery process {N(t)} with the learned process N̂(t). N(t) is obtained by directly adding up the number of failed nodes from the actual data. N̂(t) is obtained by reconstructing the temporal process with the learned λ̂_f(t) and λ̂_r(t) through Eq.17. Fig.7 shows the comparison. The dotted N̂(t) is obtained under the assumption that g(d) is stationary over time. The dashed N̂(t) is obtained with the piecewise stationary g(d|t) given in Table I. All estimated processes capture the trend in the data. However, the reconstruction with the stationary distribution of failure durations g(d) deviates significantly from the actual sample path N(t). This shows that the piecewise stationary g(d|t) better approximates the actual failure-and-recovery process.
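The reconstruction can be sketched numerically: the expected number of nodes in failure at time t accumulates failures that occurred at times s ≤ t and have not yet recovered by t. The intensity `lam_f` and survival function below are toy stand-ins for the learned λ̂_f(t) and 1 − G(d|t), not the fitted quantities.

```python
import numpy as np

def lam_f(s):
    """Hypothetical failure intensity (failures/hour): toy storm profile."""
    s = np.asarray(s, dtype=float)
    return np.where((s >= 12.0) & (s < 24.0), 50.0, 5.0)

def survival(d):
    """P(failure duration > d): toy exponential durations with mean 20 hours."""
    return np.exp(-np.asarray(d, dtype=float) / 20.0)

def expected_failed(t, ds=0.01):
    """E[N(t)] = integral_0^t lam_f(s) * (1 - G(t - s)) ds: failures that
    occurred at time s and are still unrepaired at time t (Riemann sum)."""
    s = np.arange(0.0, t, ds)
    return float(np.sum(lam_f(s) * survival(t - s)) * ds)
```

The toy process peaks shortly after the storm window and then relaxes back toward its day-to-day level, mirroring the qualitative shape of Fig.7.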

E. Resilience
(Fig. 7. Comparison between the joint failure-recovery process N(t) from the data set and the reconstructed process N̂(t) using learned parameters.)

To evaluate resilience, we first calculate the probability of infant recovery within each interval ψ_i, as shown in Table I. The resilience curve s(x) is then obtained by averaging these probabilities over failure time. Fig.8 depicts s(x) together with its second derivative d²s(x)/dx². s(x) is a concave function. Hence the threshold on recovery time is obtained as 13 hours, and we compute the resilience s(13) = 0.4722. That is, 47.22% of the recoveries occurred within 13 hours whereas 52.78% occurred later than 13 hours. A resilience value close to 0.5 reveals the combined nature of infant and aging recovery in the distribution network. Specifically, s(13) < 0.5 indicates a slight dominance of aging recovery over infant recovery.
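The resilience computation can be sketched as follows: s(x) is estimated as the empirical probability that a failure recovers within x hours. The duration sample below is synthetic, drawn from two hypothetical clusters standing in for infant and aging recovery; only the 13-hour threshold echoes the text.

```python
import numpy as np

# Synthetic failure durations (hours) from two clusters standing in for
# infant recovery (k < 1, short) and aging recovery (k > 1, long).
rng = np.random.default_rng(2)
durations = np.concatenate([
    rng.weibull(0.8, 500) * 6.0,
    rng.weibull(2.0, 500) * 60.0,
])

def s(x):
    """Resilience curve: probability that a failure recovers within x hours,
    estimated here as the empirical CDF of the synthetic durations."""
    return float(np.mean(durations <= x))

resilience = s(13.0)  # fraction of recoveries completed within 13 hours
```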

VII. DISCUSSION AND CONCLUSION
A non-stationary random process has been derived to model large-scale failure and recovery of a power distribution network under an external disturbance. The non-stationary failure and recovery are characterized by a time-varying failure rate and a time-varying probability distribution of recovery time. These two quantities, together with a threshold on recovery time, define network resilience as the probability of rapid recovery. Analytical expressions have been derived to characterize the non-stationary failure and recovery under extreme conditions. Real data has been obtained from an operational distribution network for a large-scale external disturbance, Hurricane Ike. Model parameters of the failure-recovery process have been estimated through non-stationary learning on the real data. Pearson's statistical test has been applied to validate the assumptions of the failure model. The resilience metric has been evaluated, showing for the operational network that 47% of the 5000+ failures recovered within 13 hours whereas the remaining failures lasted as long as 12 days. This provides a quantitative measure of the resilience of the distribution network and a baseline for improvement. How to improve resilience by reducing failures remains to be incorporated into the resilience metric. Non-stationary learning algorithms also need to be further developed for on-line learning of time-varying data.

VIII. ACKNOWLEDGEMENT
The authors would like to thank Chris Kung, Jae Won Choi, Daniel Burnham and Xinyu Dai at Georgia Tech for data processing, and Anthony Kuh at the University of Hawaii for helpful discussions on distribution networks. Support from the National Science Foundation (ECCS 0952785) is gratefully acknowledged.

APPENDIX A
Proof of Theorem: To prove that the recovery process {N_r(t)} is a Poisson process, we need only show that its increment, i.e., the number of recoveries occurring in (t, t + τ), is an independent Poisson random variable. Consider non-overlapping intervals O_1, ..., O_k. We say a failure is of type i, i = 1, ..., k, if it recovers in interval O_i. The number of type-i events equals the number of recoveries occurring in O_i. By Proposition 5.3 in [15], the number of recoveries occurring in O_i, i.e., the increment of the recovery process, is an independent Poisson random variable with mean

∫_0^t P_i(s) λ_f(s) ds,    (37)

where P_i(s) is the probability that a failure occurring at time s is of type i. Thus {N_r(t)} is a Poisson process.

We next compute λ_r(t) and show the non-homogeneity. Consider the interval (t, t + τ) and a failure that occurred at time s, where s < t. The probability that this failure recovers during (t, t + τ) is G(t + τ − s|s) − G(t − s|s). Eq.37 then becomes

E[N_r(t + τ) − N_r(t)] = ∫_0^t [G(t + τ − s|s) − G(t − s|s)] λ_f(s) ds,    (38)

where G(t + τ − s|s) − G(t − s|s) = g(t − s|s)τ + o(τ). We can rewrite Eq.38 as

E[N_r(t + τ) − N_r(t)] = τ ∫_0^t g(t − s|s) λ_f(s) ds + o(τ).

The coefficient of τ on the right-hand side is just λ_r(t) (Eq.14). Hence the recovery rate function is

λ_r(t) = ∫_0^t g(t − s|s) λ_f(s) ds.

Due to the non-homogeneity of {N_f(t)}, λ_f(t) is a time-varying function, so λ_r(t) is also time-varying. This establishes the non-homogeneity of {N_r(t)} and completes the proof.

APPENDIX B
Pearson's Hypothesis Test:
The hypothesis test is based on a chi-square statistic that compares the failure occurrence times with their sample mean. The details of testing H_0, i.e., that the failure occurrence times are drawn from an NHPP {N_f(t)}, are given here.
1. Divide the observation period into disjoint time intervals.
2. Count the number of failure occurrences in each interval.
3. Count O_j, the observed number of intervals with j failure occurrences, where j = 0, 1, ..., k.
4. Use the estimated λ̂_f(t) to compute E_j, the expected number of intervals with j failure occurrences.
5. Compute χ² = Σ_{j=0}^{k} (O_j − E_j)²/E_j. χ² is a chi-square statistic with degree of freedom dof = k − (number of independent parameters fitted) − 1. Since one parameter, λ̂_f(t), is fitted, the degree of freedom is k − 2.
6. Given a confidence level, for instance 95%, obtain the threshold value χ²_{0.05,dof}. The hypothesis H_0 is rejected if χ² > χ²_{0.05,dof}; otherwise, H_0 cannot be rejected.
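The procedure above can be sketched in code under a simplifying assumption: the expected bin occupancies E_j are computed here from a single constant mean count per interval, whereas the paper derives them from the time-varying λ̂_f(t). The pooling of large counts into a tail bin and the mean rate of the toy data are illustrative choices.

```python
import math
import random
from collections import Counter

def pearson_nhpp_test(counts, mean_count, k_max):
    """Pool per-interval failure counts into bins j = 0, ..., k_max (counts
    >= k_max share the tail bin), compare the observed bin occupancies O_j
    with Poisson expectations E_j, and return the chi-square statistic with
    dof = (number of bins) - 1 - 1, matching the one-fitted-parameter
    convention in the text."""
    n = len(counts)
    observed = Counter(min(c, k_max) for c in counts)
    chi2, tail = 0.0, 1.0
    for j in range(k_max):
        p_j = math.exp(-mean_count) * mean_count ** j / math.factorial(j)
        tail -= p_j
        e_j = n * p_j
        chi2 += (observed.get(j, 0) - e_j) ** 2 / e_j
    e_tail = n * tail  # pooled tail bin: P(count >= k_max)
    chi2 += (observed.get(k_max, 0) - e_tail) ** 2 / e_tail
    dof = (k_max + 1) - 1 - 1
    return chi2, dof

def sample_poisson(lam):
    """Poisson count via exponential inter-arrival times on a unit interval."""
    t, n = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return n
        n += 1

# Toy check: truly Poisson counts (mean 1.2 per interval) should pass.
random.seed(4)
counts = [sample_poisson(1.2) for _ in range(400)]
chi2, dof = pearson_nhpp_test(counts, 1.2, 5)
```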