Aperiodic Checkpoint Placement Algorithms — Survey and Comparison *

In this article we summarize several aperiodic checkpoint placement algorithms for a software system over infinite and finite operation-time horizons, and compare them in terms of computational accuracy. The underlying problem is formulated as the maximization of the steady-state system availability, and the goal is to determine the optimal aperiodic checkpoint sequence. To this end, we present two exact computation algorithms, one in a forward manner and one in a backward manner, and two approximate ones: the constant hazard approximation and the fluid approximation. In numerical examples with a Weibull system-failure time distribution, it is shown that the algorithm combined with the fluid approximation can effectively calculate the exact solutions for the optimal aperiodic checkpoint sequence.


Introduction
It is well known that system failures in large-scale computer systems can lead to huge economic or critical social losses. Checkpointing and rollback recovery is a commonly used technique for improving the reliability/availability of fault-tolerant computing systems, and is regarded as a low-cost software dependability technique from the standpoint of environment diversity. In particular, when file systems that write and/or read data are designed, checkpoint (CP) generations periodically or aperiodically back up the significant data on a primary medium to safe secondary media, and play a significant role in limiting the amount of data processing needed for recovery actions after system failures occur. If CPs are taken frequently, a larger checkpointing overhead is incurred. Conversely, if only a few CPs are taken, a larger overhead is required in rollback recovery actions after a system failure. Hence, it is important to determine the optimal CP sequence by taking account of the trade-off between these two kinds of overhead. In many cases, the system failure phenomenon is described by a probability distribution called the system-failure time distribution, and the optimal CP sequence is determined based on a stochastic model. For excellent surveys on this topic, see [2,3].
Young [4] first obtains the optimal CP interval approximately for a computation restart after system failures. Baccelli [5], Chandy et al. [6,7], Dohi et al. [8][9][10], Gelenbe and Derochette [11], Gelenbe [12], Gelenbe and Hernandez [13], Goes and Sumita [14], Goes [15], Grassi et al. [16], Kobayashi and Dohi [17], Kulkarni et al. [18], Nicola and van Spanje [19], Sumita et al. [20], among others, propose performance evaluation models for database recovery, and calculate the optimal CP intervals which maximize the system availability or minimize the mean overhead during normal operation. L'Ecuyer and Malenfant [21] formulate a dynamic CP placement problem as a Markov decision process. Ziv and Bruck [22] analyze an online algorithm for probabilistic CP placement. Vaidya [23] examines the impact of checkpoint latency on the overhead ratio for a simple CP model. Okamura et al. [24] reformulate the Vaidya model [23] as a semi-Markov decision process and further develop an adaptive reinforcement learning algorithm for CP placement.
For several CP models in the literature, periodic CP intervals are implicitly assumed. This is because periodic CP intervals maximize the steady-state system availability and, in many cases, are better than randomized CP intervals given by independent and identically distributed random variables. However, it is worth noting that periodic CP strategies cannot always be justified, and in some cases they perform worse than aperiodic CP placement. In general, it is known that the way to place the optimal CP sequence strongly depends on both the kind of objective function (system availability, mean overhead, etc.) and the kind of system-failure time distribution. Since aperiodic CP placement includes periodic CP placement as a special case, it is meaningful to consider aperiodic CP placement algorithms for file systems.
When the system-failure time obeys a non-exponential distribution, it is easily shown that aperiodic CP placement is not worse than periodic CP placement. Toueg and Babaoğlu [25] develop a dynamic programming (DP) algorithm which minimizes the expected execution time of tasks by placing CPs between two consecutive tasks under very general assumptions. Kaio and Osaki [26] consider an approximate aperiodic CP placement algorithm under the assumption that the conditional system-failure probability is constant between successive CPs. Fukumoto et al. [27,28] and Ling et al. [29] propose fluid approximation methods based on a variational calculus approach to derive the cost-optimal aperiodic CP sequence. Ozaki et al. [30,31] give an exact aperiodic CP placement algorithm and further develop an estimation scheme under incomplete knowledge of the system-failure time distribution. In a fashion similar to the above approach, Dohi et al. [32] formulate another aperiodic CP placement problem with equality constraints. Iwamoto et al. [33], Okamura et al. [34,35], and Okamura and Dohi [36] propose DP-based algorithms, different from that of Toueg and Babaoğlu [25], under the availability criterion, by taking account of another dependability technique called software rejuvenation in the presence of software aging, where the system failure caused by aging is not exponentially distributed. Recently, Ozaki et al. [37] propose a fixed-point type algorithm for aperiodic CP placement with an infinite operation-time horizon. In this way, considerable attention has been paid to aperiodic CP placement problems in the past.
Nevertheless, no effective aperiodic CP placement algorithm is known yet for the case where the number of CPs is very large. The constant hazard approximation [26] and the fluid approximation [27][28][29] may work poorly in such a case. The search-based iteration algorithm in [30,31] and the DP-based algorithm in [33][34][35][36], which are regarded as exact computation algorithms, also require very careful adjustment to determine the number of CPs if the operation time of a file system is finite. As the operation time becomes longer, in general, both the determination of the aperiodic CP sequence and the resulting dependability measures become sensitive to the number of CPs. In this article we summarize several aperiodic CP placement algorithms for a software system over infinite and finite operation-time horizons, and compare them in terms of computational accuracy. We propose to combine the fluid approximation with an exact computation algorithm by using the former to determine the initial value of the number of CPs. The idea is quite simple, but we show that the algorithm combined with the fluid approximation can effectively calculate the exact solutions for the optimal aperiodic CP sequence.

Formulation of Optimal CP Placement
First, consider a centralized file system with sequential checkpoints (CPs) over an infinite time horizon. The system operation starts at time $t_0 = 0$, and the CPs are sequentially placed at times $t_1, t_2, \ldots$ to back up the data processed in the file system. At each CP $t_k$ $(k = 1, 2, \ldots)$, all the file data in main memory is saved to a safe secondary medium, where a fixed cost (time overhead) $c_0$ is needed for each CP placement. It is assumed that the system operation stops during checkpointing, so the file system does not deteriorate during the period $c_0$. System failure may occur according to an absolutely continuous and non-decreasing probability distribution function $F(t)$ having density function $f(t)$ and finite mean $\mu$. Upon a system failure, a rollback recovery takes place immediately, using the file data saved at the last CP. Next, a CP restart is performed and the file data is recovered to the state just before the system failure occurred. The time length required for the CP restart is given by the function $L(\cdot)$, which depends on the system failure time and is assumed to be differentiable and increasing. We call this function the recovery function in this article. After the completion of the CP restart, an additional CP must be created to save the current state, and the system operation restarts under the same condition as at the initial point of time $t = 0$. The same cycle repeats again and again over an infinite time horizon.
The problem is to determine the optimal CP sequence $t_1, t_2, \ldots$ maximizing the steady-state system availability, which is expressed in terms of the expected operating cost with $t_0 = 0$. It is evident that the underlying problem reduces to the simple problem of minimizing the expected operating cost. In this problem, the expected recovery cost is usually given as an affine function of the system failure time with given constant coefficients. With a suitable replacement of the above CP cost and recovery cost, the problem is equivalent to the classical inspection problem of Barlow and Proschan [38].
Figure 1 illustrates the configuration of the underlying CP placement with a finite operation-time horizon $T$.
From the analogy to the inspection problem, it can easily be shown that the optimal CP sequence maximizing the steady-state system availability is a non-increasing sequence, in the sense that the CP intervals do not increase, under the assumption that the system-failure time distribution is PF$_2$ (Pólya frequency function of order 2) [38], provided that an optimal CP sequence exists. The optimal sequence must then satisfy the first-order condition of optimality. From this condition, an algorithm to derive the optimal CP sequence, which minimizes the expected operating cost or, equivalently, maximizes the steady-state system availability, can be derived as follows.
Step 3: For the $k$-th CP $t_k$, if $t_{k+1} - t_k > t_k - t_{k-1}$, then decrease $t_1$ and go to Step 2.
Step 4: If $t_{k+1} - t_k < 0$, then increase $t_1$ and go to Step 2.
Step 5: For the resulting CP sequence $t_1 < t_2 < \cdots < t_k$, stop the procedure when the termination criterion holds for a sufficiently small tolerance value $\varepsilon$.
In the above algorithm, arbitrary decreasing and increasing operations in Steps 3 and 4 can be used to speed up the computation. The simplest method would be the bisection search method. In the simplest case, where the system-failure time follows the exponential distribution, it is well known that the optimal CP sequence is periodic, i.e., all CP intervals are equal.
Since the processing time for a given transaction is in general bounded, CP placement over an infinite time horizon may be questionable in many practical applications. As a natural extension of the infinite-time horizon problem, it is interesting to consider the finite operation-time horizon problem, because the infinite-horizon problem is recovered as a special case. Suppose that the time horizon of operation for the file system is finite, say $T$, which can be regarded as a fixed transaction processing time. For a finite CP sequence $t_1, t_2, \ldots, t_N$, the expected operating cost is defined over this horizon. We also suppose that the file system restarts with a fixed CP overhead $c_0$ just after time $T$ if no system failure occurs. The maximization of the steady-state system availability again reduces to the minimization of the expected operating cost. It should be noted that the recovery cost does not occur at time $T$. In what follows, the number of CPs $N$ is treated as given.
Since the finite operation-time horizon problem involves a constraint on the number of CPs, it is impossible to apply the forward CP placement algorithm for the infinite operation-time horizon problem directly. However, by adjusting the value of $N$, we can develop a similar algorithm to compute the optimal CP sequence. The basic idea is to utilize the non-increasing property of the CP sequence under the PF$_2$ assumption for an arbitrary number $N$. Based on the result for an infinite time horizon [30,31,37], we modify the forward CP placement algorithm as follows.
Forward CP Placement Algorithm for a Finite Operation-Time Horizon [30,31]:
Step 1: Set the lower and upper bounds of by and , respectively.
Step 3: For , compute the CP sequence by , , , , and Go to Step 2.
Step 4.2: If , then and Go to Step 2.
Step 5: For an arbitrary tolerance level , if , then and Go to Step 2.
Step 6: For an arbitrary tolerance level , if , then and Go to Step 2.
For every candidate number of CPs $N$, we calculate the expected operating cost using the above algorithm, and determine both the optimal number of CPs, $N^*$, and its associated CP sequence. It should be noted that the above two algorithms are valid only when the system-failure time distribution is PF$_2$ and the resulting CP sequence is non-increasing, i.e., $t_{k+1} - t_k \le t_k - t_{k-1}$. These algorithms derive the optimal CP sequence very quickly, but they depend strongly on the initial value $t_1$. In the worst case, they are unstable and the resulting CP sequence may not converge to the optimal solution. To overcome this point, a careful selection of the initial value $t_1$ is needed, so we improve the procedure with the following algorithm.
Improved Forward CP Placement Algorithm for a Finite Operation-Time Horizon: Step 1: Set 1 , , and the upper bound of serach range .
Step 2: Set and V .0 Step 2.1: Step 2.2: For , compute satisfying Step 2.3: Compute the corresponding expected operating cost and set it as j V based on .
Step 2.4: Return the minimum expected operating cost and its associated CP sequence.
Since the initial value $t_1$ in the above algorithm can be adjusted gradually from 0, the stability of the original forward CP placement algorithm is improved to some extent. However, under some settings the solution may still fall into a local minimum, and even the improved algorithm may fail to converge. In our numerical experiments, the search for the initial value $t_1$ was sometimes unsuccessful. In addition, the computation cost of the improved algorithm is obviously much larger than that of the original forward CP placement algorithm. In the following section, we introduce a computationally more stable algorithm.
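The search over the initial value $t_1$ described in this section can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: since the first-order condition of this article is not reproduced here, it uses the classical inspection-problem recursion of Barlow and Proschan [38], $t_{k+1} - t_k = \{F(t_k) - F(t_{k-1})\}/f(t_k) - \rho$, as a stand-in, where the cost ratio `ratio` ($\rho$), the Weibull parametrization, the survival threshold `eps`, and all function names are hypothetical choices.

```python
import math

def sf(t, shape, scale):
    """Weibull survival function 1 - F(t) (assumed parametrization)."""
    return math.exp(-((t / scale) ** shape))

def pdf(t, shape, scale):
    """Weibull density f(t) for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0) * sf(t, shape, scale)

def build_sequence(t1, ratio, shape, scale, eps=1e-6, max_steps=100_000):
    """Generate CP times from t1 with the stand-in recursion.

    Returns a verdict and the sequence built so far:
    'decrease' if an interval grew (t1 too large),
    'increase' if an interval became non-positive (t1 too small),
    'ok' once the survival probability at the last CP is below eps.
    """
    seq, prev, cur = [t1], 0.0, t1
    for _ in range(max_steps):
        if sf(cur, shape, scale) < eps:
            return "ok", seq
        mass = sf(prev, shape, scale) - sf(cur, shape, scale)  # F(cur) - F(prev)
        nxt = mass / pdf(cur, shape, scale) - ratio            # next interval length
        if nxt > cur - prev:
            return "decrease", seq
        if nxt <= 0.0:
            return "increase", seq
        prev, cur = cur, cur + nxt
        seq.append(cur)
    return "ok", seq

def forward_search(ratio, shape, scale, upper, iters=200):
    """Bisection on t1 over (0, upper), driven by the verdicts above."""
    lo, hi, seq = 0.0, upper, []
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        verdict, seq = build_sequence(mid, ratio, shape, scale)
        if verdict == "ok":
            return seq
        if verdict == "decrease":
            hi = mid
        else:
            lo = mid
    return seq  # best effort if no t1 passed every check

if __name__ == "__main__":
    cps = forward_search(ratio=0.01, shape=1.0, scale=1.0, upper=1.0)
    print(cps[:3])
```

With an exponential failure time (`shape=1.0`) the returned intervals are nearly constant, in line with the periodic solution mentioned above. For long sequences the recursion amplifies small errors in $t_1$, which is one way to see why the forward algorithms depend so strongly on the initial value.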

Backward CP Placement Algorithm
For the same aperiodic CP placement problem, Naruse et al. [39,40] propose to solve the optimality condition in a backward manner. For a given number of CPs, the optimal CP sequence has to satisfy the first-order conditions of optimality, and should be the solution of the resulting system of simultaneous equations. Although this algorithm does not depend on the PF$_2$ property, it is not feasible for a large number of CPs, because the number of simultaneous equations explodes as the number of CPs increases. In fact, the authors in [40] present only a toy problem with a very small number of CPs.
The most practical backward algorithm was given by Iwamoto et al. [33], and is based on the well-known dynamic programming (DP). Since this algorithm does not depend on the PF$_2$ property either, it is applicable even to more general failure time distributions. During the time period between two successive CPs, the expected operation time and the mean time length of one cycle are given by Equations (7) and (8), respectively, where one cycle is defined as the time interval between two successive renewal points. In Equations (7) and (8), $F(\cdot \mid \cdot)$ represents the conditional probability distribution. At the end of the operation time $T$, the above expressions are rewritten accordingly.
From the principle of optimality, we obtain the DP equations, whose unknowns are the maximum steady-state system availability and the relative value functions. The derivation of the optimal CP intervals is equivalent to finding the values which satisfy the DP equations. Following Iwamoto et al. [33], we apply the policy iteration algorithm, which is effective for solving this type of functional equation. For convenience, an auxiliary function is defined in place of the original function $w(\cdot)$. The DP-based CP placement algorithm is then given as follows.
Backward CP Placement Algorithm [33]:
Step 3: Solve the associated optimization problems to update the CP sequence.
Step 4: If the convergence criterion is satisfied within an error tolerance $\varepsilon$, stop the algorithm; otherwise, let $i := i + 1$ and go to Step 2.
In Step 2 of the above algorithm, we have to calculate the relative value functions. From the original DP Equations (12) and (13), we find that the relative value functions under a fixed policy must satisfy the linear equation (21), which is written in terms of the elements of a coefficient matrix and the transpose of the vector of unknowns. Without loss of generality, the first relative value function is set to zero in the above algorithm.
For both forward and backward CP placement algorithms, it is essential to determine the number of CPs, $N$, during the finite operation-time horizon. In other words, if a good initial value of $N$ is known in advance, the optimal number can easily be found with any low-cost search technique. In the following sections, we introduce two approximate algorithms for the finite operation-time horizon problem.
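The low-cost search over the number of CPs mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of the cited works: it assumes that the expected operating cost is unimodal in $N$ and that a callable `cost(n)` is available (for instance, an exact forward or backward algorithm run for a fixed $N$). The names `refine_cp_count`, `cost`, `n0`, `n_min`, and `n_max` are hypothetical.

```python
def refine_cp_count(cost, n0, n_min=1, n_max=10_000):
    """Walk from the initial guess n0 toward lower cost.

    Assumes cost(n) is unimodal in n, so the walk stops at the minimizer.
    Returns the pair (best_n, best_cost).
    """
    n = min(max(n0, n_min), n_max)
    best = cost(n)
    while True:
        neighbours = [m for m in (n - 1, n + 1) if n_min <= m <= n_max]
        if not neighbours:
            return n, best
        m = min(neighbours, key=cost)  # cheaper of the two neighbours
        c = cost(m)
        if c >= best:                  # no neighbour improves: local optimum
            return n, best
        n, best = m, c

if __name__ == "__main__":
    # Toy cost with its minimum at n = 7 (illustrative only).
    print(refine_cp_count(lambda n: (n - 7) ** 2 + 3, n0=4))
```

The closer the initial guess `n0` is to the optimum, the fewer evaluations of the expensive exact algorithm are needed, which is the motivation for obtaining a good starting value from an approximation.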

Constant Hazard Approximation
If the time interval between two successive CPs, $t_{k+1} - t_k$, is sufficiently short, the system-failure probability during the time interval can be approximately considered as a constant. Kaio and Osaki [26] approximate the expected operating cost under this assumption. Here we derive the same result as [26] in a different way. Let $X$ be the system-failure time having the probability distribution $F(t)$. For an arbitrary probability, define the CP sequence satisfying the quantile condition that the conditional failure probability over each CP interval equals that probability. After a few algebraic manipulations, the expected operating cost can be represented as a function of this probability. By minimizing the expected operating cost with respect to the probability and substituting the optimal value into the quantile condition, an aperiodic CP sequence is approximately derived. For this approximate algorithm, we need to determine the number of CPs in advance. Also, even if the exact number of CPs is known, the approximate algorithm does not guarantee an exactly optimal CP sequence.
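The quantile construction can be made concrete with a short sketch. It assumes that the constant conditional failure probability, written `p` here (a hypothetical name), satisfies $\{F(t_{k+1}) - F(t_k)\}/\{1 - F(t_k)\} = p$, so that $1 - F(t_k) = (1-p)^k$, and it assumes a Weibull distribution $F(t) = 1 - \exp\{-(t/\theta)^{\gamma}\}$. The minimization of the expected operating cost over `p` is not shown, since the cost expression is not reproduced here.

```python
import math

def constant_hazard_sequence(p, horizon, shape, scale):
    """CP times t_k with 1 - F(t_k) = (1 - p)**k, kept strictly below horizon.

    Assumes a Weibull failure time with the given shape and scale.
    """
    times = []
    k = 1
    while True:
        # -log(1 - F(t_k)) = -k * log(1 - p) = (t_k / scale) ** shape
        t_k = scale * (-k * math.log1p(-p)) ** (1.0 / shape)
        if t_k >= horizon:
            return times
        times.append(t_k)
        k += 1

def conditional_failure_prob(a, b, shape, scale):
    """P(failure in (a, b] | survival up to a) for the same Weibull model."""
    s_a = math.exp(-((a / scale) ** shape))
    s_b = math.exp(-((b / scale) ** shape))
    return (s_a - s_b) / s_a

if __name__ == "__main__":
    cps = constant_hazard_sequence(p=0.05, horizon=15.0, shape=2.0, scale=10.0)
    print(len(cps), cps[:3])
```

With `shape=1.0` the construction returns equally spaced CP times, which matches the periodic solution for exponential failure times.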

Fluid Approximation
The next approximate algorithm focuses on the determination of the number of CPs. Let $n(t)$ be the average frequency of CP placement at time instant $t$. Then the time interval between two successive CPs at time $t$ is approximately given by $1/n(t)$, and the expected operating cost over an infinite operation-time horizon is approximately expressed as a functional of $n(t)$. The optimization problem with an infinite operation-time horizon then reduces to a variational calculus problem. On the other hand, in the case with a large operation-time horizon, Ozaki et al. [30,31] assume that the probability of a system failure is negligible even if the file system survives beyond the time horizon, and derive the average frequency of CP placement with a control parameter that is determined so as to satisfy a boundary condition. Naruse et al. [40] also propose a modified average frequency of CP placement that involves the integer part of the cumulative frequency. Hence, the optimal aperiodic CP sequence is determined from either of these average frequencies. Substituting the approximate CP sequence yields the corresponding approximate expected operating cost. As mentioned before, neither of the two approximate algorithms guarantees an exactly optimal CP sequence. However, it is worth mentioning that the number of CPs given in Equation (28) is very close to the exact number of CPs. By setting it as the initial value of $N$ in the forward or backward CP placement algorithm and adjusting its integer value via a simple bisection method, we can find the number of CPs placed up to the finite operation time $T$.
The main difference between the constant hazard approximation and the fluid approximation is that the latter also determines the number of CPs. For a given $N$ and $T$, both forward and backward algorithms are applicable. By combining the fluid approximation with the forward or backward CP placement algorithm, it is possible to speed up the computation of the optimal CP sequence.
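How a CP frequency is turned into a CP count and a CP sequence can be sketched as below. The sketch assumes the usual fluid-approximation convention that the $k$-th CP time solves $\int_0^{t_k} n(t)\,dt = k$ and that the number of CPs is the integer part of the cumulative frequency over $[0, T]$. The frequency `keller_type_frequency` is a square-root form borrowed from the classical inspection problem and is used only as a placeholder; it is not the expression derived in this article, and all function and parameter names are hypothetical.

```python
import math

def cumulative_frequency(freq, t, steps=2000):
    """Midpoint-rule approximation of the integral of freq over [0, t]."""
    if t <= 0.0:
        return 0.0
    h = t / steps
    return h * sum(freq((i + 0.5) * h) for i in range(steps))

def fluid_sequence(freq, horizon):
    """Return (number of CPs, CP times) implied by the frequency freq."""
    n_cp = math.floor(cumulative_frequency(freq, horizon))
    times = []
    for k in range(1, n_cp + 1):
        lo, hi = 0.0, horizon
        for _ in range(60):  # bisection on the increasing cumulative frequency
            mid = 0.5 * (lo + hi)
            if cumulative_frequency(freq, mid) < k:
                lo = mid
            else:
                hi = mid
        times.append(0.5 * (lo + hi))
    return n_cp, times

def keller_type_frequency(shape, scale, c0):
    """Placeholder frequency sqrt(r(t) / (2 * c0)) with Weibull hazard r(t)."""
    def freq(t):
        hazard = (shape / scale) * (t / scale) ** (shape - 1.0)
        return math.sqrt(hazard / (2.0 * c0))
    return freq

if __name__ == "__main__":
    n_cp, cps = fluid_sequence(keller_type_frequency(2.0, 10.0, 0.003), 2.0)
    print(n_cp, cps)
```

The integer `n_cp` is the kind of quantity that can serve as the initial value of $N$ for an exact algorithm, while `cps` is the approximate sequence itself.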

Numerical Examples
We numerically calculate the optimal CP sequence and the corresponding steady-state system availability. Suppose that the failure time obeys the Weibull distribution $F(t) = 1 - \exp\{-(t/\theta)^{\gamma}\}$ with shape parameter $\gamma$ and scale parameter $\theta$. In this case, the failure (hazard) rate is $r(t) = (\gamma/\theta)(t/\theta)^{\gamma-1}$. For the operation-time horizon $T = 10, 15, 20$, we calculate the optimal CP sequence with an exact solution algorithm (the forward or backward CP placement algorithm) and the two approximate algorithms, and derive both the number of CPs and the steady-state system availability. When $\gamma < 1$, the system-failure time distribution is strictly DFR (Decreasing Failure Rate) and is not PF$_2$. Hence we apply only the backward CP placement algorithm in this case. In the PF$_2$ case, the two exact solution algorithms provide exactly the same results, where the number of CPs is adjusted from the initial value given in Equation (28). For the other model parameters, we set $c_0 = 0.003$ and fix the remaining parameters.
In the DFR case (a) with $\gamma = 0.5$, the optimal CP time behaves as a convex function of the CP index for both exact and approximate methods. It can be seen that the two approximate methods perform poorly except around the 14th CP. In the CFR (Constant Failure Rate) case (b) with $\gamma = 1.0$, the optimal CP time becomes a linear function, so all the methods give almost the same periodic CP time sequence. In the strict IFR (Increasing Failure Rate) case (c) with $\gamma = 2.0$, the optimal CP time is a concave function of the CP index, and the two approximate methods provide values rather close to the exact solution. In Figures 3 and 4, we show the optimal CP time sequence for longer operation-time horizons. As the finite operation time becomes longer, the constant hazard approximation tends to move away from the exact solution when the system-failure time distribution is strictly IFR. On the other hand, the fluid approximation gives a CP time sequence very similar to the exact solution. However, in Figure 3(a), the fluid approximation yields an optimal CP time sequence slightly different from the exact solution. In other words, the computational accuracy of the two approximate algorithms becomes worse as the shape parameter deviates further from $\gamma = 1.0$. In Figure 5, we investigate the dependence of the optimal aperiodic CP time on the scale parameter and the operation time in the strict IFR case. Looking at (a) to (f), only the constant hazard approximation behaves differently from the exact solutions.
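The three failure-rate regimes discussed above follow directly from the Weibull hazard rate. The short sketch below assumes the parametrization $F(t) = 1 - \exp\{-(t/\theta)^{\gamma}\}$, so that $r(t) = (\gamma/\theta)(t/\theta)^{\gamma-1}$; the function names are illustrative only.

```python
def weibull_hazard(t, shape, scale):
    """Hazard rate r(t) of the Weibull distribution for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def failure_rate_class(shape):
    """Classify the Weibull failure rate by its shape parameter."""
    if shape < 1.0:
        return "DFR"
    if shape == 1.0:
        return "CFR"
    return "IFR"

if __name__ == "__main__":
    for shape in (0.5, 1.0, 2.0):
        rates = [weibull_hazard(t, shape, 10.0) for t in (1.0, 5.0, 10.0)]
        print(shape, failure_rate_class(shape), rates)
```

A decreasing hazard favours placing CPs densely at the start and more sparsely later, a constant hazard gives equally spaced CPs, and an increasing hazard favours CPs that become denser over time, which corresponds to the convex, linear, and concave shapes described for cases (a), (b), and (c).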
Next, we compare the two approximation methods with the exact computation more precisely in terms of the steady-state system availability. In Table 1, we present the steady-state system availability and the number of CPs, where the values for the approximate methods are calculated from Equations (26) and (29). Tables 1 and 2 present the dependence of the steady-state system availability on the shape and the scale parameters, respectively. When the shape parameter $\gamma$ increases, the system tends to fail as the operation time goes on, but the system availability does not always decrease in Table 1. In this case, the number of CPs does not always increase either. When the scale parameter $\theta$ increases, the mean time to system failure (MTTSF) also increases and the steady-state system availability is expected to increase. Table 2 confirms this intuitive observation as well as the decreasing trend of the number of CPs. If we compare the steady-state system availability calculated by the exact solution algorithm with the other ones, the relative error of both approximate methods is of the order of 0.01%. In particular, the reason why the constant hazard approximation works well is that it increases the number of CPs so as to increase the system availability. This implies that even the constant hazard approximation provides a good approximation of the maximum system availability. On the other hand, the number of CPs in the fluid approximation is also close to the exact one. From these numerical examples, it can be concluded that if the steady-state system availability is to be evaluated with high accuracy, such as four or five nines, the exact solution algorithms need to be applied, with the initial value of the number of CPs decided by the fluid approximation. Otherwise, i.e., if the three-nines level is sufficient for calculating the steady-state system availability, the fluid approximation provides a rather good CP schedule.
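The link between a relative error in availability and the number of "nines" can be checked with a small calculation. The sketch below is illustrative only: the helper names are hypothetical and the input values are made up for the example, not taken from Tables 1 and 2.

```python
import math

def nines(availability):
    """Number of nines of an availability value, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - availability)

def relative_error(approx, exact):
    """Relative error of an approximate availability against the exact one."""
    return abs(approx - exact) / exact

if __name__ == "__main__":
    exact, approx = 0.9990, 0.9989  # made-up values for illustration
    print(nines(exact), nines(approx), 100.0 * relative_error(approx, exact))
```

A relative error of the order of 0.01% changes the availability in the fourth decimal place, so it is immaterial at the three-nines level but not when four or five nines must be resolved.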

Conclusion
In this article we have introduced several exact and approximate algorithms to create the aperiodic checkpoint schedule maximizing the steady-state system availability when the file system operation terminates at a fixed time horizon. Since the determination of the number of checkpoints within the finite operation-time period is an essential problem, we have combined the fluid approximation with the exact solution algorithm. In numerical examples with a Weibull system-failure time distribution, we have calculated the optimal aperiodic checkpoint sequence under different parametric circumstances. It has been shown that the algorithm combined with the fluid approximation can effectively calculate the exact solutions for the optimal aperiodic checkpoint sequence.

Figure 1. Configuration of the aperiodic CP placement with a finite operation-time horizon T.

When the recovery cost function $L(\cdot)$ is of the affine form, i.e.,

Figure 2 depicts the optimal CP time sequence with different shape parameters.

Figure 3. Aperiodic CP placement with different shape parameters for T = 15. (a) Case 1: γ = 0.5 and θ = 10; (b) Case 2: γ = 1.0 and θ = 10; (c) Case 3: γ = 2.0 and θ = 10.

for varying failure parameters $(\gamma, \theta)$ when the three algorithms are used. For the approximate algorithms, the steady-state system availability is calculated by substituting each approximate CP sequence into Equation (5), so that
By solving the corresponding Euler equation, we have the optimal CP frequency