Model for Anticipating Failures by Omission in Calculation Grids

Computer grids are infrastructures in which heterogeneous and distributed resources offer very high computing or storage performance. If they offer ex-treme computing performance, they are also subject to the appearance of many failures related to this type of architecture. While performing tasks, if the response time of a node in the system incomprehensibly exceeds the re-quirements of the specifications, the node experiences an omission failure. The task running in the failed node will be unavailable until the node resumes normal activity. Waiting not being a possible solution, many fault tolerance methods have been proposed. Despite this large number of fault tolerance methods on offer, computer grids are still prone to many failures by omission. In this work, a numerical study of the failures by omission which occur in the calculation grids during the execution of the tasks was carried out and a model allowing anticipating its failures was proposed with the formalism PDEVS (Parallel Discret EVent system Specification).


Introduction
The need to have more and more computing or storage power for projects has pushed humans to create increasingly efficient infrastructures [1]. Since the appearance of the first computers, passing by supercomputers, then clusters to computer grids, the objective has always been the search for ever higher performance. Computer networks are today the most efficient infrastructures in terms of power [2]. IT grid is a virtual infrastructure made up of a set of potentially

Modeling and Simulation of a Computer Grid
To model and simulate omission failures in computational grids, we use the PDEVS formalism to represent 100 nodes. In the following we present a calculation grid and the formalism used

Description of a Computer Grid
Ian Foster and Carl Kesselman were the first to use the word "grid" in their book "The Grid: Blueprint for a New Computing infrastructure" published in 2003 [11]. They compare a computer network to an electrical network in terms of the availability of resources. In an electrical network, all you have to do is plug an outlet into an outlet and electricity is supposed to flow. In a computer grid, it is enough to connect to the network to have the resources of the grid. a computer grid is defined as a set of shared, distributed, heterogeneous, delocalized and autonomous hardware and software resources in which large-scale scientific problems can be solved [6] [12]. The life cycle of a job in a grid IT begins when the user obtains a certificate from a certificate authority trusted by the grid. The user then registers in a virtual grid organization and obtains their user certificate [13]. The user can then connect to the platform from a particular machine acting as a user interface from which he can launch a job. This job is submitted from the user interface to the job management system (WMS) through its "tache" input port. The WMS searches for the calculation elements (CE) or nodes available to process the job. The system if job finds the grid at rest, ie in a "attente" state, it goes into an "active" state during which it processes the job. If the job finds the system in an "active" state, the job is just waiting for its turn to be processed. If a fault by omission is committed during the execution of the job. For example, jobs are sent to a node faster than it can process them, the node gets congested and an omission failure occurs on that node and its state goes to omission until the node decongests itself [14]. Once the processing of the parts of the job is finished, the WMS reconstructs the processed task and the processed job is returned by the "error" port [12].

The PDEVS Formalism
To model and simulate failures by omission in computing grids, we used the PDEVS formalism. A computing grid being a very complex autonomous system seen as a single computer but which is a connection of several computers which share their resources in order to provide a very high power [15]. This modeling cannot be done with standard modeling formalisms because the latter are less suited to the modeling of complex and autonomous systems [9]. The modeling was done with the PDEVS modeling formalism. The PDEVS formalism allows modeling of causal and deterministic systems [10]. Its role is to provide a simple technique of parallelization of calculations [13]. A PDEVS atomic model is based on continuous time, inputs, outputs, states and functions [16]. More complex models are built by connecting several atomic models in a hierarchical fashion. The interactions are ensured by the input and output ports of the models, which favors modularity [17]. A computing grid is a set of interconnected nodes with the same goal. To model a computer grid, we therefore start by proposing the model of a node as an atomic model. Then we propose a grid model as being a coupled model which interconnects atomic models.

Model Description with Formal Specifications
The following model presents the formal specifications of an atomic model with PDEVS [18], which is an extension of the DEVS formalism [19] , , , , , where Pe and Vx are two finite sets representing the set of ports and input values; • the set of output ports and values: where Ps and Vy are two finite sets representing the set of output ports and the values carried by the events generated at the output; • the set of sequential states: where, X b all input bags belonging to X, and Q the total state set, , e is the time elapsed since last transition; • the confluent transition function where Pe and Vx are two finite sets representing the set of ports and input values; • the set of output ports and values: where Ps and Vy are two finite sets representing the set of output ports and the values carried by the events generated at the output; • the set of the component names: • The selection function allowing to resolve the model activation conflict:

Graphic Model
The model shown in Figure 1 shows a description of the graphical model with PDEVS.

Simulation
For the simulations, we use the devs-suite simulator version 4.0.0 developed by the arizona center for modeling and integrative simulation using DEVSJAVA as a language [20]. To simulate the model below, 100 nodes were used to model a calculation grid with the parameters listed in Table 1:  The 100 nodes are distributed as in the table above. For this modelization, it is considered that the distribution of tasks is perfectly balanced [16] that is to say:

Modeling a Correct Node
A good node is a node that cannot receive failures. The specifications of such a node are shown in Figure 2.
The model starts in an initial "attente" state. when it receives a job via the "tache" port which is an external transition (continuous arrows), the node goes into an "active" state during which the job is processed. At the end of processing, the processed job is returned by the "resultat" port. If several jobs have been sent to the node, the jobs are processed one after the other. Once a job has been processed, the node continues to process the next job thanks to an internal transition (interrupted arrows). If during the processing of a job, another job is sent by the "tache" port, a confluence function will select which of the jobs will be executed between the internal and external transition. The representation of The simulation of Figure 4 is structured as follows: graph a represents the time of the last event, that is to say the time of the state being processed. Graph b represents the execution time of the next event. The graph c-1 represents the evolution of the states of the system during the execution of the jobs and the graph c-2 represents the state of the system before the arrival of the jobs. The graphs d and e represent the two input ports "tache" and "faute", when a job is submitted for processing to the grid, it enters by the port "tache" and when a fault is committed it enters by the "faute" Port. The graphs f and g respectively represent the output ports "resultat" and "erreur". When a job is successfully processed, the result is returned through the "resultat" port and when the system encounters a failure, the result is returned through the "erreur" port. The job that are sent to the node are run for 100 seconds. The first tasks that are sent to the node are the first to be executed. The system boots into an "attente" state. When a transition occurs, the node executes the job for the defined execution time. When processing is complete, the corresponding output port takes the corresponding value. In this node, no failure ever occurs so no matter what port entered, the out port "resultat" will always receive an out value on each transition. In the rest of this work, we model a node that can receive failures by omission.

Modeling a Failed Node
A node may receive failures by omission from time to time. The specifications of such a node capable of receiving failures by omission are shown in Figure 5 and A failed node is a good node until a failure occurs, in which case a failure by omission. in this model, the node is in an "active" state until an omission failure occurs and the node enters an omission state. The node remains for a random period which can range from less than a second to infinity. After this period during which the node has remained in an "omission" state, it resumes its normal service and passes to the "active" state. The simulation of Figure 7 represents that of a model failing by omission.
Jobs sent to the node run for 100 seconds. The first jobs sent to the node are the first to be executed. The system starts up in an initial "attente" state. When a transition occurs, the node executes the task for the defined execution time.
When processing is complete, the corresponding output port takes on the corresponding value. In this node, when failure by omission occurs in the node (red band), either by overheating of the processor or by overload of the RAM, the node goes into an "omission" state for a random period. In this simulation the failure by omission lasts 90 seconds. After this period, the node resumed its activity as it should. Task 4 will therefore run for 190 seconds and there is now a 90 second lag on all subsequent processed jobs.

Modeling of a Calculation Grid
A computing grid is a set of heterogeneous, interconnected nodes considered as a single computer to provide great computing power. The modeling of a computing grid is therefore, with the PDEVS formalism, a coupled model where the various atomic models which constitute it are the various nodes. The formal specifications of a grid are presented in Figure 8.
In this modelization, the model N represents the calculation grid. Jobs can be submitted through two ports to the grid, "stain" and "fault". The submitted jobs arrive at the "WMSI" task manager, which is responsible for finding available nodes and sending parts of the job to them. Once the jobs have been processed by the various nodes, the parts of the job are sent back to the "WMSO" task manager to be reconstructed, restored and sent back to the output port. The simulation of this model is shown in Figure 9.
In this simulation of a computing grid, every 150 seconds the system makes sure that the nodes are still all connected to the network (green bands) by exchanging messages while performing a system backup. In the event of a failure (red band), a fault tolerance technique is applied to the next message exchange. In the case of our simulation during the processing of job5, a failure occurs. failure detection occurs at t = 450 sec. The fault tolerance technique is applied and the job is processed. In this simulation, the restart recovery caused by node failure is general, that is to say all nodes are restored. Of which at t = 500, the system restarts where the failure occurred.

Proposed Correction Model
In this part we propose a model which makes it possible to anticipate failures by omission in the calculation grid. Unlike the current method of performing system backups at a set frequency and restoring the last backup in the event of a failure, this model looks for signs that can lead to omission failure in every node. The principle is to observe the components of each node using the node's sensors and exchanged messages. For example, if the temperature sensor of the processor signals a temperature which is approaching due to the temperature of the processor chassis, the failure is reported to the task manager who will no longer send jobs to the node likely to have a failure, the model is therefore represented in Figure 10.   The formal description of this model is presented in Figure 11. The above model presents a model which allows failures by omission to be anticipated. Once the node begins to receive jobs, it goes into the "active" state. During job processing, if the node begins to receive too many jobs beyond what it can handle, the RAM occupancy rate increases to saturation causing failure by omission. Or if, during job processing, the processor temperature rises to approach the temperature of the processor chassis, the node temporarily stops operating indefinitely and causes an omission failure. These values being measured continuously, the principle is to send information on the temperature of the processor and the rate of occupation of the RAM to the task manager which will determine whether it should always forward jobs to this node or not based on the state of RAM and processor. In general, a safe range during normal workload is between 40˚C and 65˚C (or 104˚F -149˚F) [17]. The Intel Xeon E5-2630L v4 processor has a chassis temperature of Tcase = 62˚C. In this model, if the processor temperature reaches 62˚C, the node reports an "omitted state". Or if the RAM occupancy rate reaches 90% of the usable RAM, the node reports the failure and the grid task manager will no longer send tasks to the node ( Figure 12).
This simulation is that of a calculation grid which makes it possible to anticipate failures by omission. In this simulation, three nodes are likely to be bad and the rest are good nodes. If one of the nodes goes down, its processor temperature rises 15 degrees after each task. In this simulation, if the temperature of the Tcase processor chassis minus the T processor temperature is less than 10 degrees (Tcase-T ≤ 10) then a failure is reported in the node and the message is transmitted to the task manager which will ignore thereafter. Here this case occurs after the processing of job4, a node is reported as bad and the task manager  ignores it and the system continues to execute the jobs. Unlike other fault tolerance protocols which wait for the failure to occur before tolerating it, this model allows you to anticipate the failure by probing the various components of the node and determining if there may be a failure. This detection time and recovery of the system are almost zero with this model. But this model is only effective if the failure is normal, ie failures which are not linked to the attacks.

Conclusions
Failures by omission in computational grids have been studied numerically in the present work. The numerical methodology is based on the PDEVS modeling formalism in the devs-suite simulator version 4.0.0. The main results are summarized as follows: 1) We first present a correct node model which is a node which cannot receive failures and a node model which can receive failures by omission.
2) Next, we present a computational grid model which is a set of interconnected nodes. This grid which is a set of correct nodes and nodes that can receive failures by omission.
3) Finally, we propose a model which makes it possible to anticipate failures by omission in the calculation grids. This model is based on the behavior of certain components of the node to anticipate failures. Using this model will help limit omission failures that are caused by the hardware components of the node.