Fault Tolerance Mechanisms in Distributed Systems

Computer systems today are interconnected over a variety of communication media, and distributed systems have become part of everyday activities through the distribution of data. Distributed systems allow nodes to organise and share their resources among connected systems and devices, giving users access to geographically distributed computing facilities. At the same time, a distributed system can suffer a loss of service availability when failures occur at multiple points. This article surveys the fault tolerance mechanisms used in distributed systems to guard against such multi-point failures, focusing on replication, redundancy, and high availability of distributed services.


Introduction
A faulty system can cause human and economic loss in domains such as air and rail traffic control and telecommunications; a reliable fault tolerance mechanism reduces these risks to a minimum. In distributed systems, faults or failures are typically partial. A partial failure in a distributed system is less critical because the entire system is not brought down. For example, in a system with more than one processing core (CPU), if one core fails the system does not stop functioning as it would if that were the only core; the remaining cores continue to function and process data normally. In a non-distributed system, by contrast, when one component stops functioning, the entire system or program malfunctions and the corresponding processes stop.
Fault tolerance is the dynamic method used to keep interconnected systems working together and to sustain reliability and availability in distributed systems. Hardware and software redundancy are the best-known fault tolerance techniques in distributed systems. Hardware methods add extra hardware components such as CPUs, communication links, memory, and I/O devices, while software fault tolerance methods include specific programs to deal with faults. An efficient fault tolerance mechanism helps in detecting faults and, where possible, recovering from them.
There are various definitions of fault tolerance. In practice, replication is the general fault tolerance method typically used to protect against system failure [1] [2]. Sebepou et al. highlighted three major forms of replication mechanism [1] [2]: the State Machine, Process Pairs, and Roll Back Recovery.

1) State Machine
In this mechanism, the process state of a computer system is replicated on autonomous computer systems at the same time. All replica nodes process data in a matching way, their processing is coordinated among the replica nodes, and every input is sent to all replicas at the same time [2] [3]. Active replication is an example of the state machine approach [3] [4].
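The state machine idea above can be sketched in a few lines: every replica applies the same deterministic commands in the same order, so all replica states stay identical. This is a minimal illustration only; the class and function names are ours, not from any particular system.

```python
# Minimal sketch of state-machine (active) replication: every replica
# applies the same deterministic commands in the same order, so all
# replica states remain identical. All names are illustrative.

class Replica:
    def __init__(self):
        self.state = 0  # replicated state: a simple counter

    def apply(self, command):
        # Commands must be deterministic for replica states to converge.
        if command[0] == "add":
            self.state += command[1]

def broadcast(replicas, command):
    # Every input is delivered to all replicas in the same order.
    for r in replicas:
        r.apply(command)

replicas = [Replica() for _ in range(3)]
for cmd in [("add", 5), ("add", 2)]:
    broadcast(replicas, cmd)

states = [r.state for r in replicas]
# All replicas agree; if one crashes, the survivors hold the same state.
```

Because every replica already holds the current state, a client can obtain the result from any surviving replica after a failure.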
2) Process Pairs
Process pairs function like a master (primary)/slave (secondary) link in replication coordination. The primary workstation acts as the master and transmits its corresponding input to the secondary node, and both nodes maintain a reliable communication link [3]- [5].
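The primary/secondary relationship can be illustrated with a small sketch, assuming the primary forwards every input to the secondary so that the secondary can take over on failure. The names and failover logic here are our own simplification.

```python
# Minimal sketch of the process-pairs idea: the primary executes each
# input and forwards it to the secondary, which mirrors the state and
# can take over if the primary fails. All names are illustrative.

class Node:
    def __init__(self):
        self.state = []

    def process(self, item):
        self.state.append(item)

class ProcessPair:
    def __init__(self):
        self.primary = Node()
        self.secondary = Node()
        self.primary_alive = True

    def submit(self, item):
        if self.primary_alive:
            self.primary.process(item)
            # The primary forwards every input to the secondary.
            self.secondary.process(item)
        else:
            # Failover: the secondary continues from the mirrored state.
            self.secondary.process(item)

pair = ProcessPair()
pair.submit("a")
pair.submit("b")
pair.primary_alive = False   # simulate a primary crash
pair.submit("c")
# The secondary holds the full history despite the failure.
```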
3) Roll Back Recovery (Check-Point-Based)
This mechanism collects checkpoints periodically and transfers these checkpoint states to a stable storage device or backup nodes. This enables a successful roll back during the recovery process: the process state is reconstructed from the most recent checkpoint [3]- [6].

Distributed System
Distributed systems are systems that share neither memory nor a clock; their nodes connect and relay information by exchanging it over a communication medium. Each computer in a distributed system has its own memory and operating system, and local resources are owned by the node using them, while resources accessed over the network or communication medium are known as remote resources [5]- [7]. Figure 1 shows the communication network between systems in the distributed environment.
In a distributed system, a pool of rules is executed to synchronise the actions of different processes on a communication network, thereby forming a distinct set of related tasks [6]- [9]. Independent computers access resources remotely or locally in the distributed communication environment, and these resources are put together to form a single coherent system. The user in the distributed environment is not aware of the multiple interconnected systems that ensure the task is carried out accurately, and no single system is required to carry the load of the entire system in processing a task [8] [9].

Distributed System Architecture
The architecture of a distributed system is built on existing operating system and network software [8]. A distributed system encompasses a collection of self-sufficient computers linked via a computer network and distribution middleware. The distribution middleware enables the computers to manage and share the resources of the system, making users see it as a single combined computing infrastructure [9] [10]. Middleware is the link that joins distributed applications across different geographical locations, computing hardware, network technologies, operating systems, and programming languages. It delivers standard services such as naming, concurrency control, event distribution, security, and authorization. Figure 2 shows the distributed system architecture, with the middleware offering its services to the connected systems in the distributed environment [10] [11].
In a distributed system, the structure can be a fully connected network or a partially connected network [12]- [15]. As shown in Figure 3, a fully connected network is a network where every node is connected to every other node. The disadvantage of this structure is that adding a new computer physically increases the number of connections at every node, because the network connects node to node. As nodes are added, the number of file descriptors and the difficulty of communication for each node increase heavily. A file descriptor is an abstract indicator used to access a file or other input/output resource, such as a pipe or network connection [15]- [17]. Hence, the ability of the networked systems to continue functioning well is limited by each node's capacity to open file descriptors and to manage new connections. Fully connected networks are nevertheless reliable, because a message sent from one node to another traverses a single link, and when a node or a link fails, the other nodes in the network can still communicate with one another.
In a partially connected network, some nodes have direct links while others do not. Models of partially connected networks include star structured networks, multi-access bus networks, ring structured networks, and tree structured networks; Figures 4-7 illustrate the corresponding networks. Their disadvantages are as follows. In the star structured network, when the main node fails, the entire networked system collapses. In the multi-access bus network, nodes are connected to each other through a communication link, "a bus"; if the bus fails, the nodes can no longer reach one another, and the performance of the network drops as more nodes are added or heavy traffic occurs. In the ring network, each node is connected to at least two other nodes, creating a path for signals to be exchanged between the connected nodes; as new nodes are added, the transmission delay becomes longer, and if a node fails, every other node in the network can become inaccessible. The tree structured network is a network with a hierarchy: each node has a fixed number of nodes attached to it at the sub-level of the tree, and messages transmitted from a parent to a child node go through one link.
For a distributed system to perform and function as designed, it must have the following characteristics: fault tolerance, scalability, predictable performance, openness, security, and transparency.

Fault Tolerance Systems
Fault tolerance is a vital issue in distributed computing: it keeps the system in a working condition when subjected to failure. Its most important point is to keep the system functioning even if one of its parts goes off or becomes faulty [18]- [20].
Fault tolerant systems are closely related to dependable systems. Dependability covers some useful requirements of a fault tolerance system: availability, reliability, safety, and maintainability.
Availability: This is when a system is in a ready state and ready to deliver its functions to its users. A highly available system works at any given instant in time.
Reliability: This is the ability of a computer system to run continuously without failure. Unlike availability, reliability is defined over a time interval instead of an instant in time. A highly reliable system works constantly over a long period of time without interruption.
Safety: This is when a system fails to carry out its processes correctly and its operations are incorrect, but no catastrophic event happens.
Maintainability: A highly maintainable system can also show a great measure of availability, especially if failures can be noticed and fixed automatically.
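The distinction between availability (an instant in time) and reliability (an interval) can be made concrete with the usual steady-state availability ratio, MTTF / (MTTF + MTTR). The figures below are invented purely for illustration.

```python
# Steady-state availability is commonly quantified as
# MTTF / (MTTF + MTTR): the fraction of time the system is in its
# ready state. The figures here are invented for illustration only.

mttf_hours = 999.0   # mean time to failure (illustrative)
mttr_hours = 1.0     # mean time to repair (illustrative)

availability = mttf_hours / (mttf_hours + mttr_hours)   # 0.999
downtime_per_year_hours = (1 - availability) * 24 * 365  # about 8.76 h
```

On these assumed figures, "three nines" of availability still permits nearly nine hours of downtime per year, which is why highly available designs push MTTR down as aggressively as they push MTTF up.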
As we have seen, a fault tolerant system is a system that has the capacity to keep running correctly, properly executing its programs and continuing to function in the event of a partial failure [21] [22], although the performance of the system is sometimes affected by the failure that occurred. Faults can be narrowed down to hardware or software failure (node failure) or unauthorised access (machine error). Errors caused by fault events are separated into categories, namely: performance, omission, timing, crash, and fail-stop [22]- [24].
Performance: the hardware or software components cannot meet the demands of the user.
Omission: components cannot implement the actions of a number of distinctive commands.
Timing: components cannot implement the actions of a command at the right time.
Crash: certain components crash with no response and cannot be repaired.
Fail-stop: the software identifies errors and ends the process or action. This is the easiest case to handle, though its simplicity sometimes deprives it of handling real situations.
In addition to the error timing, three situations can be distinguished: 1) Permanent errors: these cause damage to software components, resulting in permanent damage to the program and preventing it from running; in this case the program is restarted. An example is when a program crashes. 2) Temporary errors: these cause only brief damage to the software component; the damage gets resolved after some time and the software continues to work normally. 3) Periodic errors: these occur occasionally, for example when there is a conflict between two programs run at the same time; to deal with this type of error, one of the programs is exited to resolve the conflict.
Most, if not all, computers have some fault tolerance techniques such as micro-diagnosis [25] [26], parity checking [27]- [29], and watchdog timers [30]- [34]. A partially fault tolerant system has built-in resources to reduce its specified computing capability and degrade to a smaller or lower system, by removing some programs that were previously in use or by reducing the rate at which specified processes are executed. The reduction is due to a decrease or slowdown in the operational hardware configuration, or it may stem from design faults in the hardware.

Basic Concept of Fault Tolerance Systems
Fault tolerance mechanisms can be divided into three levels: hardware, software, and system fault tolerance [34].
Hardware Fault Tolerance: This involves the provision of supplementary backup hardware such as CPUs, memory, hard disks, and power supply units. Hardware fault tolerance can only deliver support for the hardware by providing a basic hardware backup system; it cannot stop or detect errors, accidental interference with programs, program errors, etc. In hardware fault tolerance, computer systems are built that automatically resolve faults occurring in hardware components. This technique often partitions the node into modules, each acting as a fault containment area; each module is backed up with protective redundancy, so that if one module fails, the others can take up its function. There are two approaches to hardware fault recovery, namely fault masking and dynamic recovery [35]- [37].
Fault Masking: This is a redundancy method that fully covers faults within a set of redundant units or components. Identical units carry out the same tasks, and their outputs are voted on to remove errors created by a defective module. The most commonly used fault masking arrangement is Triple Modular Redundancy (TMR), in which the circuitry is triplicated and its outputs are voted on [38] [39]. The voting circuitry can also be triplicated, so that failures of the voters themselves are corrected by the same process. The voting in TMR requires extra hardware, but it enables computations to continue without interruption when a fault occurs, allowing the operating system to continue to be used [40] [41].
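The TMR voting described above can be sketched in software, assuming a simple majority voter over three module outputs; the module and voter names below are ours, and the "faulty" module merely simulates a defect.

```python
# Sketch of Triple Modular Redundancy (TMR): the same computation runs
# on three independent modules and a voter takes the majority output,
# masking a fault in any single module. All names are illustrative.

def majority_vote(a, b, c):
    # With three inputs, any value that appears at least twice wins.
    if a == b or a == c:
        return a
    return b  # either b == c, or all three disagree (unmaskable fault)

def good_module(x):
    return x + 1

def faulty_module(x):
    return x + 1 + 100  # simulated defect: produces a wrong answer

result = majority_vote(good_module(41), faulty_module(41), good_module(41))
# The voter masks the single faulty module; result is the correct 42.
```

Note that a single voter remains a single point of failure, which is exactly why the text mentions triplicating the voting circuitry as well.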
Dynamic Recovery: In dynamic recovery, a special mechanism is essential to discover faults in the units, switch out a faulty module, switch in a spare, and carry out the software actions necessary to restore and continue the computation, such as rollback, initialization, retry, and restart. In a single computer this requires special hardware and software, but in a multicomputer situation the function is carried out by the other processors [42]- [45].
Software Fault Tolerance: This is special software designed to tolerate errors that originate from software or programming errors. Software fault tolerance utilises static and dynamic redundancy methods similar to those used for hardware faults [46]. The N-version programming approach uses static redundancy: independently written programs perform the same function, producing outputs that are voted on at special checkpoints. Another approach is design diversity, which combines hardware and software fault tolerance by deploying a fault tolerant system built with diverse hardware and software in its redundant channels. In design diversity, every channel is intended to carry out the same function, and a mechanism checks whether any channel deviates from the others. The aim of design diversity is to tolerate faults in both hardware and software. This approach is very expensive, and it is used mainly in aircraft control applications.
Note: Software fault tolerance also consists of checkpoint storage and rollback recovery. Checkpoints are like a safe state or snapshot of the entire system in a working state, taken regularly. The snapshot holds all the information required to restart the program from the checkpoint. The usefulness of software fault tolerance lies in creating an application that stores checkpoints regularly for the targeted systems.
System Fault Tolerance: This is a complete system that stores not just checkpoints; it detects errors in applications and stores memory blocks and program checkpoints automatically. When a fault or error occurs, the system provides a correcting mechanism, thereby correcting the error. Table 1 shows a comparison of the three fault tolerance mechanisms.

Replication Based Fault Tolerance Technique
The replication based fault tolerance technique is one of the most popular methods. This technique replicates data on several other systems. In the replication technique, a request can be sent to any one replica system among the other replica systems; in this way, if one or more nodes fail to function, the whole system does not stop functioning, as shown in Figure 8. Replication adds redundancy to a system. The phases of a replication protocol are client contact, server coordination, execution, agreement coordination, and client response. Major issues in replication based techniques are consistency, degree of replica, replica on demand, etc.
Consistency: This is a vital issue in replication techniques. Several copies of the same entity create a consistency problem, because updates can be made by any user. The consistency of data is ensured by criteria such as linearizability [47], sequential consistency, and causal consistency [48]. Linearizability and sequential consistency ensure strong consistency, unlike causal consistency, which defines a weak consistency criterion. For example, the primary backup replication technique guarantees consistency through linearizability, as does the active replication technique.
Degree or Number of Replicas: Replication techniques utilise protocols for replicating data or an object, such as primary backup replication [49], voting [50], and primary-per-partition replication [51]. To attain a high level of consistency, a large number of replicas is needed; if the number of replicas is too low, it affects scalability, performance, and the capability to tolerate multiple faults. To solve the problem of too few replicas, an adaptive replica creation algorithm was proposed in [51].
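The voting protocol mentioned above can be illustrated with a simple majority-quorum sketch: writes must reach a majority of replicas and carry a version number, and reads consult a majority and take the newest version, so every read quorum overlaps every write quorum. The class and its structure are our own simplification, not the protocol of any cited paper.

```python
# Sketch of a simple majority-quorum (voting) replication protocol.
# Writes succeed only on a majority; reads consult a majority and take
# the newest version, so read and write quorums always intersect.
# All names are illustrative.

class VotingStore:
    def __init__(self, n):
        self.replicas = [{"version": 0, "value": None} for _ in range(n)]
        self.quorum = n // 2 + 1

    def write(self, value, reachable):
        # A write must reach a majority of replicas to succeed.
        if len(reachable) < self.quorum:
            return False
        version = max(self.replicas[i]["version"] for i in reachable) + 1
        for i in reachable:
            self.replicas[i] = {"version": version, "value": value}
        return True

    def read(self, reachable):
        if len(reachable) < self.quorum:
            return None
        # The newest version in any majority is the latest committed write.
        newest = max((self.replicas[i] for i in reachable),
                     key=lambda r: r["version"])
        return newest["value"]

store = VotingStore(5)
store.write("v1", reachable=[0, 1, 2])    # majority write succeeds
value = store.read(reachable=[2, 3, 4])   # overlapping majority read
```

Here the read quorum {2, 3, 4} shares replica 2 with the write quorum {0, 1, 2}, which is how the read observes the latest value even though two replicas never saw the write.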

Process Level Redundancy Technique
This fault tolerance technique is often used for faults that disappear without anything being done to remedy the situation; such faults are known as transient faults. Transient faults occur when there is a temporary malfunction in a system component, or sometimes through environmental interference. The problem with transient faults is that they are hard to handle and diagnose, although they are less severe in nature. In handling transient faults, a software based technique such as Process-Level Redundancy (PLR) is used, because hardware based fault tolerance is more expensive to deploy. As shown in Figure 9, PLR creates a set of redundant processes for each application process and compares the processes to ensure correct execution. Redundancy at the process level enables the OS to schedule processes easily across all available hardware resources.
PLR provides improved performance over existing software transient fault tolerance techniques, with a 16.9% overhead for fault detection [53]. PLR uses a software-centric approach, which shifts the focus from guaranteeing correct hardware execution to ensuring correct software execution.
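The core idea of redundant execution with output comparison can be sketched as follows. This is a simplified, sequential illustration of the principle, not the actual PLR system, which runs true OS-level redundant processes; the function names and the simulated fault are ours.

```python
# Sketch of the process-level redundancy principle: run redundant
# copies of a computation and compare their outputs. A transient fault
# in one copy shows up as a mismatch, and the majority result is kept.
# This sequential toy stands in for real redundant OS processes.

from collections import Counter

def redundant_execute(fn, x, copies=3):
    outputs = [fn(x) for _ in range(copies)]
    winner, votes = Counter(outputs).most_common(1)[0]
    fault_detected = votes != copies   # any disagreement flags a fault
    return winner, fault_detected

calls = {"n": 0}

def sometimes_faulty(x):
    # Simulated transient fault: only the second redundant copy misbehaves.
    calls["n"] += 1
    return x * 2 + (999 if calls["n"] == 2 else 0)

result, detected = redundant_execute(sometimes_faulty, 10)
# The majority output (20) survives; the transient fault is detected.
```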
Check Pointing and Roll Back: This is a popular technique in which the first part, the "check point", stores the current state of the system, and this is done occasionally. The checkpoint information is stored in a stable storage device for easy rollback when there is a node failure. Information that is stored includes the environment, the process state, the values of the registers, etc.; this information is very useful if a complete recovery needs to be done [50] [51]. The two forms of rollback recovery are the checkpoint based and the log based rollback recovery techniques. Each type of rollback recovery uses a different mechanism: the checkpoint based technique uses the checkpoint states it has stored in a stable storage device, while the log based rollback recovery technique combines checkpointing with the logging of events [51].
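The checkpoint-and-rollback cycle can be sketched as below, with a dictionary standing in for stable storage; the names and the checkpoint interval are illustrative assumptions.

```python
# Minimal checkpoint-and-rollback sketch: the process state is saved to
# "stable storage" (a dict here) at intervals; after a failure, the
# process restarts from the most recent checkpoint rather than from
# the beginning. All names are illustrative.

import copy

stable_storage = {}

def checkpoint(state):
    # A deep copy stands in for writing the state to stable storage.
    stable_storage["latest"] = copy.deepcopy(state)

def rollback():
    return copy.deepcopy(stable_storage["latest"])

state = {"step": 0, "total": 0}
for step in range(1, 7):
    state["step"] = step
    state["total"] += step
    if step % 3 == 0:          # take a checkpoint every 3 steps
        checkpoint(state)

state = {"step": -1, "total": -1}   # simulate a crash corrupting state
state = rollback()                   # recover the last checkpointed state
```

After the simulated crash, the process resumes from step 6 rather than step 0, which is precisely the saving that checkpointing buys over a full restart.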
In recovery from system failures, two types of checkpoint technique are used, coordinated and uncoordinated checkpointing, and these techniques are related to message logging [34].
Coordinated Check Point: In this technique, checkpoints are coordinated so as to save a consistent state, because the coordinated checkpoints form a consistent set; if the checkpoints are not consistent, a full and complete rollback of the system cannot be done [52]. In a situation of frequent failures, the coordinated checkpoint technique cannot be used. The recovery point can be set to a higher or lower value; when set to a lower value, it improves the performance of the technique, because recovery rolls back only to the last correct state of the system instead of to the very first state or checkpoint.
Uncoordinated Check Point: This technique combines message logging to ensure that the rollback state is correct. Uncoordinated checkpointing executes checkpoints, as well as recovery, independently. There are three types of message logging protocols: optimistic, pessimistic, and causal. The optimistic protocol ensures all messages are logged. The pessimistic protocol makes sure that every message received by a process is logged appropriately and stored in a stable and reliable storage medium before it is forwarded into the system. The causal protocol logs the message information of a process only in the processes that are causally dependent on it [53].
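The pessimistic variant, logging each message to stable storage before delivering it, can be sketched as follows; a Python list stands in for the stable log, and all names are illustrative.

```python
# Sketch of pessimistic message logging: every message is written to a
# stable log *before* it is delivered to the process, so after a crash
# the lost state can be rebuilt by deterministically replaying the log.
# All names are illustrative.

stable_log = []

def receive(process_state, message):
    stable_log.append(message)       # log first (pessimistic)...
    process_state.append(message)    # ...then deliver to the process

def recover():
    # Replay the logged messages in order to rebuild the lost state.
    rebuilt = []
    for message in stable_log:
        rebuilt.append(message)
    return rebuilt

live_state = []
for m in ["m1", "m2", "m3"]:
    receive(live_state, m)

live_state = None          # simulate a crash losing in-memory state
live_state = recover()     # state rebuilt entirely from the stable log
```

Logging before delivery is what makes the protocol pessimistic: no message can influence the process state without already being recoverable, at the cost of a synchronous write on every receive.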

Fusion Based Technique
Replication is the most widely used fault tolerance technique, but its main downside is the number of backups it incurs: the backups increase as the number of tolerated faults increases, and the cost of managing them is very expensive. The fusion based technique solves this problem and stands as an alternative, because it requires fewer backup machines than the replication based technique. As shown in Figure 11, the backup machines are fused corresponding to the given set of machines [53] [54]. The fusion based technique has a very high overhead during the recovery process, which is acceptable when the probability of faults in the system is low.
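The fusion idea can be illustrated with a coding-style sketch: instead of one full backup per machine, a single fused backup (here a bytewise XOR of all machine states) covers one fault among n machines, and the failed state is recovered by combining the fused backup with the surviving states. This is our simplified illustration of the principle, not the construction of the cited papers.

```python
# Sketch of the fusion principle: one fused backup (a bytewise XOR of
# all machine states) replaces per-machine replicas and tolerates one
# failure; recovery XORs the backup with all surviving states.
# All names are illustrative, and states are assumed equal-sized.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

states = [b"AAAA", b"BBBB", b"CCCC"]   # three machines' states

# A single fused backup covers all three machines.
fused = states[0]
for s in states[1:]:
    fused = xor_bytes(fused, s)

# Machine 1 fails; rebuild its state from the backup and the survivors.
recovered = fused
for i, s in enumerate(states):
    if i != 1:
        recovered = xor_bytes(recovered, s)
```

This also shows where the recovery overhead noted above comes from: rebuilding one state requires touching the fused backup and every surviving machine, whereas plain replication recovers by simply switching to a ready copy.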
From Table 2, it is clear that all the methods have the capability to handle multiple faults, and in each method performance can be improved by addressing the critical aspects involved. Across all the techniques, there is a strong need for a reliable, accurate, and fully adaptive multiple-failure detector mechanism [53] [54].

Conclusion
Fault tolerance is a major part of a distributed system, because it ensures the continuity and functionality of the system at the point where a fault or failure occurs. This article presented the different types of fault tolerance techniques in distributed systems, such as the fusion based technique, checkpointing and rollback, and replication based fault tolerance. Each mechanism has advantages over the others as well as costs in deployment. We also highlighted the levels of fault tolerance: hardware fault tolerance, which provides additional backup hardware such as memory blocks, CPUs, etc.; software fault tolerance, which comprises checkpoint storage and rollback recovery mechanisms; and system fault tolerance, a complete system that does both software and hardware fault tolerance to ensure availability of the system during failure, error, or fault. Future research will compare the various data security mechanisms and their performance metrics.

Figure 2. A simple architecture of a distributed system.

Figure 8. Replication based technique in distributed system.

Table 1. Comparison of fault tolerance mechanisms.

Table 2. Comparison of the different fault tolerance techniques in distributed systems.