Fault Tolerance Mechanisms in Distributed Systems


The use of technology has increased vastly and today computer systems are interconnected via different communication medium. The use of distributed systems in our day to day activities has solely improved with data distributions. This is because distributed systems enable nodes to organise and allow their resources to be used among the connected systems or devices that make people to be integrated with geographically distributed computing facilities. The distributed systems may lead to lack of service availability due to multiple system failures on multiple failure points. This article highlights the different fault tolerance mechanism in distributed systems used to prevent multiple system failures on multiple failure points by considering replication, high redundancy and high availability of the distributed services.

Share and Cite:

Sari, A. and Akkaya, M. (2015) Fault Tolerance Mechanisms in Distributed Systems. International Journal of Communications, Network and System Sciences, 8, 471-482. doi: 10.4236/ijcns.2015.812042.

Conflicts of Interest

The authors declare no conflicts of interest.


[1] Sebepou, Z. and Magoutis, K. (2011) CEC: Continuous Eventual Checkpointing for Data Stream Processing Operators. Proceedings of IEEE/IFIP 41st International Conference on Dependable Systems and Networks, 145-156.
[2] Sari, A. and Necat, B. (2012) Impact of RTS Mechanism on TORA and AODV Protocol’s Performance in Mobile Ad Hoc Networks. International Journal of Science and Advanced Technology, 2, 188-191.
[3] Chen, W.H. and Tsai, J.C. (2014) Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems.
[4] Sari, A. and Necat, B. (2012) Securing Mobile Ad Hoc Networks against Jamming Attacks through Unified Security Mechanism. International Journal of Ad Hoc, Sensor & Ubiquitous Computing, 3, 79-94.
[5] Balazinska, M., Balakrishnan, H., Madden, S. and Stonebraker, M. (2008) Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM Transactions on Database Systems, 33, 1-44.
[6] Sari, A. (2014) Security Approaches in IEEE 802.11 MANET—Performance Evaluation of USM and RAS. International Journal of Communications, Network, and System Sciences, 7, 365-372.
[7] Elnozahy, E.N.M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (2002) A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34, 375-408.
[8] Sari, A. (2014) Security Issues in RFID Middleware Systems: A Case of Network Layer Attacks: Proposed EPC Implementation for Network Layer Attacks. Transactions on Networks & Communications, 2, 1-6.
[9] Andrew, S. (1995) Tanenbaum Distributed Operating Systems. Prentice Hall, Upper Saddle River.
[10] Sari, A. (2015) Lightweight Robust Forwarding Scheme for Multi-Hop Wireless Networks. International Journal of Communications, Network and System Sciences, 8, 19-28.
[11] Coulouris, G., Dollimore, J. and Kindberg, T. (2001) Distributed Systems: Concepts and Design, 4th Edition, Pearson Education Ltd., New York.
[12] Carter, W.C. and Bouricius, W.G. (1971) A Survey of Fault-Tolerant Computer Architecture and Its Evaluation. Computer, 4, 9-16.
[13] Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy—A Survey. IEEE Computer Group News, 2, 2-17.
[14] Sari, A. (2015) Two-Tier Hierarchical Cluster Based Topology in Wireless Sensor Networks for Contention Based Protocol Suite. International Journal of Communications, Network and System Sciences, 8, 29-42.
[15] Cooper, A.E. and Chow, W.T. (1976) Development of On-Board Space Computer Systems. IBM Journal of Research and Development, 20, 5-19.
[16] Tanenbaum, A. and Van Steen, M. (2007) Distributed Systems: Principles and Paradigms. 2nd Edition, Pearson Prentice Hall, Upper Saddle River.
[17] Koren, I. and Krishna, C.M. (2007) Fault-Tolerance Systems. Elsevier Inc., San Francisco.
[18] Sari, A. and Onursal, O. (2013) Role of Information Security in E-Business Operations. International Journal of Information Technology and Business Management, 3, 90-93.
[19] Avizienis, A., Kopetz, H. and Laprie, J.C. (1987) Dependable Computing and Fault-Tolerant Systems, Volume 1: The Evolution of Fault-Tolerant Computing. Springer-Verlag, Vienna, 193-213.
[20] Sari, A. and Caglar, E. (2015) Performance Simulation of Gossip Relay Protocol in Multi-Hop Wireless Networks. Social and Applied Sciences Journal, Girne American University, 7, 145-148.
[21] Harper, R., Lala, J. and Deyst, J. (1988) Fault-Tolerant Parallel Processor Architectural Overview. Proceedings of the 18st International Symposium on Fault-Tolerant Computing, Tokyo, 27-30 June 1988.
[22] Sari, A. and Rahnama, B. (2013) Addressing Security Challenges in WiMAX Environment. In: Proceedings of the 6th International Conference on Security of Information and Networks, ACM Press, New York, 454-456.
[23] Briere, D. and Traverse, P. (1993) AIRBUS A320/A330/A340 Electrical Flight Controls: A Family of Fault-Tolerant Systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, Toulouse, 22-24 June 1993.
[24] Charron-Bost, B., Pedone, F. and Schiper, A. (2010) Replication: Theory and Practice. Lecture Notes in Computer Science, 5959.
[25] Sari, A. (2015) Security Issues in Mobile Wireless Ad Hoc Networks: A Comparative Survey of Methods and Techniques to Provide Security in Wireless Ad Hoc Networks. New Threats and Countermeasures in Digital Crime and Cyber Terrorism, IGI Global, Hershey, 66-94.
[26] Sari, A. and Rahnama, B. (2013) Simulation of 802.11 Physical Layer Attacks in MANET. Proceedings of the Fifth International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), Madrid, 5-7 June 2013, 334-337.
[27] Tanenbaum, A.S. and van Steen, M. (2002) Distributed Systems: Principles and Paradigms. Pearson Education Asia.
[28] Sari, A., Rahnama, B. and Caglar, E. (2014) Ultra-Fast Lithium Cell Charging for Mission Critical Applications. Transactions on Machine Learning and Artificial Intelligence, 2, 11-18.
[29] Ebnenasir, A. (2005) Software Fault-Tolerance. Computer Science and Engineering Department, Michigan State University, East Lansing.
http://www.cse.msu.edu/~cse870/Lectures/SS2005 /ft1.pdf
[30] Birman, K. (2005) Reliable Distributed Systems: Technologies, Web Services and Applications. Springer-Verlag, Berlin.
[31] Obasuyi, G. and Sari, A. (2015) Security Challenges of Virtualization Hypervisors in Virtualized Hardware Environment. International Journal of Communications, Network and System Sciences, 8, 260-273.
[32] Avizienis, A. (1975) Architecture of Fault-Tolerant Computing Systems. Proceedings of the 1975 International Symposium on Fault-Tolerant Computing, Paris, 18-20 June 1975, 3-16.
[33] Sari, A. (2015) A Review of Anomaly Detection Systems in Cloud Networks and Survey of Cloud Security Measures in Cloud Storage Applications. Journal of Information Security, 6, 142-154.
[34] Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy—A Survey. IEEE Computer Group News, 2, 2-17.
[35] Sari, A. (2014) Influence of ICT Applications on Learning Process in Higher Education. Procedia—Social and Behavioral Sciences, 116, 4939-4945.
[36] Huang, M. and Bode, B. (2005) A Performance Comparison of Tree and Ring Topologies in Distributed Systems. Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, Denver, 4-8 April 2005, 258.1.
[37] Huang, M. (2004) A Performance Comparison of Tree and Ring Topologies in Distributed System. Master’s Thesis.
[38] Minar, N. (2001) Distributed Systems Topologies: Part 1.
[39] Wiesmann, M., Pedone, F., Schiper, A., Kemme, B. and Alonso, G. (2000) Understanding Replication in Databases and Distributed Systems. Research Supported by EPFL-ETHZ DRAGON Project and OFES.
[40] Herlihy, M. and Wing, J. (1990) Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12, 463-492.
[41] Ahamad, M., Hutto, P.W., Neiger, G., Burns, J.E. and Kohli, P. (1994) Causal Memory: Definitions, Implementations and Programming. TR GIT-CC-93/55, Georgia Institute of Technology, Atlanta.
[42] Rahnama, B., Sari, A. and Makvandi, R. (2013) Countering PCIe Gen. 3 Data Transfer Rate Imperfection Using Serial Data Interconnect. Proceedings of the 2013 International Conference on Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), Konya, 9-11 May 2013, 579-582.
[43] Budhiraja, N., Marzullo, K., Schneider, F.B. and Toueg, S. (1993) The Primary-Backup Approach. In: Mullender, S., Ed., Distributed Systems, ACM Press, New York, 199-216.
[44] Gifford, D.K. (1979) Weighted Voting for Replicated Data. Proceedings of the Seventh ACM Symposium on Operating Systems Principles, Pacific Grove, 10-12 December 1979, 150-162.
[45] Osrael, J., Froihofer, L., Goeschka, K.M., Beyer, S., Galdamez, P. and Munoz, F. (2006) A System Architecture for Enhanced Availability of Tightly Coupled Distributed Systems. Proceedings of the First International Conference on Availability, Reliability and Security, Vienna, 20-22 April 2006.
[46] Cao, H.H. and Zhu, J.M. (2008) An Adaptive Replicas Creation Algorithm with Fault Tolerance in the Distributed Storage Network. Proceedings of the Second International Symposium on Intelligent Information Technology Application, Shanghai, 20-22 December 2008, 738-741.
[47] Shye, A., Blomstedt, J., Moseley, T., Reddi, V. and Connors, D.A. (2008) PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6, 135-148.
[48] Agarwal, V. (2004) Fault Tolerance in Distributed Systems. Indian Institute of Technology, Kanpur.
[49] Jung, H., Shin, D., Kim, H. and Lee, H.Y. (2005) Design and Implementation of Multiple Fault Tolerant MPI over Myrinet (M3). In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, Washington DC, 32.
[50] Elnozahy, M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (1996) A Survey of Rollback-Recovery Protocols in Message Passing Systems. Technical Report CMU-CS-96-81, School of Computer Science, Carnegie Mellon University, Pittsburgh.
[51] Alvisi, L. and Marzullo, K. (1995) Message Logging: Pessimistic, Optimistic, and Causal. Proceedings of the 15th International Conference on Distributed Computing, Systems (ICDCS 1995), Vancouver, 30 May-2 Jun 1995, 229-236.
[52] Garg, V.K. (2010) Implementing Fault-Tolerant Services Using Fused State Machines. Technical Report ECE-PDS-2010-001, Parallel and Distributed Systems Laboratory, ECE Department, University of Texas, Austin.
[53] Xiong, N., Cao, M., He, J. and Shu, L. (2009) A Survey on Fault Tolerance in Distributed Network Systems. Proceedings of the 2009 International Conference on Computational Science and Engineering, Vancouver, 29-31 August 2009, 1065-1070.
[54] Tian, D., Wu, K. and Li, X. (2008) A Novel Adaptive Failure Detector for Distributed Systems. Proceedings of the 2008 International Conference on Networking, Architecture, and Storage, Chongqing, 12-14 June 2008, 215-221.

Copyright © 2023 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.