Journal of Software Engineering and Applications

Volume 10, Issue 13 (December 2017)

ISSN Print: 1945-3116   ISSN Online: 1945-3124


Fault Tolerance for Lifeline-Based Global Load Balancing

PP. 925-958
DOI: 10.4236/jsea.2017.1013053

ABSTRACT

Fault tolerance has become an important issue in parallel computing. It is often addressed at the system level, but application-level approaches are receiving increasing attention. We consider a parallel programming pattern, the task pool, and provide a fault-tolerant implementation in a library. Specifically, our work refers to lifeline-based global load balancing, an advanced task pool variant that is implemented in the GLB framework of the parallel programming language X10. The variant considers side-effect-free tasks whose results are combined into a final result by reduction. Our algorithm is able to recover from multiple fail-stop failures. If recovery is not possible, it halts with an error message. In the algorithm, each worker regularly saves its local task pool contents in the main memory of a backup partner. Backups are also updated when tasks are stolen. After failures, the backup partner takes over saved copies and collects others. In case of multiple failures, invocations of the restore protocol are nested. We have implemented the algorithm by extending the source code of the GLB library. In performance measurements on up to 256 places, we observed an overhead between 0.5% and 30%. The particular value depends on the application’s steal rate and task pool size. Sources of performance overhead have been further analyzed with a logging component.
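
The backup idea summarized in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the GLB library's code or API: the names `Worker`, `partner_of`, `backup`, `steal`, `fail`, `process`, and `demo` are hypothetical, the partner rule (next alive worker in a ring) is an assumption, tasks are integers whose result is their square, and the reduction is a sum. Steals and failures are treated as atomic here, which the real distributed protocol cannot assume.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    ident: int
    pool: list = field(default_factory=list)  # local task pool
    result: int = 0                            # partial result (reduction: sum)
    alive: bool = True
    # Snapshots kept in this worker's memory for others:
    # owner id -> (copy of the owner's pool, the owner's partial result)
    held: dict = field(default_factory=dict)


def partner_of(workers, i):
    """Next alive worker after i in the ring (assumed partner rule)."""
    n = len(workers)
    for step in range(1, n):
        cand = workers[(i + step) % n]
        if cand.alive:
            return cand
    raise RuntimeError("no backup partner left; recovery not possible")


def backup(workers, w):
    """Save w's pool and partial result in its partner's memory."""
    partner_of(workers, w.ident).held[w.ident] = (list(w.pool), w.result)


def process(workers, w, n_tasks, backup_every=2):
    """Run up to n_tasks tasks of w, saving a backup every few tasks."""
    for done in range(1, n_tasks + 1):
        if not w.pool:
            break
        task = w.pool.pop()
        w.result += task * task  # side-effect-free task, summed result
        if done % backup_every == 0:
            backup(workers, w)


def steal(workers, thief, victim, count):
    """Move tasks from victim to thief, then refresh both backups."""
    for _ in range(min(count, len(victim.pool))):
        thief.pool.append(victim.pool.pop())
    backup(workers, victim)
    backup(workers, thief)


def fail(workers, dead):
    """Fail-stop failure: the partner takes over the last saved copy."""
    dead.alive = False
    lost_snapshots = dead.held  # snapshots that vanish with the failed worker
    dead.pool, dead.held = [], {}
    heir = partner_of(workers, dead.ident)
    saved_pool, saved_result = heir.held.pop(dead.ident)
    heir.pool.extend(saved_pool)   # unsaved work is redone from the snapshot
    heir.result += saved_result
    backup(workers, heir)
    for owner_id in lost_snapshots:  # owners need a new backup location
        if workers[owner_id].alive:
            backup(workers, workers[owner_id])


def demo():
    workers = [Worker(i) for i in range(4)]
    for t in range(1, 21):
        workers[t % 4].pool.append(t)
    for w in workers:
        backup(workers, w)
    process(workers, workers[1], 3)           # third task is not yet backed up
    steal(workers, workers[0], workers[2], 2)
    fail(workers, workers[1])                 # worker 2 takes over
    fail(workers, workers[2])                 # worker 3 takes over both
    for w in workers:
        if w.alive:
            process(workers, w, len(w.pool))
    return sum(w.result for w in workers if w.alive)
```

Calling `demo()` distributes the tasks 1 to 20, fails two workers in succession, and returns the combined result of the survivors, which matches the failure-free sum of squares because each snapshot keeps a pool and a partial result that are consistent with each other.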

Share and Cite:

Fohry, C., Bungart, M. and Plock, P. (2017) Fault Tolerance for Lifeline-Based Global Load Balancing. Journal of Software Engineering and Applications, 10, 925-958. doi: 10.4236/jsea.2017.1013053.

Cited by

[1] Task-Level Resilience: Checkpointing vs. Supervision
International Journal of Networking and …, 2022
[2] Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems
2021
[3] Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks
2021
[4] Checkpointing and Localized Recovery for Nested Fork-Join Programs
2021
[5] A Comparison of Application-Level Fault Tolerance Schemes for Task Pools
2020
[6] Resilience in high-level parallel programming languages
2019
[7] A Java Task Pool Framework providing Fault-Tolerant Global Load Balancing
International Journal of Networking and Computing, 2018
[8] A Selective and Incremental Backup Scheme for Task Pools
2018
[9] Fehlertoleranz und Elastizität für ein Framework zur globalen Lastenbalancierung
2018

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.