Fault Tolerance for Lifeline-Based Global Load Balancing

PP. 925-958
DOI: 10.4236/jsea.2017.1013053

ABSTRACT

Fault tolerance has become an important issue in parallel computing. It is often addressed at the system level, but application-level approaches are receiving increasing attention. We consider a parallel programming pattern, the task pool, and provide a fault-tolerant implementation in a library. Specifically, our work addresses lifeline-based global load balancing, an advanced task pool variant that is implemented in the GLB framework of the parallel programming language X10. The variant considers side-effect-free tasks whose results are combined into a final result by reduction. Our algorithm is able to recover from multiple fail-stop failures. If recovery is not possible, it halts with an error message. In the algorithm, each worker regularly saves its local task pool contents in the main memory of a backup partner. Backups are also updated on steals. After failures, the backup partner takes over saved copies and collects others. In case of multiple failures, invocations of the restore protocol are nested. We have implemented the algorithm by extending the source code of the GLB library. In performance measurements on up to 256 places, we observed an overhead between 0.5% and 30%. The particular value depends on the application's steal rate and task pool size. Sources of performance overhead have been further analyzed with a logging component.
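The backup scheme described in the abstract can be illustrated with a small single-process model. This is only a sketch under simplifying assumptions, not the GLB implementation and not X10 code: the names `Worker`, `Ring`, `backup`, `steal`, `fail`, and `run_demo` are hypothetical, workers are arranged in a ring in which each worker's backup partner is the next live worker, a task is a plain integer, and the reduction is a sum. Lifelines, asynchronous messaging, and the nesting of restore invocations are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    ident: int
    pool: list = field(default_factory=list)     # pending tasks (plain integers here)
    result: int = 0                               # partial result of the reduction (a sum)
    alive: bool = True
    backups: dict = field(default_factory=dict)   # owner id -> (pool copy, partial result)


class Ring:
    """Hypothetical model: each worker backs up to the next live worker in a ring."""

    def __init__(self, n):
        self.workers = [Worker(i) for i in range(n)]

    def partner(self, w):
        n = len(self.workers)
        for step in range(1, n):
            cand = self.workers[(w.ident + step) % n]
            if cand.alive:
                return cand
        raise RuntimeError("recovery impossible: no live backup partner")

    def backup(self, w):
        # Save a consistent snapshot (pending tasks plus matching partial
        # result) in the main memory of the backup partner.
        self.partner(w).backups[w.ident] = (list(w.pool), w.result)

    def process(self, w, count):
        for _ in range(min(count, len(w.pool))):
            w.result += w.pool.pop()              # "running" a task adds its value

    def steal(self, thief, victim):
        half = len(victim.pool) // 2
        thief.pool.extend(victim.pool[:half])
        del victim.pool[:half]
        # Update both backups so stolen tasks are neither lost nor duplicated.
        self.backup(victim)
        self.backup(thief)

    def fail(self, w):
        # Fail-stop: everything in w's memory is gone, including the
        # backups it held for others (only their owner ids are reused here).
        owners = list(w.backups)
        w.alive, w.pool, w.result, w.backups = False, [], 0, {}
        holder = self.partner(w)
        snapshot = holder.backups.pop(w.ident, None)
        if snapshot is None:
            raise RuntimeError("recovery impossible: no backup for failed worker")
        saved_pool, saved_result = snapshot
        holder.pool.extend(saved_pool)            # partner takes over the saved copy
        holder.result += saved_result
        self.backup(holder)
        for owner in owners:                      # workers that lost their backup
            if self.workers[owner].alive:
                self.backup(self.workers[owner])  # write a fresh one to a new partner


def run_demo():
    ring = Ring(4)
    w0, w1 = ring.workers[0], ring.workers[1]
    w0.pool = list(range(1, 101))                 # 100 tasks, values sum to 5050
    for w in ring.workers:
        ring.backup(w)
    ring.process(w0, 10)
    ring.backup(w0)                               # regular backup
    ring.steal(w1, w0)                            # backups updated for the steal
    ring.process(w1, 5)                           # progress not yet backed up
    ring.fail(w1)                                 # lost work is redone from the snapshot
    for w in ring.workers:
        if w.alive:
            ring.process(w, len(w.pool))
    return sum(w.result for w in ring.workers if w.alive)
```

In this model, `run_demo()` returns 5050, the same total as a failure-free run: the five tasks that the failed worker had processed since its last backup are simply executed again from the snapshot, which is safe only because the tasks are side-effect-free.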

Share and Cite:

Fohry, C., Bungart, M. and Plock, P. (2017) Fault Tolerance for Lifeline-Based Global Load Balancing. Journal of Software Engineering and Applications, 10, 925-958. doi: 10.4236/jsea.2017.1013053.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.