TY - CHAP
T1 - Performance analysis of fault tolerant algorithms for the heat equation in three space dimensions
AU - Ltaief, H.
AU - Garbey, M.
AU - Gabriel, E.
N1 - Copyright: Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
AB - Based on distributed and uncoordinated checkpointing, the numerical methods presented in this chapter can reconstruct a consistent state in a parallel application even though the checkpoints of different processes are stored at different time steps. The main purpose of these algorithms is to avoid the expensive rollback to the last consistent distributed checkpoint, which loses all subsequent work, and to avoid the significant overhead that coordinated checkpointing adds for applications running on thousands of processors. The first method, the forward implicit scheme, requires that the boundary variables of each time step be stored along with the current solution for use in the reconstruction procedure; the second method, based on explicit space/time marching, requires checkpointing the solution of each process at every time step. To stabilize this scheme, a hyperbolic regularization such as the telegraph equation, which is a perturbation of the heat equation, may be added. Performance results comparing both methods with respect to checkpointing overhead are presented. The checkpointing infrastructure implemented for the 3D heat equation uses two groups of processes: a solver group, composed of the processes that solve the problem itself, and a spare group, whose main function is to store the local data from the solver processes. © 2007
UR - http://www.scopus.com/inward/record.url?scp=84882894708&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84882894708&partnerID=8YFLogxK
U2 - 10.1016/B978-044453035-6/50018-3
DO - 10.1016/B978-044453035-6/50018-3
M3 - Chapter
AN - SCOPUS:84882894708
SN - 9780444530356
SP - 123
EP - 130
BT - Parallel Computational Fluid Dynamics 2006
PB - Elsevier
ER -