TY - CHAP
T1 - On a fault tolerant algorithm for a parallel CFD application
AU - Garbey, M.
AU - Ltaief, H.
N1 - Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2006
Y1 - 2006
N2 - The chapter presents one new component of a grid solver (G-solver), which is a general framework to efficiently solve a broad variety of PDE problems in grid environments. A grid can be seen as a large and complex system of heterogeneous computers, where individual nodes and network links can fail. A G-solver must be efficient and robust to solve the large problems that justify grid environments. This implies that a grid solver should maintain a high level of numerical efficiency in a heterogeneous environment while being tolerant to high-latency, low-bandwidth communication, as well as system and numerical failures. The chapter also focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and then roll back when a node fails. However, this approach does not scale. As the number of nodes and the problem size increase, the cost of checkpointing and recovery increases, while the mean time between failures decreases.
AB - The chapter presents one new component of a grid solver (G-solver), which is a general framework to efficiently solve a broad variety of PDE problems in grid environments. A grid can be seen as a large and complex system of heterogeneous computers, where individual nodes and network links can fail. A G-solver must be efficient and robust to solve the large problems that justify grid environments. This implies that a grid solver should maintain a high level of numerical efficiency in a heterogeneous environment while being tolerant to high-latency, low-bandwidth communication, as well as system and numerical failures. The chapter also focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and then roll back when a node fails. However, this approach does not scale. As the number of nodes and the problem size increase, the cost of checkpointing and recovery increases, while the mean time between failures decreases.
UR - http://www.scopus.com/inward/record.url?scp=78651524865&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78651524865&partnerID=8YFLogxK
U2 - 10.1016/B978-044452206-1/50015-X
DO - 10.1016/B978-044452206-1/50015-X
M3 - Chapter
AN - SCOPUS:78651524865
SN - 9780444522061
SP - 133
EP - 140
BT - Parallel Computational Fluid Dynamics 2005
PB - Elsevier
ER -