TY - CHAP

T1 - On a fault tolerant algorithm for a parallel CFD application

AU - Garbey, M.

AU - Ltaief, H.

N1 - Copyright 2013 Elsevier B.V., All rights reserved.

PY - 2006

Y1 - 2006

N2 - The chapter presents one new component of a grid solver (G-solver), a general framework for efficiently solving a broad variety of PDE problems in grid environments. A grid can be seen as a large and complex system of heterogeneous computers in which individual nodes and network links can fail. A G-solver must be efficient and robust to solve the large problems that justify grid environments. This implies that a grid solver should maintain a high level of numerical efficiency in a heterogeneous environment while tolerating high-latency, low-bandwidth communication as well as system and numerical failures. The chapter also focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and then roll back when a node fails. However, this approach does not scale: as the number of nodes and the problem size increase, the cost of checkpointing and recovery increases, while the mean time between failures decreases.

AB - The chapter presents one new component of a grid solver (G-solver), a general framework for efficiently solving a broad variety of PDE problems in grid environments. A grid can be seen as a large and complex system of heterogeneous computers in which individual nodes and network links can fail. A G-solver must be efficient and robust to solve the large problems that justify grid environments. This implies that a grid solver should maintain a high level of numerical efficiency in a heterogeneous environment while tolerating high-latency, low-bandwidth communication as well as system and numerical failures. The chapter also focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and then roll back when a node fails. However, this approach does not scale: as the number of nodes and the problem size increase, the cost of checkpointing and recovery increases, while the mean time between failures decreases.

UR - http://www.scopus.com/inward/record.url?scp=78651524865&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78651524865&partnerID=8YFLogxK

U2 - 10.1016/B978-044452206-1/50015-X

DO - 10.1016/B978-044452206-1/50015-X

M3 - Chapter

AN - SCOPUS:78651524865

SN - 9780444522061

SP - 133

EP - 140

BT - Parallel Computational Fluid Dynamics 2005

PB - Elsevier

ER -