On a fault tolerant algorithm for a parallel CFD application

M. Garbey, H. Ltaief

Research output: Chapter in Book/Report/Conference proceeding › Chapter

3 Scopus citations

Abstract

The chapter presents a new component of a grid solver (G-solver), a general framework for efficiently solving a broad variety of PDE problems in grid environments. A grid can be seen as a large, complex system of heterogeneous computers in which individual nodes and network links can fail. A G-solver must be both efficient and robust to solve the large problems that justify grid environments. A grid solver should therefore maintain a high level of numerical efficiency in a heterogeneous environment while tolerating high-latency, low-bandwidth communication as well as system and numerical failures. The chapter focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and roll back when a node fails. However, this approach does not scale: as the number of nodes and the problem size increase, the cost of checkpointing and recovery increases while the mean time between failures decreases.
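
The scaling argument above can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the chapter. It assumes independent, exponentially distributed node failures, so that the system MTBF is the node MTBF divided by the node count, and it uses Young's first-order approximation for the checkpoint interval. All function names and numeric values are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, n_nodes: int) -> float:
    """System MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / n_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """First-order fraction of run time lost to checkpoints plus rework.

    Only meaningful while the checkpoint cost is much smaller than the MTBF.
    """
    interval = young_interval(checkpoint_cost_hours, mtbf_hours)
    checkpoint_share = checkpoint_cost_hours / interval
    # On average, half an interval of work is redone after each failure.
    rework_share = interval / (2.0 * mtbf_hours)
    return checkpoint_share + rework_share


if __name__ == "__main__":
    NODE_MTBF_HOURS = 50_000.0    # hypothetical per-node MTBF
    COST_PER_NODE_HOURS = 0.0005  # hypothetical: checkpoint cost grows linearly with node count
    for n in (10, 100, 1000):
        mtbf = system_mtbf(NODE_MTBF_HOURS, n)
        cost = COST_PER_NODE_HOURS * n
        print(n, round(mtbf, 1), round(overhead_fraction(cost, mtbf), 4))
```

Under these assumptions the lost fraction grows with the node count, because the checkpoint cost rises while the system MTBF shrinks. This is the trend the abstract describes, though the real behaviour depends on the application and the grid.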

Original language: English (US)
Title of host publication: Parallel Computational Fluid Dynamics 2005
Publisher: Elsevier
Pages: 133-140
Number of pages: 8
ISBN (Print): 9780444522061
DOIs
State: Published - Dec 1 2006

ASJC Scopus subject areas

  • Chemical Engineering (all)
