Performance analysis of fault tolerant algorithms for the heat equation in three space dimensions

H. Ltaief, M. Garbey, E. Gabriel

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

The numerical methods presented in this chapter are based on distributed, uncoordinated checkpointing and can reconstruct a consistent state in a parallel application even though the checkpoints of the various processes are stored at different time steps. The main purpose of these algorithms is to avoid the expensive rollback to the last consistent distributed checkpoint, which loses all subsequent work, and to avoid the significant overhead that coordinated checkpoints add for applications running on thousands of processors. The first method, the forward implicit scheme, requires the boundary variables of each time step to be stored along with the current solution for the reconstruction procedure. The second method, based on explicit space/time marching, requires checkpointing the solution of each process at every time step. To stabilize the scheme, a hyperbolic regularization may be added, such as the telegraph equation, which is a perturbation of the heat equation. Performance results comparing both methods with respect to checkpoint overhead are presented. The checkpointing infrastructure implemented for the 3D heat equation uses two groups of processes: a solver group, composed of processes that solve the problem itself, and a spare group, whose main function is to store the local data from the solver processes. © 2007
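
For orientation, the hyperbolic regularization mentioned in the abstract can be written in its textbook form as follows. This is a generic sketch and is not taken from the chapter: the symbols u, Δ, and the small parameter ε are assumptions introduced here, and the chapter may use a different scaling, sign convention, or notation.

```latex
% Heat equation in three space dimensions (u = u(x, y, z, t)):
\frac{\partial u}{\partial t} = \Delta u,
\qquad
\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}

% Telegraph equation: a hyperbolic perturbation of the heat equation,
% with an assumed small parameter \varepsilon:
\varepsilon \frac{\partial^2 u}{\partial t^2} + \frac{\partial u}{\partial t} = \Delta u,
\qquad
0 < \varepsilon \ll 1
```

Setting ε = 0 in the second equation formally recovers the heat equation, which is the sense in which the telegraph equation is a perturbation of it.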

Original language: English (US)
Title of host publication: Parallel Computational Fluid Dynamics 2006
Publisher: Elsevier
Pages: 123-130
Number of pages: 8
ISBN (Print): 9780444530356
State: Published - 2007

ASJC Scopus subject areas

  • Chemical Engineering (all)
