The chapter presents one new component of a grid solver (G-solver), a general framework for efficiently solving a broad variety of PDE problems in grid environments. A grid can be seen as a large and complex system of heterogeneous computers in which individual nodes and network links can fail. A G-solver must be both efficient and robust to solve the large problems that justify grid environments. This implies that a grid solver should maintain a high level of numerical efficiency in a heterogeneous environment while tolerating high-latency, low-bandwidth communication as well as system and numerical failures. The chapter also focuses on fault tolerance. The state of the art in fault tolerance for long-running applications on a grid of computers is to checkpoint the state of the full application and then roll back when a node fails. However, this approach does not scale: as the number of nodes and the problem size increase, the cost of checkpointing and recovery increases, while the mean time between failures decreases.
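The scaling argument can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the chapter. It assumes independent node failures with exponentially distributed failure times (so the system mean time between failures is the per-node value divided by the node count) and uses Young's first-order approximation for the checkpoint interval, which is only reasonable while the interval is much shorter than the system MTBF. All function names and numeric values are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours, num_nodes):
    """System MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(checkpoint_cost_hours, mtbf_hours):
    """First-order fraction of run time lost to checkpointing plus rework.

    Combines the cost of writing checkpoints (C / tau) with the expected
    recomputation after a failure (tau / (2 * M)), evaluated at Young's interval.
    """
    tau = young_interval(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / tau + tau / (2.0 * mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 10_000.0   # hypothetical per-node MTBF, in hours
    checkpoint_cost = 0.1  # hypothetical cost of one full checkpoint, in hours
    for nodes in (10, 100, 1_000):
        mtbf = system_mtbf(node_mtbf, nodes)
        lost = overhead_fraction(checkpoint_cost, mtbf)
        print(f"{nodes:>5} nodes: system MTBF {mtbf:8.1f} h, overhead {lost:.1%}")
```

Even with the checkpoint cost held fixed, the estimated overhead grows with the node count in this model. If the checkpoint cost also grows with the problem size, as the text describes, the overhead rises faster still.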
ASJC Scopus subject areas
- Chemical Engineering(all)