Title: "Portable Checkpointing and Recovery"

Authors: L. M. Silva, J. G. Silva, S. Chapple, L. Clarke

Keywords: Fault-Tolerance, Checkpointing, Crash-Recovery, Parallel Libraries.

Conference: 4th Int. Symposium on High Performance Distributed Computing (HPDC-4), Pentagon City, Virginia, USA, August 1995


ABSTRACT
This paper presents a checkpointing scheme implemented in a parallel library that runs on top of CHIMP/MPI. The main goals of the checkpointing mechanism are portability and efficiency: it runs in a machine-independent way on every platform supported by MPI. The scheme allows checkpoints to be migrated and offers a flexible recovery mechanism based on data reconfiguration. Performance results are presented at the end of the paper, together with techniques that can be used to increase the efficiency of the checkpointing mechanism.