Title: "Portable Checkpointing and Recovery"
Authors: L.M. Silva, J.G. Silva, S. Chapple, L. Clarke
Keywords: Fault-Tolerance, Checkpointing, Crash-Recovery, Parallel Libraries.
Conference: 4th Int. Symposium on High Performance Distributed
Computing (HPDC-4), Pentagon City, Virginia, USA, August 1995
ABSTRACT
This paper presents a checkpointing scheme that was implemented
in a parallel library that runs on top of CHIMP/MPI. The main
goals of the checkpointing mechanism are portability and efficiency.
It runs on every platform supported by MPI in a machine-independent
way. The scheme allows the migration of checkpoints and offers
a flexible recovery mechanism based on data-reconfiguration.
Performance results are presented at the end of the paper,
together with techniques that can increase the efficiency of the
checkpointing mechanism.