| Abstract |
High-performance computing platforms such as clusters, Grids, and desktop
Grids are becoming larger and subject to more frequent failures. MPI is
one of the most widely used message-passing libraries in HPC applications. These
two trends raise the need for a fault-tolerant MPI. The MPICH-V project
focuses on designing, implementing, and comparing several automatic
fault-tolerance protocols for MPI applications.
I will present the four fault-tolerant protocols implemented in MPICH-V
using MPICH, covering a broad spectrum of known approaches from
coordinated checkpointing to causal message logging, and compare their
performance.
|