2005 MCS Divisional Seminars & Colloquia


MPICH-V: Toward an Automatic/Scalable Fault-Tolerant MPI for
Clusters & Grids

  Pierre Lemarinier

  Hosted by  Rusty Lusk

10:30 AM, April 26, 2005
Building 221,  Room A216


Abstract

High-performance computing platforms such as clusters, Grids, and desktop Grids are growing larger and are subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. Together, these two trends create the need for a fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerance protocols for MPI applications.
 

I will present the four fault-tolerant protocols implemented in MPICH-V using MPICH, which cover a broad spectrum of known approaches, from coordinated checkpointing to causal message logging, and compare their performance.

 
