NEW MPICH-V release (Web May 01 2007).

This release introduces a new fault-tolerant protocol for the MPICH2 implementation: Pcl, a blocking Chandy-Lamport protocol. Protocols for high-performance networks in MPICH2 will follow soon.

MPICH-V is a research effort combining theoretical studies, experimental evaluations and pragmatic implementations, aiming to provide an MPI implementation based on MPICH that features multiple fault-tolerant protocols.

MPICH-V provides an automatic fault-tolerant MPI library: a totally unchanged application linked with the MPICH-V library is a fault-tolerant application.

The features of MPICH-V make it attractive for a) large clusters, b) clusters made from collections of nodes in a LAN environment (Desktop Grids), c) Grid deployments harnessing several clusters and d) campus- or industry-wide Desktop Grids with volatile nodes (i.e. all infrastructures featuring a synchronous network or a controllable area network).

Currently, MPICH-V features four different protocols. We are working on a new implementation of all these protocols inside a generic framework (the ch-v device). Deprecated protocols are no longer maintained and will be replaced soon.

MPICH-PCL (new implementation)

MPICH-PCL features a blocking Chandy-Lamport fault-tolerant protocol in the MPICH2 implementation. It consists of a new channel, called ft-sock, based on the TCP sock channel, and two components, a checkpoint server and a specific dispatcher, supporting large-scale and heterogeneous applications. We also developed a migration capability. A computation can now restart from a given checkpoint wave.

  • blocking coordinated checkpoint protocol
  • remote checkpoint server
  • in MPICH2 implementation
  • migration

Check out the SC06 paper for a complete description.
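To make the idea of a blocking coordinated checkpoint wave more concrete, here is a minimal single-process simulation in Python. It is only a sketch under stated assumptions, not the ft-sock code: the names Process, send, checkpoint_wave, restart_from_checkpoint and the MARKER control message are all hypothetical, channels are modelled as in-memory FIFO queues, and the application state is a toy running sum. It shows one common way to make the wave blocking: every process stops sending, pushes a marker on each outgoing channel, and flushes its incoming channels up to the marker before saving its state, so no in-flight message has to be stored with the checkpoint.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical control message closing a channel for this wave


class Process:
    def __init__(self, rank, world_size):
        self.rank = rank
        self.state = 0  # toy application state: running sum of received payloads
        # One FIFO channel per sender, mimicking ordered point-to-point links.
        self.inbox = {src: deque() for src in range(world_size) if src != rank}
        self.checkpoint = None

    def deliver_pending(self, src):
        """Consume messages from src until the marker of this wave is seen."""
        channel = self.inbox[src]
        while True:
            msg = channel.popleft()
            if msg == MARKER:
                return
            self.state += msg


def send(procs, src, dst, payload):
    """Application-level send: append the payload to the (src -> dst) channel."""
    procs[dst].inbox[src].append(payload)


def checkpoint_wave(procs):
    # Phase 1: every process stops sending application messages and
    # pushes a marker on each of its outgoing channels.
    for p in procs:
        for q in procs:
            if q is not p:
                q.inbox[p.rank].append(MARKER)
    # Phase 2: every process flushes its incoming channels up to the marker,
    # then saves its state. All channels are empty in the saved global state.
    for p in procs:
        for src in p.inbox:
            p.deliver_pending(src)
        p.checkpoint = p.state


def restart_from_checkpoint(procs):
    # Coordinated rollback: all processes return to the same wave.
    for p in procs:
        p.state = p.checkpoint
        for channel in p.inbox.values():
            channel.clear()
```

In this model, calling checkpoint_wave(procs) while messages are still queued delivers them first, and restart_from_checkpoint(procs) brings every process back to the state saved by that wave.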

MPICH-VCL (released implementation)

MPICH-VCL features a fault-tolerant protocol designed for applications that depend on extra-low latency. The Chandy-Lamport algorithm used in MPICH-VCL does not introduce any overhead during fault-free execution. However, it requires restarting all nodes (even non-crashed ones) in the case of a single fault. As a consequence, it is less fault resilient than message logging protocols, and is only suited for medium-scale clusters.

  • coordinated checkpoint following Chandy-Lamport algorithm
  • No overhead during fault free execution
  • All nodes (even non faulty) have to be restarted from checkpoint when a crash occurs

Check out the Cluster 2003 presentation and article for more information about the differences between coordinated checkpoint algorithms and message logging algorithms.

Check out the IJHPCA 2005 article for a complete comparison of all the protocols.
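The textbook Chandy-Lamport snapshot that MPICH-VCL builds on can be sketched as a small Python simulation. This is an illustration of the general algorithm under simplifying assumptions, not the MPICH-VCL implementation: Node, Network, transfer and MARKER are hypothetical names, channels are reliable in-memory FIFO queues, and only a single snapshot wave is handled. Unlike a blocking wave, the application keeps exchanging messages while the snapshot is in progress; messages that arrive on a channel after the local state was saved but before that channel's marker are recorded as the channel state.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical control message


class Network:
    """Reliable FIFO channels between every ordered pair of nodes."""

    def __init__(self, n):
        self.channels = {(s, d): deque()
                         for s in range(n) for d in range(n) if s != d}

    def send(self, src, dst, msg):
        self.channels[(src, dst)].append(msg)

    def deliver(self, src, dst, nodes):
        """Deliver the oldest message queued on the (src -> dst) channel."""
        msg = self.channels[(src, dst)].popleft()
        nodes[dst].receive(src, msg, self)


class Node:
    def __init__(self, rank, n):
        self.rank = rank
        self.n = n
        self.state = 0           # toy application state (a token count)
        self.saved_state = None  # local checkpoint of the current wave
        self.recording = {}      # src -> in-flight messages logged for the wave
        self.closed = set()      # sources whose marker has been received

    def start_snapshot(self, net):
        self.saved_state = self.state
        self.recording = {s: [] for s in range(self.n) if s != self.rank}
        self.closed = set()
        for dst in range(self.n):
            if dst != self.rank:
                net.send(self.rank, dst, MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.saved_state is None:  # first marker: checkpoint now
                self.start_snapshot(net)
            self.closed.add(src)          # channel state from src is complete
        else:
            self.state += msg             # the application is never paused
            if self.saved_state is not None and src not in self.closed:
                self.recording[src].append(msg)

    def done(self):
        return self.saved_state is not None and len(self.closed) == self.n - 1


def transfer(nodes, net, src, dst, amount):
    """Toy application step: move `amount` tokens from src to dst."""
    nodes[src].state -= amount
    net.send(src, dst, amount)
```

A quick consistency check for such a model is token conservation: the sum of all saved_state values plus all recorded in-flight amounts should equal the total number of tokens in the system. Restarting after a crash means restoring every node, faulty or not, from this global snapshot.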

MPICH-V1 (deprecated implementation)

MPICH-V1 features a fault-tolerant protocol designed for very large scale computing using heterogeneous networks. Its fault-tolerant protocol is well suited for Desktop Grids and Global Computing, as it can support a very high rate of faults, but it requires a larger bandwidth for stable components to reach good performance.

  • uncoordinated checkpoint
  • remote pessimistic message logging through Channel Memories (stores both message payload and total ordering of receptions)

Check out the SC02 presentation for a complete description.
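The role of a Channel Memory in remote pessimistic message logging can be illustrated with a toy Python model. It is a sketch under assumptions, not the MPICH-V1 code: ChannelMemory and Worker are hypothetical names, one Channel Memory sits in front of one receiver, and the state update is deliberately order-sensitive so that a wrong replay order would be visible. Because the stable component stores both the payloads and the order of receptions, a worker restarted from its own uncoordinated checkpoint can replay exactly the same sequence without involving the senders.

```python
class ChannelMemory:
    """Stable component that logs every message sent to one receiver."""

    def __init__(self):
        self.log = []  # (sender, payload) pairs, in reception order

    def put(self, sender, payload):
        self.log.append((sender, payload))

    def get(self, index):
        """Return the index-th logged message, or None if not yet available."""
        return self.log[index] if index < len(self.log) else None


class Worker:
    """Volatile receiver that reads its messages through a Channel Memory."""

    def __init__(self, channel_memory):
        self.cm = channel_memory
        self.state = 0
        self.delivered = 0        # number of messages consumed so far
        self.checkpoint = (0, 0)  # (state, delivered) at the last checkpoint

    def receive(self):
        entry = self.cm.get(self.delivered)
        if entry is None:
            return False
        _sender, payload = entry
        self.state = self.state * 2 + payload  # order-sensitive update
        self.delivered += 1
        return True

    def take_checkpoint(self):
        # Uncoordinated: no other process is involved.
        self.checkpoint = (self.state, self.delivered)

    def crash_and_restart(self):
        # Roll back to the local checkpoint, then replay from the log.
        self.state, self.delivered = self.checkpoint
        while self.receive():
            pass
```

The cost hinted at in the description is visible here: every payload crosses the Channel Memory, which is why the stable components need enough bandwidth.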

MPICH-V2 (deprecated implementation)

MPICH-V2 features a fault-tolerant protocol designed for large-scale computing on homogeneous networks (typically large clusters). Unlike MPICH-V1, it only requires a very small number of stable components to reach good performance on a cluster. Its uncoordinated checkpoint protocol makes it suitable for large-scale applications, where the large number of nodes induces a low MTBF.

  • uncoordinated checkpoint
  • sender based pessimistic message logging
  • Event Logger stores only message reception causality
  • Can use many Event Loggers, but still reaches good performance with only one

Check out the SC03 presentation for a complete description.
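The split between sender-based payload logging and an Event Logger can be sketched in Python as follows. This is a simplified illustration, not the MPICH-V2 implementation: EventLogger and Rank are hypothetical names, delivery is immediate (no message is ever in flight), recovery restarts from the initial state rather than from a checkpoint, and the acknowledgement of the Event Logger is modelled as a synchronous return value. Payloads stay in the sender's memory; the stable component only stores which message was received in which order.

```python
class EventLogger:
    """Stable component storing only reception events, never payloads."""

    def __init__(self):
        self.events = {}  # receiver rank -> [(sender rank, seq), ...] in order

    def log(self, receiver, sender, seq):
        self.events.setdefault(receiver, []).append((sender, seq))
        return True  # acknowledgement (synchronous in this toy model)

    def history(self, receiver):
        return list(self.events.get(receiver, []))


class Rank:
    def __init__(self, rank, logger):
        self.rank = rank
        self.logger = logger
        self.sent = {}      # (dst rank, seq) -> payload: the sender-based log
        self.next_seq = {}  # dst rank -> next sequence number
        self.state = 0

    def send(self, dst, payload):
        seq = self.next_seq.get(dst.rank, 0)
        self.next_seq[dst.rank] = seq + 1
        self.sent[(dst.rank, seq)] = payload  # keep a copy for later replay
        dst.deliver(self.rank, seq, payload)

    def deliver(self, src, seq, payload):
        # Pessimistic: the reception event is stable before it can
        # influence anything else.
        acked = self.logger.log(self.rank, src, seq)
        assert acked
        self.state = self.state * 2 + payload  # order-sensitive update

    def recover(self, ranks):
        """Replay receptions in logged order, fetching payloads from senders."""
        self.state = 0
        for src, seq in self.logger.history(self.rank):
            payload = ranks[src].sent[(self.rank, seq)]
            self.state = self.state * 2 + payload
```

Only small (sender, sequence number) pairs reach the Event Logger in this model, which is consistent with the claim that a single Event Logger can be enough.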

MPICH-Vcausal (deprecated implementation)

MPICH-Vcausal features a fault-tolerant protocol designed for applications that depend on low latency and must be resilient to a high fault frequency. It combines the advantages of the other message logging protocols (thus providing computation progress even with high fault frequency) with direct communication and absence of acknowledgements (thus avoiding high latency impact).

  • uncoordinated checkpoint
  • sender based causal message logging
  • Causality is piggybacked onto messages, avoiding costly acknowledgements from the Event Logger
  • Event Logger stores only message reception causality

Check out the Cluster 2004 article for a complete description of the causal protocol.
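The piggybacking idea behind causal message logging can be illustrated with a short Python sketch. It is a toy model under assumptions, not the MPICH-Vcausal code: CausalRank and recover_determinants are hypothetical names, delivery is immediate, and the Event Logger (which in practice lets processes stop piggybacking events that are already stable) is left out entirely. Each process attaches the reception events it knows about to every outgoing message, so any process whose state depends on a reception also holds the information needed to replay it.

```python
class CausalRank:
    def __init__(self, rank):
        self.rank = rank
        # Known reception events: (receiver, sender, reception index).
        self.determinants = set()
        self.recv_count = 0

    def send(self, dst, payload):
        # Piggyback every known event; no acknowledgement is awaited.
        dst.deliver(self.rank, payload, frozenset(self.determinants))

    def deliver(self, src, payload, piggyback):
        self.determinants |= piggyback  # learn the sender's causal past
        self.determinants.add((self.rank, src, self.recv_count))
        self.recv_count += 1


def recover_determinants(crashed_rank, survivors):
    """Collect the crashed rank's reception events from surviving processes."""
    found = set()
    for p in survivors:
        found |= {d for d in p.determinants if d[0] == crashed_rank}
    return sorted(found, key=lambda d: d[2])  # replay order
```

For example, if rank 1 receives from ranks 0 and 2 and then sends to rank 2, rank 2 ends up holding both of rank 1's reception events and can supply them if rank 1 crashes.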

MPICH-V is a project founded by

  • INRIA Futurs, projet Grand-Large
  • Laboratoire de Recherche en Informatique
  • Pôle Commun de Recherche en Informatique

Send all comments to: