Introduction
NEW MPICH-V release (Web May 01 2007). We introduce a new fault-tolerant protocol for the MPICH2 implementation: Pcl, a blocking Chandy-Lamport protocol. Protocols for high-performance networks in MPICH2 will come soon.
MPICH-V is a research effort combining theoretical studies, experimental
evaluations and pragmatic implementations, aiming to provide an MPI
implementation based on MPICH that features multiple fault-tolerant protocols.
MPICH-V provides an automatic fault-tolerant MPI library: a totally unchanged application linked with the MPICH-V library is a fault-tolerant application.
The features of MPICH-V make it attractive
for a) large clusters, b) clusters made from collections of nodes
in a LAN environment (Desktop Grids), c) Grid deployments harnessing
several clusters and d) campus- or industry-wide Desktop Grids with
volatile nodes (i.e. all infrastructures featuring a synchronous
network or a controllable area network).
Currently, MPICH-V features four different protocols. We are working on a new implementation of all these protocols inside a generic framework (the ch-v device). Deprecated protocols are no longer maintained and will be replaced soon.
MPICH-PCL (new implementation)
MPICH-PCL features a blocking Chandy-Lamport fault-tolerant protocol in the MPICH2 implementation. It consists of a new channel, called ft-sock, based on the TCP sock channel, and two components, a checkpoint server and a specific dispatcher, supporting large-scale and heterogeneous applications. We also developed a migration capability. A computation can now restart from a given checkpoint wave.
- blocking coordinated checkpoint protocol
- remote checkpoint server
- in MPICH2 implementation
- migration
Check out the SC06 paper for a complete description.
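As a rough illustration of what a blocking coordinated checkpoint wave does, here is a toy simulation in Python. It is not MPICH-PCL code: the names `make_channels` and `blocking_checkpoint_wave`, the in-memory FIFO channels, and the integer "application state" are all assumptions made for the example. The point it shows is that once every process has stopped sending and drained its incoming channels up to a marker, the channels are empty and the local states alone form a consistent checkpoint.

```python
from collections import deque

MARKER = "MARKER"  # control message that delimits a checkpoint wave (hypothetical)


def make_channels(n):
    """One FIFO channel per ordered pair of distinct ranks."""
    return {(s, d): deque() for s in range(n) for d in range(n) if s != d}


def blocking_checkpoint_wave(states, channels):
    """Run one blocking coordinated wave and return the global checkpoint."""
    n = len(states)
    # Phase 1: every process stops sending application data and
    # pushes a marker on each of its outgoing channels.
    for src in range(n):
        for dst in range(n):
            if src != dst:
                channels[(src, dst)].append(MARKER)
    # Phase 2: every process drains each incoming channel up to the marker,
    # delivering the application messages that were still in flight.
    for (src, dst), chan in channels.items():
        msg = chan.popleft()
        while msg != MARKER:
            states[dst] += msg  # toy "application": accumulate payloads
            msg = chan.popleft()
    # Phase 3: all channels are empty, so the local states alone
    # form a consistent global checkpoint.
    return list(states)


def demo():
    states = [0, 0, 0]
    channels = make_channels(3)
    channels[(0, 1)].append(5)        # in flight when the wave starts
    channels[(2, 0)].extend([1, 2])   # in flight when the wave starts
    checkpoint = blocking_checkpoint_wave(states, channels)
    states[1] += 100                  # computation continues after the wave
    restored = list(checkpoint)       # restart: every process reloads its image
    channels_empty = all(not chan for chan in channels.values())
    return checkpoint, restored, channels_empty
```

Calling `demo()` runs one wave over three simulated processes and then restores from it. Because no channel state has to be saved, a restart (or a migration to another node) only needs the process images held by the checkpoint server.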
MPICH-VCL (released implementation)
MPICH-VCL features a fault-tolerant protocol designed for applications that depend on extremely low latency. The Chandy-Lamport algorithm used in MPICH-VCL does not introduce any overhead during fault-free execution. However, it requires restarting all nodes (even non-crashed ones) in the case of a single fault. As a consequence, it is less fault resilient than message logging protocols, and is only suited for medium-scale clusters.
- coordinated checkpoint following Chandy-Lamport algorithm
- No overhead during fault free execution
- All nodes (even non faulty) have to be restarted from checkpoint when a crash occurs
Check out the Cluster 2003 presentation and article for more information about the differences between the coordinated checkpoint algorithm and message logging algorithms.
Check out the IJHPCA 2005 article for a complete comparison of all the protocols.
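The textbook (non-blocking) Chandy-Lamport snapshot can be sketched as follows. This is a toy model, not the MPICH-VCL implementation: the `Node` class, the money-transfer "application" and the `deliver_all` scheduler are assumptions chosen so that consistency is easy to check (the total amount of money must be preserved). Nodes keep computing while markers travel, and messages that cross the wave are recorded as channel state.

```python
from collections import deque

MARKER = object()  # snapshot marker (hypothetical control message)


class Node:
    def __init__(self, rank, balance, peers):
        self.rank, self.balance, self.peers = rank, balance, peers
        self.saved = None        # local checkpoint, None until the wave arrives
        self.recording = {}      # src -> messages logged while the channel is open
        self.channel_state = {}  # src -> in-flight messages kept in the snapshot

    def send(self, dst, amount, net):
        self.balance -= amount
        net[(self.rank, dst)].append(amount)

    def start_snapshot(self, net):
        # Save the local state, then send a marker on every outgoing channel.
        self.saved = self.balance
        self.recording = {p: [] for p in self.peers}
        for p in self.peers:
            net[(self.rank, p)].append(MARKER)

    def receive(self, src, msg, net):
        if msg is MARKER:
            if self.saved is None:       # first marker: join the wave
                self.start_snapshot(net)
            self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg
            if src in self.recording:    # message crossed the wave
                self.recording[src].append(msg)

    def rollback(self):
        # Restore the checkpoint and replay the recorded in-flight messages.
        in_flight = sum(sum(msgs) for msgs in self.channel_state.values())
        self.balance = self.saved + in_flight


def deliver_all(nodes, net):
    """Deliver messages channel by channel until the network is quiet."""
    progress = True
    while progress:
        progress = False
        for (src, dst), chan in net.items():
            if chan:
                nodes[dst].receive(src, chan.popleft(), net)
                progress = True


def demo():
    ranks = range(3)
    net = {(s, d): deque() for s in ranks for d in ranks if s != d}
    nodes = [Node(r, 100, [p for p in ranks if p != r]) for r in ranks]
    nodes[0].send(1, 10, net)
    nodes[2].send(0, 20, net)
    nodes[0].start_snapshot(net)   # node 0 initiates; nobody stops computing
    nodes[1].send(0, 5, net)       # sent before node 1 has seen a marker
    deliver_all(nodes, net)
    nodes[1].send(2, 50, net)      # work done after the wave completed
    deliver_all(nodes, net)
    for node in nodes:             # one crash rolls back every node
        node.rollback()
    saved = [n.saved for n in nodes]
    return saved, nodes[0].channel_state, sum(n.balance for n in nodes)
```

In `demo()`, the transfer made after the wave is lost for every node on rollback, including the ones that did not fail. This is the cost that message logging protocols avoid.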
MPICH-V1 (deprecated implementation)
MPICH-V1 features a fault-tolerant protocol designed for very large-scale computing using heterogeneous networks. Its fault-tolerant protocol is well suited for Desktop Grids and Global computing as it can support a very high rate of faults, but it requires a larger bandwidth for stable components to reach good performance.
- uncoordinated checkpoint
- remote pessimistic message logging through Channel Memories (stores both message payload and total ordering of receptions)
Check out the SC02 Presentation for a complete description.
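The role of a Channel Memory can be pictured with the toy model below. It is only a sketch under simplifying assumptions (one Channel Memory per receiver, everything in one Python process), and the `ChannelMemory` and `Worker` classes are hypothetical names, not MPICH-V1 interfaces. Every message passes through the stable Channel Memory, which keeps both the payload and the order of receptions, so a restarted receiver replays exactly the same sequence regardless of what the senders are doing.

```python
class ChannelMemory:
    """Stable component: stores each payload in total reception order."""

    def __init__(self):
        self.log = []  # list of (sender, payload)

    def put(self, sender, payload):
        self.log.append((sender, payload))

    def get(self, index):
        # Return the index-th message ever received, or None if not there yet.
        return self.log[index] if index < len(self.log) else None


class Worker:
    """Volatile computing node that receives only through its Channel Memory."""

    def __init__(self, cm, next_index=0, state=None):
        self.cm = cm
        self.next_index = next_index          # number of messages consumed
        self.state = list(state or [])        # toy application state

    def receive(self):
        msg = self.cm.get(self.next_index)
        if msg is not None:
            self.next_index += 1
            self.state.append(msg)
        return msg

    def checkpoint(self):
        # Uncoordinated: taken locally, without talking to other workers.
        return self.next_index, list(self.state)

    @classmethod
    def restart(cls, cm, ckpt):
        next_index, state = ckpt
        return cls(cm, next_index, state)


def demo():
    cm = ChannelMemory()
    for sender, payload in [(1, "a"), (2, "b"), (1, "c")]:
        cm.put(sender, payload)
    worker = Worker(cm)
    worker.receive()
    ckpt = worker.checkpoint()
    worker.receive()
    worker.receive()
    before_crash = list(worker.state)
    # The worker crashes; a new instance restarts from its own checkpoint
    # and replays the remaining messages from the Channel Memory.
    recovered = Worker.restart(cm, ckpt)
    while recovered.receive() is not None:
        pass
    return before_crash, recovered.state
```

Since payloads transit through the Channel Memory, its bandwidth bounds application performance, which matches the bandwidth requirement on stable components mentioned above.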
MPICH-V2 (deprecated implementation)
MPICH-V2 features a fault-tolerant protocol designed for large-scale computing on homogeneous networks (typically large clusters). Unlike MPICH-V1, it requires only a very small number of stable components to reach good performance on a cluster. Its uncoordinated checkpoint protocol makes it suitable for large-scale applications, where the large number of nodes induces a low MTBF.
- uncoordinated checkpoint
- sender based pessimistic message logging
- Event Logger stores only message reception causality
- Can use many Event Loggers, but still reaches good performance with only one
Check out the SC03 Presentation for a complete description.
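The division of labour between senders and the Event Logger can be sketched with a toy model. The `EventLogger`, `Proc` and `recover` names, the per-sender sequence numbers, and the synchronous in-process "network" are all assumptions for illustration and do not reflect MPICH-V2 internals. Payloads stay in the sender's memory, the Event Logger stores only which message was received in which order, and recovery combines the two.

```python
class EventLogger:
    """Stable component: stores reception events only, never payloads."""

    def __init__(self):
        self.events = {}  # receiver rank -> list of (sender rank, sequence number)

    def log(self, receiver, sender, ssn):
        self.events.setdefault(receiver, []).append((sender, ssn))


class Proc:
    def __init__(self, rank, logger):
        self.rank = rank
        self.logger = logger
        self.sent = {}       # (dst rank, ssn) -> payload: the sender-based log
        self.next_ssn = 0
        self.delivered = []  # toy application state: payloads in delivery order

    def send(self, dst, payload):
        ssn = self.next_ssn
        self.next_ssn += 1
        self.sent[(dst.rank, ssn)] = payload  # keep a copy for possible replay
        dst.deliver(self.rank, ssn, payload)

    def deliver(self, sender, ssn, payload):
        # Pessimistic: in this model the event is made stable before the
        # application sees the message.
        self.logger.log(self.rank, sender, ssn)
        self.delivered.append(payload)

    def checkpoint(self):
        return list(self.delivered)


def recover(rank, ckpt, logger, senders):
    """Restart `rank` from its checkpoint and replay later receptions in order."""
    proc = Proc(rank, logger)
    proc.delivered = list(ckpt)
    for sender, ssn in logger.events[rank][len(ckpt):]:
        # Ask the original sender to resend the payload it logged.
        proc.delivered.append(senders[sender].sent[(rank, ssn)])
    return proc


def demo():
    logger = EventLogger()
    a, b, c = Proc(0, logger), Proc(1, logger), Proc(2, logger)
    a.send(c, "a0")
    ckpt = c.checkpoint()   # uncoordinated checkpoint of c alone
    b.send(c, "b0")
    a.send(c, "a1")
    before_crash = list(c.delivered)
    recovered = recover(2, ckpt, logger, {0: a, 1: b})
    return before_crash, recovered.delivered, logger.events[2]
```

Only the crashed process rolls back here; `a` and `b` keep their state and merely resend logged payloads. The Event Logger holds a few integers per message, which is one way to see why a single Event Logger can be enough.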
MPICH-VCausal (deprecated implementation)
MPICH-VCausal features a fault-tolerant protocol designed for low-latency-dependent applications that must be resilient to a high fault frequency. It combines the advantages of the other message logging protocols (thus providing computation progress even with high fault frequency) with direct communication and absence of acknowledgements (thus avoiding high latency impact).
- uncoordinated checkpoint
- sender based causal message logging
- Causality is piggybacked onto messages, avoiding costly acknowledgements from the Event Logger
- Event Logger stores only message reception causality
Check out the Cluster 2004 article for a complete description of the causal protocol.
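The piggybacking idea can be illustrated with a toy model. This is a simplified sketch, not the MPICH-VCausal protocol: the `Proc` and `EventLogger` classes, the `(receiver, reception index, sender)` determinant format and the `recovered_order` helper are assumptions for the example. A process never waits for the Event Logger before sending. Instead, each outgoing message carries the reception events that are not yet known to be stable, so any process that depends on them also holds a copy.

```python
class EventLogger:
    """Stable component: stores reception events (determinants) only."""

    def __init__(self):
        self.stable = set()

    def store(self, determinants):
        self.stable |= determinants
        return set(determinants)  # acknowledgement: what is now stable


class Proc:
    def __init__(self, rank):
        self.rank = rank
        self.recv_count = 0
        self.unstable = set()  # determinants not yet acknowledged as stable

    def send(self, dst, payload):
        # No wait for the Event Logger: piggyback unstable determinants.
        dst.receive(self.rank, payload, frozenset(self.unstable))

    def receive(self, sender, payload, piggyback):
        self.unstable |= piggyback                    # keep a copy of causal past
        determinant = (self.rank, self.recv_count, sender)
        self.recv_count += 1
        self.unstable.add(determinant)

    def flush(self, logger):
        # Done asynchronously in this model; acknowledged determinants
        # no longer need to be piggybacked.
        self.unstable -= logger.store(self.unstable)


def recovered_order(rank, logger, survivors):
    """Rebuild the order in which `rank` received messages, as a list of senders."""
    known = set(logger.stable)
    for proc in survivors:
        known |= proc.unstable
    return [sender for (r, _index, sender) in sorted(known) if r == rank]


def demo():
    logger = EventLogger()
    p0, p1, p2 = Proc(0), Proc(1), Proc(2)
    p0.send(p1, "x")   # p1 creates determinant (1, 0, 0)
    p2.send(p1, "y")   # p1 creates determinant (1, 1, 2)
    p1.send(p2, "z")   # carries both determinants to p2
    # p1 crashes before contacting the Event Logger.
    from_survivors = recovered_order(1, logger, survivors=[p0, p2])
    p2.flush(logger)
    from_logger = recovered_order(1, logger, survivors=[])
    return from_survivors, len(p2.unstable), from_logger
```

In `demo()`, the reception order of the crashed process is recovered from a surviving peer even though the Event Logger had not stored anything yet. After `flush`, the same information comes from the Event Logger and the piggyback set shrinks back to empty.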