In large computing clusters, loss of computation as a result of a failure becomes very costly. Our goal is to build a robust cluster computing environment that can recover from failures efficiently (with low overhead during failure-free execution) and continue the distributed computation. Because Distributed Shared Memory (DSM) provides a familiar and easy-to-use programming environment for cluster computing, as a first step we chose to integrate fault tolerance in a DSM system.
In this context, the choice of the base DSM coherence protocol is of great importance. First, the protocol must scale with the size of the cluster. Second, the protocol itself must have low memory overhead, freeing memory for fault tolerance related tasks (for example, logging). Third, it must keep a small amount of state and therefore require little state to be checkpointed or logged. The Home-based Lazy Release Consistency (HLRC) protocol has been shown to have these properties, so we consider it a good choice for providing a scalable and fault-tolerant cluster computing platform.
We have designed support for integrating fault tolerance into a DSM system based on the HLRC protocol. Our fault-tolerant HLRC DSM uses independent checkpointing and volatile memory logging for rollback recovery from single node failures.
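The recovery scheme above can be sketched in a few lines. This is a hypothetical simplification, not the system's actual implementation: the classes, the page-granularity updates, and the single sender/receiver pair are all illustrative. The key point it captures is why a volatile (in-memory) log survives a single node failure: the log is kept at the *sending* node, so when the receiver crashes, it restores its last independent checkpoint and replays the updates re-sent by the survivor.

```python
import copy

class Sender:
    """A surviving node: keeps a volatile log of the updates it sends."""
    def __init__(self):
        self.log = []  # in sender's memory, so it survives a receiver failure

    def send_update(self, receiver, page, value):
        self.log.append((page, value))   # log the update before delivering it
        receiver.apply(page, value)

class Receiver:
    """A node that may fail: applies updates and checkpoints independently."""
    def __init__(self):
        self.pages = {}
        self.ckpt = {}

    def apply(self, page, value):
        self.pages[page] = value

    def checkpoint(self):
        # Independent checkpoint: no coordination with other nodes.
        self.ckpt = copy.deepcopy(self.pages)

    def crash_and_recover(self, survivor_log):
        # Roll back to the last local checkpoint, then replay logged updates.
        # (In the real system only post-checkpoint entries need replaying;
        # replaying all of them is harmless here because updates overwrite.)
        self.pages = copy.deepcopy(self.ckpt)
        for page, value in survivor_log:
            self.pages[page] = value
```

A two-node run: the receiver checkpoints between two updates to the same page, fails, and recovers to the latest value by replaying the sender's log.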
HLRC is a highly efficient DSM protocol (low communication and memory overhead), and therefore any design for fault tolerance support should not adversely impact its performance. Our recovery support introduces low overhead during failure-free operation by using volatile memory for logging and by aggressively exploiting information already present in the base DSM protocol.
We believe a log-based approach has several advantages over alternatives that are not based on coordinated checkpointing, such as communication-induced checkpointing and uncoordinated checkpointing with dependency tracking. Logging avoids the rollback propagation and wasted computation characteristic of uncoordinated checkpointing, and allows output to be committed early in the execution without running a special protocol. By using pure independent checkpointing, in contrast with communication-induced checkpointing, we maintain complete autonomy of nodes in deciding when and what to checkpoint, enabling local optimizations on the checkpointed state. Because no forced checkpoints are taken, there is no risk of overly frequent or useless checkpoints.
To reduce logging overhead we use information available in the DSM protocol, so that not all messages need to be logged (as would be the case in a pure message-passing system using log-based recovery). For example, our scheme avoids the expensive logging of page transfers and page requests. We also reduce the size of individual checkpoints by making only the home nodes responsible for checkpointing their homed pages. All our checkpointing, logging, and log-management algorithms are lightweight, add no extra messages to the base DSM protocol, and do not require global synchronization.
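The two reductions above can be illustrated as follows. The message categories and function names are our assumptions for the sketch, not the system's actual message types: the idea is that traffic whose content can be reconstructed from checkpointed protocol state (page requests and page transfers) is never logged, and that each shared page is saved by exactly one node, its home.

```python
# Messages that carry update information must enter the volatile log;
# request/transfer traffic can be regenerated from checkpointed state.
LOGGED = {"write_notice", "diff"}
RECONSTRUCTIBLE = {"page_request", "page_transfer"}

def must_log(msg_type):
    """Per-message logging decision: skip traffic the protocol can rebuild."""
    return msg_type in LOGGED

def checkpoint_pages(pages, home_of, my_id):
    """A node checkpoints only the pages it homes, so no page is saved
    twice across the cluster and individual checkpoints stay small."""
    return {p: v for p, v in pages.items() if home_of[p] == my_id}
```

For instance, a node holding pages "a" and "b" but homing only "a" would checkpoint "a" alone, leaving "b" to its own home node.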
We have investigated the feasibility and efficiency of building a fault-tolerant DSM system that can recover from failures using only independent checkpointing and fast log-based recovery. Independent checkpointing is particularly interesting for computing on large local clusters and on clusters interconnected by a wide-area network, since taking a coordinated checkpoint in such cases may be either too expensive or simply impractical.
The challenge in building such a system lies in controlling the size of the logged and checkpointed state. Our goal is to achieve this without forcing processes to coordinate, in order to retain the advantages of independent checkpointing. We have designed a scheme for lazy log trimming (LLT) and checkpoint garbage collection (CGC) in a fault-tolerant HLRC DSM system that uses exclusively protocol information and does not use global synchronization. In order to evaluate our scheme we have extended a HLRC system to include our LLT and CGC algorithms along with a local checkpointing policy. Using this system we performed experiments and obtained results indicating that log and checkpoint management can be effectively performed under this scheme.
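A minimal sketch of the lazy trimming idea follows. The entry fields and the `stable` vector are illustrative assumptions, not the system's actual data structures: each log entry records which node would need it for replay and in which protocol interval it was produced, and `stable` maps each node to the latest interval known (from lazily propagated protocol information) to be covered by that node's checkpoint. Entries at or below that bound can never be needed again and are dropped; stale `stable` information merely delays trimming, so no global synchronization is required.

```python
def trim_log(log, stable):
    """Lazy log trimming (LLT) sketch: keep only entries that some node
    might still need for replay after a failure. An entry for node n,
    produced in interval i, is garbage once stable[n] >= i, i.e. once
    n's checkpoint is known to cover that interval."""
    return [e for e in log if e["interval"] > stable[e["node"]]]
```

Because the bound is a lower estimate, trimming is always safe: learning about a checkpoint late only keeps log entries around a little longer.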
The figures below show the variation of the log size on one of the eight nodes of the cluster, sampled each time a checkpoint is taken during runs of three update-intensive applications (Ocean, Water-Spatial and Water-Nsquared) from the SPLASH-2 benchmark suite. These applications generate many updates that must be logged, so they provide good test cases for our LLT and CGC algorithms. The checkpointing policy we used enforced an aggressive limit on the volatile-memory log of 10% of the shared memory footprint. As can be seen, our LLT scheme limits the amount of logged state without recourse to global coordination.
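The local checkpointing policy used in these experiments can be sketched as a simple threshold test; the function and parameter names are illustrative, and the real policy may weigh other factors, but the 10% figure is the limit stated above.

```python
def should_checkpoint(log_bytes, shared_footprint_bytes, limit=0.10):
    """Local checkpointing policy sketch: take a checkpoint (after which
    the pre-checkpoint log entries become trimmable) once the volatile
    log exceeds a fixed fraction of the shared memory footprint.
    No other node is consulted, preserving checkpoint independence."""
    return log_bytes > limit * shared_footprint_bytes
```

Each node evaluates this purely locally, which is what keeps the policy compatible with independent checkpointing.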
Preliminary results of this work were published in the Rutgers University Department of Computer Science Technical Report DCS-TR-409, February 2000.
We are currently implementing a prototype of the system on a PC cluster using user-level communication. The current design uses a centralized manager for computing and propagating the information needed by the LLT and CGC algorithms. We intend to explore the impact that an alternative (lazier) log-trimming scheme may have on the size of the logged state. Other future plans include addressing certain limitations of the current scheme and exploring various checkpointing policies.