The goals of our work on DSM are:
to leverage the performance characteristics of high-speed networks to make the PC cluster a competitive platform for parallel computing;
to use multithreading as a means to achieve good scalability for parallel applications;
to enable the software DSM system to adapt to changing cluster configurations;
to enable existing parallel applications written for shared-memory multiprocessors to use software DSM.
Our work continues the research in Home-based Lazy Release Consistent (HLRC) DSM protocols started at Princeton, focusing on scalability, fault tolerance, adaptive protocols, and non-scientific applications. We implement the shared memory abstraction as a software layer on top of a fast communication library. With this layer, a cluster of commodity PCs/workstations can provide the same programming interface as a hardware cache-coherent machine. The critical question is: at what level of performance? Relaxed consistency models, such as release consistency (RC), are a well-established solution for reducing communication traffic in software DSM systems and thus improving performance. Recently, such relaxed consistency models have gained wider acceptance among programmers of parallel applications.
To provide good scalability over a larger class of applications, we are working towards exploiting the benefits of multithreading in our HLRC protocol. We are particularly interested in using multithreading for dynamic reconfiguration of the cluster in order to optimize the performance of parallel applications. We have also explored techniques for optimizing the HLRC protocol by adapting its behavior according to the sharing patterns exhibited by parallel applications. These techniques are reminiscent of the Adaptive DSM System and include home migration, adaptation between single- and multiple-writer protocols, and adaptation between invalidate and update protocols. The optimized protocol is currently called the Home-based Adaptive Protocol (HAP). This work is done in collaboration with the Parallel Computing Lab of COPPE Systems Engineering/UFRJ, Brazil.
Our fault tolerance research targets scalable distributed programming environments using the DSM abstraction. Examples of such environments are large LAN-based clusters and meta-clusters interconnected by a wide-area network. Fault tolerance support should not add too much overhead during the failure-free operation of the system, and the mechanisms it uses must work without global coordination, which may be either expensive or impractical in the targeted environments. We have designed a fault-tolerant DSM based on the HLRC protocol that addresses these issues.
Finally, we are also investigating new application domains for software DSM, such as parallel data mining and continuous media applications. We have already developed a parallel data mining engine that achieves comparable performance on a high-end multiprocessor and a cluster of PCs.
We have developed a prototype of HLRC using Virtual Interface Architecture (VIA) on a cluster of PCs running Linux. A source-code distribution of our HLRC protocol for Linux/VIA is available for download. The next release will include versions of HLRC over VIA on Windows and HLRC over MPI on Linux, and the corresponding versions for SMP PCs.
Adaptive Techniques for Home-Based Software DSMs
Lauro Whately, Raquel Pinto, Muralidharan Rangarajan, Liviu Iftode, Ricardo Bianchini and Claudio L. Amorim. Proceedings of the 13th Symposium on Computer Architecture and High Performance Computing, September 2001
Software Distributed Shared Memory over Virtual Interface Architecture: Implementation and Performance
Murali Rangarajan and Liviu Iftode. Proceedings of the Annual Linux Showcase, Extreme Linux Workshop, Atlanta, October 10-12. Rutgers University, Department of Computer Science Technical Report, DCS-TR-413, April 2000
Multithreaded HLRC DSM
Murali Rangarajan, Thu Nguyen and Liviu Iftode. Workshop on Shared Memory Multi-processors, Atlanta, May 1999