69  

A Distributed Software Agent Approach to High Available Clustered Computing--Stottler Henke Associates, Inc., 1660 South Amphlett Boulevard, Suite 350, San Mateo, CA  94402; 650-655-7242, www.stottlerhenke.com
Dr. Charles C. Earl, Principal Investigator, earl@stottlerhenke.com 
Ms. Melissa Thiemmedh, Business Official, thiemmedh@stottlerhenke.com 
DOE Grant No. DE-FG02-03ER83839
Amount:  $750,000

Distributed scientific programs that run on large-scale, multi-institutional, heterogeneous computing systems, commonly referred to as Grids, are not yet sufficiently reliable.  Job failure is a major impediment to achieving acceptable throughput for cluster and Grid computing tasks.  This project will develop an infrastructure in which a collection of reasoning agents, distributed throughout the system, monitors running applications and recovers from execution failures.  The agents will track the progress of batch jobs and determine whether and how to restart those jobs when they fail.  Phase I demonstrated the viability of the agent system as a means of specifying and executing fault management policies.  In cooperation with prospective users at Lawrence Berkeley National Laboratory (LBNL), a set of use cases for failure detection and recovery was developed and implemented on a small prototype cluster.  A strategy was also developed for scaling these results to larger cluster and Grid environments.  Phase II will extend the prototype system so that it runs reliably on the NERSC Science Data Management Group (SDMG) cluster and the STAR production cluster.  That system will then serve as the basis for an architecture suited to Grid computing environments.  The end result of the Phase II effort will be a system that enhances the reliability of jobs running in such environments.  
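
As a rough illustration of what a fault management policy of this kind might look like, the following Python sketch shows a single monitoring agent that inspects a batch job and decides whether to restart it.  It is not the project's implementation: the names Job, RestartPolicy, and MonitorAgent, the failure reasons, and the restart limit are all hypothetical and were chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical record of a batch job as seen by an agent."""
    job_id: str
    state: JobState = JobState.RUNNING
    failure_reason: str = ""
    restarts: int = 0


@dataclass
class RestartPolicy:
    """Assumed policy: retry only certain failures, a limited number of times."""
    max_restarts: int = 3
    retryable: frozenset = frozenset({"node_down", "timeout"})

    def decide(self, job: Job) -> str:
        # Jobs that have not failed need no action.
        if job.state is not JobState.FAILED:
            return "none"
        # Failures outside the retryable set are not worth restarting.
        if job.failure_reason not in self.retryable:
            return "abandon"
        # Give up once the restart budget is spent.
        if job.restarts >= self.max_restarts:
            return "abandon"
        return "restart"


class MonitorAgent:
    """One agent applying a policy to the jobs it watches."""

    def __init__(self, policy: RestartPolicy):
        self.policy = policy

    def check(self, job: Job) -> str:
        action = self.policy.decide(job)
        if action == "restart":
            # A real agent would resubmit to the batch system here.
            job.restarts += 1
            job.state = JobState.RUNNING
            job.failure_reason = ""
        return action
```

In this sketch the policy is kept separate from the agent so that different recovery rules could be swapped in per cluster or per job type without changing the monitoring code.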

Commercial Applications and Other Benefits as described by awardee:  The first and most straightforward application would be job monitoring and recovery for high-throughput jobs submitted to clusters and Grids.  A recent estimate anticipated that the cluster market will reach $36 billion in sales by the year 2005.  The second viable application would be to use the platform to support the monitoring and recovery of Java Enterprise (J2EE) applications running in Grid, cluster, and other distributed environments.