69
A Distributed Software Agent Approach to High Available Clustered Computing--Stottler Henke Associates, Inc., 1660 South Amphlett Boulevard, Suite 350, San Mateo, CA 94402; 650-655-7242, www.stottlerhenke.com
Dr. Charles C. Earl, Principal Investigator, ear1@stottlerhenke.com
Ms. Melissa Thiemmedh, Business Official, thiemmedh@stottlerhenke.com
DOE Grant No. DE-FG02-03ER83839
Amount: $750,000
The reliability of distributed scientific computer programs run on large-scale, multi-institutional, heterogeneous computing systems, commonly referred to as Grids, requires improvement. Job failure is a major impediment to achieving acceptable levels of throughput for cluster and Grid computing tasks. This project will develop an infrastructure in which a collection of reasoning agents, distributed throughout the system, monitors the progress of running batch jobs, detects execution failures, and determines if and how to restart jobs that fail. Phase I demonstrated the viability of the agent system as a way of specifying and executing fault management policies. In cooperation with prospective users at Lawrence Berkeley National Laboratory (LBNL), a set of use cases for failure detection and recovery was developed and implemented on a small prototype cluster. A strategy was also developed for scaling these results to larger cluster and Grid environments. The end result of the Phase II effort will be a system that enhances the reliability of jobs running in Grid computing environments. Phase II will extend the prototype system so that it runs reliably on the NERSC Science Data Management Group (SDMG) cluster and the STAR production cluster. That system will then serve as the basis for an architecture that can be used in Grid computing environments.
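The monitor-and-restart behavior described above can be illustrated with a minimal sketch. This is not the project's implementation: the names `Job`, `RestartPolicy`, and `MonitorAgent`, and the restart limit, are hypothetical and chosen only to show how an agent might apply a fault management policy to decide if a failed job should be restarted.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    FAILED = "failed"
    DONE = "done"


@dataclass
class Job:
    """Hypothetical record of a batch job tracked by an agent."""
    job_id: str
    state: JobState = JobState.RUNNING
    restarts: int = 0


@dataclass
class RestartPolicy:
    """Hypothetical fault management policy: restart a failed job up to a limit."""
    max_restarts: int = 2

    def should_restart(self, job: Job) -> bool:
        return job.state is JobState.FAILED and job.restarts < self.max_restarts


@dataclass
class MonitorAgent:
    """Hypothetical agent that inspects jobs and applies the policy."""
    policy: RestartPolicy
    log: list = field(default_factory=list)

    def check(self, job: Job) -> None:
        if self.policy.should_restart(job):
            job.restarts += 1
            # Stand-in for resubmitting the job to the batch system.
            job.state = JobState.RUNNING
            self.log.append(("restart", job.job_id))
        elif job.state is JobState.FAILED:
            # Restart budget exhausted: record the failure and leave the job failed.
            self.log.append(("give_up", job.job_id))
```

In a real deployment, many such agents would run across the cluster, and the policy would likely depend on the failure cause rather than a simple restart count; that detail is assumed away here.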
Commercial Applications and Other Benefits as described by awardee:
The first and most straightforward application would be job monitoring and recovery for high-throughput jobs submitted to clusters and Grids. A recent estimate anticipated that the cluster market would reach $36 billion in sales by 2005. A second viable application would be to use the platform to support the monitoring and recovery of Java Enterprise (J2EE) applications running in Grid, cluster, and other distributed environments.