Extending the ACTS Toolkit for Wide Area Execution:

Supporting DOE Applications on Computational Grids


Project Goals

The goal of the project "Extending the ACTS Toolkit for Wide Area Execution: Supporting DOE Applications on Computational Grids" is to expand the functionality and applicability of the ACTS Toolkit to support execution in networked computing systems. The proposed work is motivated by the requirements of DOE grand challenge, collaboratory, and other applications that need to use distributed supercomputers, scientific instruments, storage systems, and/or display devices. These applications can use ACTS Toolkit libraries within individual components, but also require new mechanisms for such purposes as locating and scheduling computers, invoking components on those computers, and communicating results between computers and file systems.

Fundamental to this project is the vision of a "computational grid," an integrated computing environment that provides convenient but high-performance access to geographically distributed resources. This project will both provide the basic mechanisms required to build computational grids, and make these mechanisms accessible to DOE applications by integrating them with ACTS Toolkit libraries. These ambitious goals are possible because we build on the results of the Globus project at Argonne and USC/ISI, which is already developing relevant software technologies. Hence, we are able to propose an ACTS project that will, with only moderate funding, achieve four tasks:

  1. Integrate Globus mechanisms with ACTS Toolkit components, hence enabling distributed execution of ACTS Toolkit applications;
  2. Develop new grid functionality required for DOE applications: object code management, advanced resource brokers, remote I/O, certificate-based security, and instrumentation and monitoring;
  3. Use the ACTS/Globus software to construct a prototype DOE computational grid, linking resources at Argonne, LBNL, LANL, and other sites; and
  4. Demonstrate and evaluate the benefits of this computational grid and the associated ACTS/Globus system in DOE grand challenge and collaboratory applications.

This work addresses ACTS project goals in four ways. It expands capabilities of the ACTS Toolkit interface by restructuring existing ACTS Toolkit components to utilize Globus services, hence enabling execution in networked environments. It builds new toolkit capabilities by developing complementary libraries that interoperate with ACTS Toolkit to provide, for example, object code management and remote I/O. It supports evaluation of ACTS Toolkit capabilities by allowing Toolkit components to be applied to new DOE applications, such as the APS grand challenge. Finally, it links ACTS and Collaboratories activities, by addressing the integration of high-performance computing and networked execution.

We believe that the project is also significant in a broader sense. Computational grids seem certain to have a major impact on the practice of science and engineering. DOE researchers have been early advocates of the concept and leaders in the development of key technologies. Yet despite tremendous intellectual and physical resources, there is no "DOE grid" that is usable by ordinary scientists. This project takes a first step in that direction and, we hope, will lay the groundwork for a wide range of exciting projects in the future.

This project involves both university researchers (at USC Information Sciences Institute) and DOE laboratory researchers (at Argonne National Laboratory).

Progress

We report briefly on progress in four areas.

Grid Technology Research and Development

We directed the design, implementation, and deployment of a distributed fault detection service for wide area applications. Fault detection is the foundation on which a variety of fault recovery strategies can be built. We reported these results in a paper presented at the High-Performance Distributed Computing Conference. In our initial implementation, faults were reported to a generic fault reporting service. In the second year of the project, this was extended by the development of a fault-reporting API. Using this API, it is now possible to write application-specific fault recovery strategies.
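To illustrate how a fault-reporting API can support application-specific recovery, the following minimal sketch assumes a heartbeat-based detector with a timeout. The class FaultDetector, its methods, and the heartbeat mechanism itself are hypothetical names and assumptions chosen for illustration; they are not the Globus API.

```python
import time
from typing import Callable, Dict, List, Set


class FaultDetector:
    """Hypothetical heartbeat-based fault detector (illustration only, not the Globus API)."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout                      # seconds of silence before a fault is declared
        self.clock = clock                          # injectable clock, which simplifies testing
        self.last_seen: Dict[str, float] = {}       # component name -> time of last heartbeat
        self.failed: Set[str] = set()               # components already reported as failed
        self.handlers: List[Callable[[str], None]] = []

    def on_fault(self, handler: Callable[[str], None]) -> None:
        """Register an application-specific recovery callback."""
        self.handlers.append(handler)

    def heartbeat(self, component: str) -> None:
        """Record that a component is alive; a returning component is no longer failed."""
        self.last_seen[component] = self.clock()
        self.failed.discard(component)

    def check(self) -> List[str]:
        """Report components whose heartbeat is overdue, each at most once per failure."""
        now = self.clock()
        newly_failed = [
            name for name, seen in self.last_seen.items()
            if name not in self.failed and now - seen > self.timeout
        ]
        for name in newly_failed:
            self.failed.add(name)
            for handler in self.handlers:
                handler(name)
        return newly_failed
```

An application might register a handler that restarts a failed component or reassigns its work, for example detector.on_fault(restart_component), where restart_component is supplied by the application.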

We designed the Globus Access to Secondary Storage (GASS) service, a Grid service that provides applications with efficient access to remotely stored data. Our initial design and implementation focused primarily on remote reading and writing of whole files. Building on our experience with GASS, we have initiated the design of a second-generation service that provides mechanisms for access to parts of files, supports the management of file replication, and addresses metadata descriptions of file contents. Both GASS and our second-generation remote data management system have been documented in research papers published at I/O-related conferences.

We have designed and implemented an executable management system called the Globus Executable Management (GEM) system. GEM is used by the Globus startup mechanisms to pre-stage an executable on a remote host, enabling a Globus application to run on that host even if the application is not installed there. The runtime component of GEM facilitates target-specific application naming, enabling different versions of an executable to be staged depending on the target architecture type, operating system version, etc. GEM also facilitates the construction of network-based executable repositories, which simplify the process of resource allocation in Grid environments.
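Target-specific naming can be sketched as a lookup keyed on the application name and the properties of the target host. The sketch below is an assumption-laden illustration, not GEM's actual interface: the Target and ExecutableRepository classes, the fallback rule (exact operating system version first, then any version), and the example locations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Target:
    """Properties of the host on which the executable will run (hypothetical model)."""
    arch: str
    os: str
    os_version: str


class ExecutableRepository:
    """Hypothetical repository mapping (name, target) to a stageable executable location."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str, str, Optional[str]], str] = {}

    def register(self, name: str, arch: str, os: str, location: str,
                 os_version: Optional[str] = None) -> None:
        """Register a build; os_version=None marks it as valid for any OS version."""
        self._entries[(name, arch, os, os_version)] = location

    def resolve(self, name: str, target: Target) -> str:
        """Prefer an exact OS-version match, then fall back to a version-independent build."""
        candidates = (
            (name, target.arch, target.os, target.os_version),
            (name, target.arch, target.os, None),
        )
        for key in candidates:
            if key in self._entries:
                return self._entries[key]
        raise LookupError(f"no build of {name!r} for {target}")


# Example with made-up locations: two builds of the same logical application.
repo = ExecutableRepository()
repo.register("tomo", "sparc", "solaris", "repo.example.org/tomo/sparc-solaris")
repo.register("tomo", "sparc", "solaris", "repo.example.org/tomo/sparc-solaris-2.6",
              os_version="2.6")
chosen = repo.resolve("tomo", Target("sparc", "solaris", "2.6"))
```

With such a scheme, the startup mechanism needs only the logical application name; the repository decides which binary to stage on each host.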

Coupling with ACTS Toolkit Components and other Tools

We have worked with a number of ACTS Toolkit projects to incorporate various Globus services, including those listed above, into ACTS Toolkit systems. These systems include HPC++ (Indiana U.), PAWS (LANL), the LSA toolkit (Indiana U.), and the CCAT Common Component Architecture (CCA) prototype (Indiana U.).

We have also participated in CCA meetings and contributed, in particular, to discussions of component registries and repositories. We have collaborated with other DOE 2000 participants in investigating the use of the Globus Metacomputing Directory Service to publish information about application components that are to be composed. (These techniques are used, in particular, within the LSA toolkit being developed at Indiana University.) This work has motivated proposals for standard object class definitions within the Grid Information Service working group of the Grid Forum.

We have contributed to the development of a Grid-enabled version of MPI, called MPICH-G. This implementation builds on the MPICH software developed at Argonne National Laboratory and allows MPI applications to execute in the Grid environment without change. Building on Globus communication libraries (Nexus), this version of MPI can exploit more than one communication method at a time. For example, on the LLNL ASCI Blue Pacific machine, shared memory, high-speed interconnect, and IP networking protocols are used within a single application to optimize performance. Together, the Globus resource management tools (GRAM and DUROC), remote I/O (GASS), and the executable management tools described above provide seamless access to remote Grid resources: they manage authentication to those resources, manage differences between local scheduling systems, and provide remote access to data files and diagnostic streams.
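The idea of using several communication methods within one application can be sketched as a per-pair selection rule. The following is a simplified illustration under stated assumptions, not the MPICH-G or Nexus implementation: the Process record, the method names, and the rule "same node, then same machine, then IP" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Location of one MPI process (hypothetical model)."""
    rank: int
    machine: str   # e.g. one parallel computer at one site
    node: str      # e.g. one shared-memory node within that machine


def select_method(a: Process, b: Process) -> str:
    """Pick the fastest method assumed to be available between two processes."""
    if a.machine == b.machine and a.node == b.node:
        return "shared-memory"
    if a.machine == b.machine:
        return "interconnect"
    return "tcp"


# Example: three processes, two on the same node and one at another site.
p0 = Process(0, "machine-a", "node-1")
p1 = Process(1, "machine-a", "node-1")
p2 = Process(2, "machine-a", "node-2")
p3 = Process(3, "machine-b", "node-1")
methods = [select_method(p0, other) for other in (p1, p2, p3)]
```

Because the choice is made per pair of processes, the application code itself stays unchanged while each message travels over the best path available to it.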

Other DOE-relevant tools that build on Globus services include the following:

 

Application Success Stories

To evaluate the utility of many of the facilities discussed above, we have engaged in the development of a substantial application of considerable significance to DOE. In conjunction with a DOE Grand Challenge Project in supercomputer-enhanced X-ray sources, we have developed a general framework for constructing data-processing pipelines and used this framework to perform on-line tomographic image processing of data generated at the Advanced Photon Source (APS). Although the APS-specific aspects of this work are funded by the grand challenge project, the construction of the data-processing pipeline has been performed under the auspices of the DOE 2000 effort. Using this technology, we recently demonstrated near real-time reconstruction and visualization of a micro-machined gear. Such a result would not have been possible without the Grid infrastructure developed as part of this effort.

Other application successes have been achieved with applications associated with ASCI ASAP centers (U. Chicago and Caltech), in the context of DOE NGI projects (climate, high energy physics, CorridorOne), and in joint efforts with the DOE 2000 Collaboratories project.

 

Deployment of Grid Technologies

We have worked with staff from DOE laboratories to extend the Globus Ubiquitous Supercomputing Testbed Organization (GUSTO), our Grid testbed, to DOE laboratories, including ANL, ORNL, PNNL, LANL, LBNL, SNL, and LLNL. These installations have all succeeded in that the software has been installed and laboratory staff have been able to experiment with Grid computing. However, in part because of recent DOE security restrictions and in part because of resource limitations, these installations remain research prototypes that have not yet progressed to true production infrastructure. We are currently working with LANL and LBNL in particular to take the next step and transition some Globus services to "production" use. This involves extending Grid services to deal with new firewalls installed as a result of changes in DOE security policies.

In support of this work and the tool and application work noted above, we have visited all of the labs listed, taught tutorials at various locations, provided technical support, and hosted lab staff at Globus user meetings and other events.


Future Plans

We plan further work in each of the four areas listed above: technology R&D, integration of Grid technologies into ACTS Toolkit components, application outreach, and Grid deployment. In the technology R&D area, our current focus areas are as follows:

References

1. P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von Laszewski, A Fault Detection Service for Wide Area Distributed Computations, Proc. 7th IEEE Symp. on High Performance Distributed Computing, 268-279, 1998.
2. Joseph Bester, Ian Foster, Carl Kesselman, Jean Tedesco, Steven Tuecke, GASS: A Data Movement and Access Service for Wide Area Computing Systems, Proc. IOPADS'99, ACM Press, 1999.
3. Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, Steven Tuecke, The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Data Sets, Proc. NetStore, 1999.
4. Ian Foster and Nicholas Karonis, A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems, Proc. SC'98, 1998.
5. Gregor von Laszewski, Ian Foster, Joseph A. Insley, John Bresnahan, Carl Kesselman, Mei Su, Marcus Thiebaux, Mark L. Rivers, Ian McNulty, Brian Tieman, and Steve Wang, Real-Time Analysis, Visualization, and Steering of Microtomography Experiments at Photon Sources, Proc. Ninth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1999.