mcs | publications | abstracts

2004 Abstracts of MCS Reports and Preprints

Reports

E. D. Dolan, J. J. Moré, and T. S. Munson, "Benchmarking Optimization Software with COPS 3.0," Technical Memorandum ANL/MCS-TM-273, February 2004. We describe version 3.0 of the COPS set of nonlinearly constrained optimization problems.  We have added new problems, as well as streamlined and improved most of the problems.  We also provide a comparison of the FILTER, KNITRO, LOQO, MINOS, and SNOPT solvers on these problems.
E. D. Frank and R. L. Stevens, "Process Modeling for High-Throughput Biology," Technical Memorandum ANL/MCS-TM-276, October 2004. We interviewed researchers at two high-throughput biology laboratories in order to identify and to understand their workflows. We constructed and studied a detailed process model for one of the labs. Based on these efforts and on the benefits of the model to the experimenters, we discuss here ideas for applying process modeling to high-throughput biology laboratories.
K. Keefe, "Local Search Strategies for Equational Satisfiability," Technical Memorandum ANL/MCS-TM-269, August 2004. The search for models of an algebra is an important and demanding aspect of automated reasoning. Typically, a model is represented in the form of a matrix or a set of matrices. When a model is found that satisfies all the given theorems of an algebra, it is called a solution model. This paper considers algebras that can be represented by using a single operation, by way of the Sheffer stroke.  The characteristic of needing only one operation to represent an algebra reduces the problem by requiring a search through all instances of a single matrix. This search is simple when the domain size is small, say 2, but for a larger domain size, say 10, the search space increases dramatically. Clearly, a method other than a brute-force, global search is desirable.

Most modern model-finding programs use a global search; instead of checking every possible matrix, however a set of heuristics is used that allows the search space to be dramatically smaller and thus increases the likelihood of reaching a solution. An alternative approach is local search. This paper discusses several local search strategies that were applied to the problem of equational satisfiability.
J. M. Schopf and S. J. Newhouse, "State of Grid Users: 25 Conversations with UK eScience Groups," Technical Memorandum ANL/MCS-TM-278, October 2004.

During July and August 2004 we visited various applied science and middleware groups in the U.K. in order to gather basic information on the services and functionality these projects are using. Our motivation was to help guide the development of future activities and priorities within the U.K.'s Open Middleware Infrastructure Institute [OMII] and the Globus Alliance [Globus] and to inform the wider Grid community of the status of some current services. We held meetings with application developers with some Grid (generally Globus Toolkit 2 or Globus Toolkit 3) or Web services experience, those with software that might be of broader use or interest, and those who have expressed dissatisfaction with current tools, in order to understand their issues in more detail. The twenty-five groups, listed in the Appendix, included representative applications from biology, chemistry, physics, climatology, and other scientific fields, as well as a smaller set of basic tool builders. In addition, informal discussions took place at several workshops during this time to get a broader scope in certain areas.

In this article we detail the results of our conversations with users. While the results are not unexpected, we note that the ranking of needs by the users was quite different from what many tool developers have assumed. We detail our findings in the areas of the continued need for training, security, service functionality as seen by the users, details on tools, and build/infrastructure comments. We also highlight the most commonly stated concerns that emerged from these interviews.

J. Utke, "OpenAD: Algorithm Implementation User Guide," Technical Memorandum ANL/MCS-TM-274, April 2004. Research in automatic differentiation has led to a number of tools that implement various approaches and algorithms for the most important programming languages.  While all these tools have the same mathematical underpinnings, the actual implementations have little in common and mostly are specialized for a particular programming language, compiler internal representation, or purpose.  This specialization does not promote an open test bed for experimentation with new algorithms that arise from exploiting structural properties of numerical codes in a source transformation context.  OpenAD is being designed to fill this need by providing a framework that allows for relative ease in the implementation of algorithms that operate on a representation of the numerical kernel of a program.  Language independence is achieved by using an intermediate XML format and the abstraction of common compiler analyses in OpenAnalysis.  The intermediate format is mapped to concrete programming languages via two front/back end combinations.  The design allows for reuse and combination of already implemented algorithms.  We describe the set of algorithms and basic functionality currently implemented in OpenAD and explain the necessary steps to add a new algorithm to the framework.
Preprints
R. Padmanabhan, W. McCune, "Tarski Theorems on Self-Dual Equational Bases for Groups," Preprint ANL/MCS-P1092-0903, August 2004. We present independent self-dual equational bases of arbitrarily large finite sizes for the equational theory of groups treated as varieties of various well-known types. Here the dual of a term f is the mirror reflection of f.  For each type of group theory, we provide an independent self-dual basis with n
identities for n = 2, 3, 4. Then we develop a simple algorithmic procedure to construct independent self-dual equational bases of arbitrary finite sizes in such a way that the new larger equational bases depend explicitly on the initial bases of small sizes. Applying this ``expansion'' procedure, we show that every finitely based variety of groups can be defined by an independent
self-dual set of n identities for all n >= 2. Apart from generalizing the various theorems of Alfred Tarski who initiated this topic in the late 1960s, these proofs also provide explicitly the bases and hence may be construed as the first constructive proof of Tarski's theorems as well.
A. Zagaris, H. G. Kaper, and T. J. Kaper, "Fast and Slow Dynamics for the Computational Singular Perturbation Method," SIAM J. Multiscale Model.Simul. 2 (2004),  pp. 613-638.  Also Preprint ANL/MCS-P1114-0104, January 2004.

 

The Computational Singular Perturbation (CSP) method of Lam and Goussis is an iterative method to reduce the dimensionality of systems of ordinary differential equations with multiple time scales. In [J. Nonlin. Sci., to appear], the authors showed that each iteration of the CSP algorithm improves the approximation of the slow manifold by one order.  In this paper, it is shown that the CSP method simultaneously approximates the tangent spaces to the fast fibers along which solutions relax to the slow manifold. Again, each iteration adds one order of accuracy.  In some studies, the output of the CSP algorithm is postprocessed by linearly projecting initial data onto the slow manifold along these approximate tangent spaces.  These projections, in turn, also become successively more accurate.
X. Zhang and J. M. Schopf, "Performance Analysis of the Globus Toolkit Monitoring and Discovery Service, MDS2," Preprint ANL/MCS-P1115-0104, January 2004. Monitoring and information services form a key component of a distributed system, or Grid.  A quantitative study of such services can aid in understanding the performance limitations, advise in the deployment of the monitoring system, and help evaluate future development work.  To this end, we examined the performance of the Globus Toolkit® Monitoring and Discovery Service (MDS2) by instrumenting its main services using NetLogger.  Our study shows a strong advantage to caching or prefetching the data, as well as the need to have primary components at well-connected sites.
M. Grimsditch, L. Giovannini, F. Montoncello, F. Nizzoli, G. K. Leaf, H. G. Kaper, "Normal Modes in Ferromagnetic Nanoparticles: A Dynamical Matrix Approach," Preprint ANL/MCS-P1116-0104, January 2004. We present a method to compute the magnetic normal modes of a ferromagnetic particle.  The method is a hybrid of micromagnetic simulations and a "dynamical matrix" approach similar to that used for vibrational studies.  We use the method to calculate the normal modes of an Fe parallelepiped and compare the results with the modes recently extracted from a purely micromagnetic simulation.  The results of the two approaches are in excellent agreement.  We discuss the pros and cons of both approaches.  We also present information on standing waves with wavevector perpendicular to the applied field and on a family of modes localized at the particle ends.
H.Gao, I.R.Judson, T.Uram, T.L.Disz, M.E.Papka, R.L.Stevens, "Capability Matching of Data Streams with Network Services," Preprint ANL/MCS-P1118-0104, January 2004. Distributed computing middleware needs to support a wide range of resource, such as diverse software components, various hardware devices, and heterogeneous operating systems and architectures.  Current technologies are unable to implement a maintenance-free platform to be compatible with such different computing environments.  This situation is presenting an increasing challenge as Grid computing becomes more widespread.  The infrastructure of network services (CENSA and CENSI) has been proposed to address this challenge.  A seamless Grid computing environment, supported by network services, is composed of various streams such as data, video, audio, and text.  We define a mathematical model of capability matching for three-party agreements: requests from users, resources, and network services.  Based on the mathematical model, we provide a general approach for capability matching.  We also present a new language schema for capability description.  As an example, we embed the general matchmaker in the architecture of the Access Grid.  Several tests of accuracy and performance are discussed.
W. Jiang, J. Liu, H.-W. Jin, D. K. Panda, W. Gropp, and R. Thakur, "High Performance MPI-2 One-Sided Communication over InfiniBand," Preprint ANL/MCS-P1119-0104, January 2004. Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations.  Although this approach can achieve good portability, it suffers from high communication overhead and dependency on remote process for communication progress.  To address these problems, we propose a high performance MPI-2 one-sided communication design over the InfiniBand Architecture.  In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get, and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations.  Our design has been implemented based on MPICH2 over InfiniBand.  We present detailed design issues for this approach and perform a set of micro-benchmarks to characterize different aspects of its performance.  Our performance evaluation shows that compared with the design based on MPI send/receive, our design can improve throughput up to 77% and reduce latency and synchronization overhead up to 19% and 13%, respectively.  Under certain process skew, the bad impact can be significantly reduced by new design, from 41% to nearly 0%.  It also can achieve better overlap of communication and computation.
P. M. Lyster, J. Guo, T. Clune, and J. W. Larson, "The Computational Complexity and Parallel Scalability of Atmospheric Data Assimilation Algorithms," J. Atmospheric and Oceanic Technology 21 (2004), pp. 1689-1700.  Also Preprint ANL/MCS-P1120-0104, January 2004. This paper quantifies the computational complexity and parallel scalability of two algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Global Modeling and Assimilation Office (GMAO).  The first,
the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS).  GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research.  The second, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS.  For atmospheric assimilation, the gridded dynamical fields typically have more than 106 variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem will require petaflop s-1 computing to achieve effective throughput for scientific research.
W. Yu, D. Buntinas, R. L. Graham, and D. K. Panda, "Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-based Collective Message Passing Protocol," Preprint ANL/MCS-P1121-0204, February 2004. Modern interconnects often have programmable processors in the network interface that can be utilized to offload communication processing from host CPU.  In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol.  With barrier as an initial case study, we have demonstrated that much of the communication processing can be greatly simplified with this collective protocol.  Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high performance interconnects. Quadrics and Myrinet.

Our evaluation shows that, over a Quadrics cluster of 8 nodes with Elan3 Network, the NIC-based barrier operation achieves a barrier latency of only 5.60µs.  This result is a 2.48 factor of impro9vement over the Elanlib tree-based barrier operation.  Over a Myrinet cluster of 8 nodes with LANai-XP NIC cards, a barrier latency of 14.20µs over 8 nodes is achieved.  This is a 2.64 factor of improvement over the host-based barrier algorithm.  Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13µs latency over Quadrics and with 38.94µs latency over Myrinet.  These results indicate the potential for developing high performance communication subsystems for next generation clusters.

J. Wu, P. Wyckoff, D. Panda, and R. Ross, "Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand," Preprint ANL/MCS-P1122-0204, February 2004. The advent of networking technologies and high performance transport protocols facilitates the service of storage over networks.  However, they pose challenges in integration and interaction among storage server application components and system components.  In this paper, we put forward a component, called Unifier, to provide more efficient integration and better interaction among these components.  Unifier has three notable features.  (1) Unifier integrates cache management and communication buffer management.  It offers a single copy data sharing among all components in a server application safely and concurrently.  (2) It reduces memory registration and deregistration costs to enable applications to take full advantage of RDMA operations.  (3) It provides means to achieve adaptation, application-specific optimization, and better cooperation among different components.  This paper presents the design and implementation of Unifier.  This component has been deployed and evaluated in a version of PVFS1 implementation over InfiniBand.  Experimental results show performance improvements between 30% and 70% over other approaches.  Better scalability is also achieved by the PVFS I/O servers.
P. M. Dickens and J. W. Larson, "Classifiers for the Causes of Data Loss Using Packet-Loss Signatures," Preprint ANL/MCS-P1123-0204, February 2004. A necessary step in the development of next-generation congestion control mechanisms is the ability to accurately classify the root cause(s) of observed data loss and to develop responses tailored to the particular cause.  Toward this end, we are developing a classification mechanism based on the collection and analysis of what we term packet-loss signatures, which describe the patterns of packet loss in the current transmission window.  We are exploring the application of complexity theory to the problem of learning the underlying structure (or lack thereof) of these signatures, and studying the relationship between such underlying structure and the system conditions responsible for its generation.  In this paper, we describe the algorithm for determining the complexity of packet-loss signatures, show how complexity measures can be mapped to the underlying causes of packet loss, and provide experimental results demonstrating the effectiveness of our approach.
A. Ching, A. Choudhary, W.-K. Liao, R. Ross, W. Gropp, "Evaluatring Structured I/O Methods for Parallel File Systems," IJHPCN 2(2/3/4), (2004), pp. 133-145.  Also Preprint ANL/MCS-P1125-0204, February 2004. Modern data-intensive structured datasets constantly undergo manipulation and migration through parallel scientific applications.  Directly supporting these time consuming operation is an important step in providing high-performance I/O solutions for modern large-scale applications.  High-level interfaces such as HDF56 and Parallel netCDF provide convenient APIs for accessing structured datasets, and the MPI-IO interface also supports efficient access to structured data.  Parallel file systems do not traditionally support such structured access from these higher level interfaces.  In this work we present two contributions.  First, we demonstrate an implementation of structured data access support in the context of the Parallel Virtual File System (PVFs).  We call this support "datatype I/O" because of its similarity to MPI datatypes.  This support is built by using a reusable datatype-processing component from the MPICH2 MPI implementation.  The second contribution of this work is a comparison of I/O characteristics of modern high-performance noncontiguous I/O methods.  We use our I/O characteristics comparison to assess all the methods using three test applications.  We also point to further optimizations that could be leveraged for even more efficient operation.
G. von Laszewski, J. Gawor, P. Plaszczak, M. Hategan, K. Amin, R. Madduri, S. Gose, S. Shankar, "An Overview of Grid File Transfer Patterns and Their Implementation in the Java CoG Kit," J. Neural Parallel and Scientific Computations 12(3) (2004), pp. 329-352.  Also Preprint ANL/MCS-P1126-1203, February 2004. Accessing files on remote resources is a required function in Grids.  In this paper, we report on the file transfer patterns supported in the Java CoG Kit.  These patterns are supported by a rich set of accompanying components, including Java classes and methods, command line tools, graphical user interfaces, and portals.  The patterns and their implementations are exposed through familiar Java language capabilities of interfaces, hence hiding the underlying protocols.  Using these interfaces, one can provide a variety of implementations for diverse file transfer mechanisms and protocols.  Together, these tools can be used to implement more sophisticated services.  We present a number of prototype applications reusing the Java CoG Kits file transfer patterns.  Additionally, we present performance numbers based on a typical client deployment scenario.
R. Al-Ali, G. von Laszewski, K. Amin, M. Hategan, O. Rana, D. Walker, N. Zaluzec, "QoS Support for High-Performance Scientific Grid Applications," Preprint ANL/MCS-P1129-0204, February 2004. The Grid approach provides the ability to access and use distributed resources as part of virtual organizations.  The emerging Grid infrastructure gives rise to a class of scientific applications and services to support collaborative and distributed resource-sharing requirements as part of teleimmersion, visualization, and simulation services.  Because such applications operate in a collaborative mode, data must be stored and delivered in a timely manner to meet deadlines.  Hence, this class of applications has stringent real-time constraints and quality-of-service (QoS) requirements.  A QoS management approach is required to orchestrate and guarantee the interaction between such applications and services.  In this paper we discuss the design and prototype implementation of a QoS system and show how we enable Grid applications to become QoS compliant.  We validate this approach through a case study of nano materials.  Our approach enhances the current Open Grid Services Architecture.  We demonstrate the usefulness of the approach on a nano materials application.
W. Allcock, J. Bresnahan, R. Kettimuthu, J. Link, "The Globux eXtensible Input/Output System (XIO): A protocol independent IO system for the Grid," Preprint ANL/MCS-P1130-0204, February 2004. In distributed heterogeneous Grid environments, the protocols used to exchange bits are crucial.  As researchers work hard to discover the best new protocol for the Grid, application developers struggle with ways to use these new protocols.  A stable, consistent, and intuitive framework is needed to aid in the implementation and use of these protocols.  While the application must not be burdened with the protocol details, some of it may need to be exposed to take advantage of potential optimizations.  In this paper we examine how the Globus XIO API provides this framework.  We will explore the performance implications of using this abstraction layer and the benefits gained in application as well as protocol development.
B. Norris, J. Ray, R. Armstrong, L. C. McInnes, D. E. Bernholt, W. R. Elwasif, A. D. Malony, S. Shende, "Computational Quality of Service for Scientific Components," Preprint ANL/MCS-P1131-0304, March 2004. Scientific computing on massively parallel computers presents unique challenges to component-based software engineering (CBSE).  While CBSE is at least as enabling for scientific computing as it is for other arenas, the requirements are different.  We briefly discuss how these requirements shape the Common Component Architecture, and we describe some recent research on quality-of-service issues to address the computational performance and accuracy of scientific simulations.
V. Welch, I. Foster, C. Kesselman, O. Mulmo, L. Pearlman, S. Tuecke, J. Gawor, S. Meder, F. Siebenlist, "X.509 Proxy Certificates for Dynamic Delegation," Preprint ANL/MCS-P1132-0204, February 2004. Proxy credentials are commonly used in security systems when one entity wishes to grant to another entity some set of its privileges.  We have defined and standardized X.509 Proxy Certificates for the purpose of providing restricted proxying and delegation within a PKI-based authentication system.  We present here our motivations for this work coming from our efforts in Grid security, the Proxy Certificate itself, and our experiences in implementation and deployment.
D.P. Schissel, K.Keahey, T.Araki, J.R. Burruss, E.Feibush, S.M.Flanagan, T. W. Fredian, M. J. Greenwald, S. A. Klasky, T. Leggett, K. Li, D. C. McCune, P. Lane, M. E. Papka, Q. Peng, L. Randerson, A.Sanderson, J.Stillerman, M. R. Thompson, G. Wallace, "The National Fusion Collaboratory Project: Applying Grid Technology for Magnetic Fusion Research," Preprint ANL/MCS-P1133-0304, March 2004. The overall goal of the DOE SciDAC funded U.S. National Fusion  Collaboratory Project (http://www.fusiongrid.org/) is to improve the productivity of fusion sciences research through the development and deployment of advanced software tools that reduce the technical barriers to collaboration and sharing on a national scale. Our vision is to make resources—data, computers along with analysis, simulation and visualization codes—widely and transparently available as network accessible services, thereby enabling real-time multi-institutional collaboration on fusion experiments and improving comparisons between experiments and theory. The project unites fusion and computer science researchers to develop and deploy a national Fusion Energy Sciences Grid (FusionGrid), a system for secure sharing of computation, visualization, and data resources over the Internet. Some of the early prototype FusionGrid services proved so successful they are now used in a production environment for everyday fusion research. The success of other prototype services has resulted in fusion research funds being used to purchase necessary hardware to create production services. Future work is planned to augment the services currently available with FusionGrid, transition prototype services to production capability, and put in place the infrastructure to support a larger user base.
T. Munson, "An Averaging Method for Nash Games with Shared Decision Variables," Preprint ANL/MCS-P1134-0304, March 2004. This short communication concerns a class of Nash games in which the participants share some of the decision variables.  An averaging method is developed in order to transform these games into standard Nash games.  This technique is successfully applied to solve an example multileader, single-follower game described by Pang and Fukushima.
T. Munson, "Mesh Shape-Quality Optimization Using the Inverse Mean-Ratio Metric," Preprint ANL/MCS-P1136-0304, March 2004. Meshes containing elements with bad quality can result in poorly conditioned systems of equations that must be solved when using a discretization method, such as the finite-element method, for solving a partial differential equation.  Moreover, such meshes can lead to poor accuracy in the approximate solution computed.  In this paper, we present a nonlinear fractional program that relocates the vertices of a given mesh to optimize the average elemen5t shape quality as measured by the inverse mean-ratio metric.  To solve the resulting large-scale optimization problems, we apply an efficient implementation of an inexact Newton algorithm using the conjugate gradient method with a block Jacobi preconditioner to compute the direction.  We show that the block Jacobi preconditioner is positive definite by proving a general theorem concerning the convexity of fractional functions, applying this result to components of the inverse mean-ratio metric, and showing that each block in the preconditioner is invertible.  Numerical results obtained with this special-purpose code on several test meshes are presented and used to quantify the impact on solution time and memory requirements of using a modeling language and general-purpose algorithm to solve these problems.
A. Kawano, Y. Guo, D. L. Thompson, A. F. Wagner, M. Minkoff, "Improving the Accuracy of Interpolated Potential Energy Surfaces by Using an Analytical Zeroth-Order Potential Function," Journal of Chemical Physics 120, No. 14 (2004), pp. 6414-6422.  Also Preprint ANL/MCS-P1137-0304, March 2004. We present a method for improving the accuracy and efficiency of interpolation methods, in which an analytical zeroth-order potential-energy surface is employed as a reference surface.  To investigate and test the method, we apply it to hydrogen peroxide where there exists an accurate analytical surface which we take as the "exact" surface for obtaining the energies and derivatives for fitting and assessing the accuracy.  Examples are given for four-dimensional and six-dimensional surfaces interpolated by using either the modified Shepard or second-degree interpolating moving least-squares approach, with comparisons for cases with and without using the zeroth-order potential.
Y. Zhou, R. Shepard, and M. Minkoff, "Computing Eigenvalue Bounds for Iterative Subspace Matrix Methods," Preprint ANL/MCS-P1138-0304, March 2004. A procedure is presented for the computation of bounds to eigenvalues of the generalized hermitian eigenvalue problem and to the standard hermitian eigenvlaue problem.  This procedure is applicable to iterative subspace eigenvalue methods and to both external and interior eigenvalues.  The Ritz values and their corresponding residual norms, all of which are computable quantities, are needed by the procedure.  Knowledge of the exact eigenvalues is not needed by the procedure, but it must be known that the computed Ritz values are isolated from exact eigenvalues within the Ritz spectrum range.  A multipass refinement procedure is described to compute the bounds for each Ritz value.  This procedure requires O(m) effort where m is the subspace dimension for each pass.
K. Keahey, M. E. Papka, Q. Peng, D. Schissel, G. Abla, T. Araki, J. Burruss, S. Feibush, P. Lane, S. Klasky, T. Leggett, D. McCune, and L. Randerson, "Grids for Experimental Science: The Virtual Control Room," Preprint ANL/MCS-P1139-0304, March 2004. The National Fusion Collaboratory focuses on enabling fusion scientists to explore Grid capabilities in support of experimental science.  Fusion experiments are structured as a series of plasma pulses initiated roughly every 20 minutes.  In the between-pulse intervals, scientists perform data analysis and discuss results to reach decisions affecting changes to the next plasma pulse.  This interaction can be made more efficient by performing more analysis and engaging more expertise from a geographically distributed team of scientists and resources.  In this paper, we describe a virtual control room experiment that unites collaborative, visualization, and Grid technologies to provide such environment and shows how their combined effect can advance experimental science.  We also report on FusionGrid services whose use during the fusion experimental cycle became possible for the first time thanks to this technology.  We also describe the Access Grid, experimental data presentation tools, and agreement-based resource management and workflow systems enabling time-bounded end-to-end application execution.  The first virtual control room experiment represented a mock-up of a remote interaction with the DIII-D control room and was presented at SC03 and later reviewed at an international ITER Grid Workshop.
R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," Preprint ANL/MCS-P1140-0304, March 2004. We describe our work on optimizing the collective communication operations in MPICH for clusters connected by switched networks.  For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages.  Although we have implemented new algorithms for all MPI collective operations, because of limited space we describe only the algorithms for all-gather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce.  Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP.  We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes.  The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E.  This work demonstrates that to achieve the best performance for a collective communication operation, we need to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
K. Keahey, K. Doering, and I. Foster, "From Sandbox to Playground: Dynamic Virtual Environments in the Grid," Preprint ANL/MCS-P1141-0304, March 2004. Although the latest definition of the Grid emphasizes the delivery of nontrivial quality of service (QoS), to date there has been little work on providing such QoS.  Recent advances in virtualization and related technologies, as well as the developments of Grid protocols have paved the way for applying virtualization technologies to the Grids in nontrivial and interesting ways.  This situation presents a new opportunity to conceptually model the Grid as a set of distributed virtual execution environments (sandboxes or virtual machines): a virtual playground.  Such a virtual playground would enable a Grid user to approach problem-solving in the Grids in a new way: rather than fit the problem onto available resources a user would request the virtual resources needed to solve the problem and rely on scheduling software to implement the playground in the Grid.  While this is an appealing vision, much work needs to be done to make it a reality.  In this paper, we describe one modest step toward its fulfillment: investigating the protocols and technologies for the implementation of dynamic virtual environments (DVEs), allowing a client and manage remote execution environments.
K. Keahey, T. Araki, and P. Lane, "Agreement-Based Interactions for Experimental Science," Preprint ANL/MCS-P1142-0304, March 2003. Enabling quality of service in Grids requires not only resource management strategies but also the development of protocols enabling structured negotiation for the use of resources.  In this paper, we describe design, implementation, and application of an agreement-based infrastructure.  We then discuss its use in the virtual control room developed for the National Fusion Collaboratory.
M. Anitescu, "Global Convergence of an Eleastic Mode Approach for a Class of Mathematical Programs with Complementarity Constraints," Preprint ANL/MCS-P1143-0404, April 2004. We prove that any accumulation point of an elastic mode approach, applied to the optimization of a mixed P variational inequality, that approximately solves the relaxed subproblems is a C-stationary point of the problem of optimizing a parametric mixed P variational inequality.  If, in addition, the accumulation point satisfies the MPCC-LICQ constant qualification and if the solutions of the subproblem satisfy approximate second-order sufficient conditions, then the limiting point is an M-stationary point.  Moreover, if the accumulation point satisfies the upper-level strict complementarity condition, the accumulation point will be a strongly stationary point.  If we assume that the penalty function associated with the feasible set of the mathematical program with complementarity constraints has bounded level sets and if the objective function is bounded below, we show that the algorithm will produce bounded iterates and will therefore have at least one accumulation point.  We prove that the obstacle problem satisfies our assumptions for both a rigid and a deformable obstacle.  The theoretical conclusions are validated by several numerical examples.
M. R. Paul, K-H. Chiam, M. C. Cross, and P. F. Fischer, "Rayleight-Bénard Convection in Large-Aspect-Ratio Domains," Preprint ANL/MCS-P1144-0404, April 2004. The coarsening and wavenumber selection of striped states growing from random initial conditions are studied in a non-relaxational, spatially extended, and far-from-equilibrium system by performing large-scale numerical simulations of Rayleigh-Bénard convection in a large-aspect-ratio cylindrical domain with experimentally realistic boundaries.  We find evidence that various measures of the coarsening dynamics scale in time with different power-law exponents, indicating that multiple length scales are required in describing the time dependent pattern evolution.  The translational correlation length scales with time as t0.12, the orientational correlation length scales as t0.54, and the density of defects scale as t-0.45.  The final pattern evolves toward the wavenumber where isolated dislocations become motionless, suggesting a possible wavenumber selection mechanism for large-aspect-ratio convection.
U. Naumann, J. Utke, A. Lyons, and M. Fagan, "Control Flow Reversal for Adjoint Code Generation," Preprint ANL/MCS-P1145-0404, April 2004. We describe an approach to the reversal of the control flow of structured programs.  It is used to automatically generate adjoint code for numerical programs by semantic source transformation.  After a short introduction to applications and the implementation tool set, we describe the building blocks using a simple example.  We then illustrate the code reversal within basic blocks.  The main part of the paper covers the reversal of structured control flow graphs.  We show the algorithmic steps for simple branches and loops and give a detailed algorithm for the reversal of arbitrary combinations of loops and branches in a general control flow graph.
L. Wos, "Milestones for Automated Reasoning," International Journal on Artificial Intelligence Tools 15, No. 1, pp. 3-19.  Also Preprint ANL/MCS-P1146-0404, April 2004. In the beginning (the early 1960s), the long-term goal of automated deduction was the design and implementation of a program whose use would lead to "real'' and significant contributions to mathematics by offering sufficient power for the discovery of proofs. The realization of that goal appeared to be at least six decades in the future. However, with amazement and satisfaction, we can report that less than four decades were required.
In this article, we present evidence for this claim, thanks to W. McCune's program OTTER. Our focus is on various landmarks, or milestones, of two types. One type concerns the formulation of new strategies and methodologies whose use greatly enhances the power of a reasoning program.  A second type focuses on actual contributions to mathematics and (although not initially envisioned) to logic. We give examples of each type of milestone, and, perhaps of equal importance, demonstrate that advances are far more likely to occur if the two classes are indeed intertwined. We draw heavily on material presented in great detail in the new book  Automated Reasoning and the Discovery of Missing and Elegant Proofs,  published by Rinton Press.
M. Hereld, R. L. Stevens, W. van Drongelen, H. C. Lee, "Developing a Petascale Neural Simulation," Preprint ANL/MCS-P1147-0404, April 2004. Simulations of large neural networks have the potential to contribute uniquely to the study of epilepsy, from the effects of extremely local changes in neuron environment and behavior to the effects of large-scale wiring anomalies.  Currently, however, simulations with sufficient detail in the neuron model are limited to cell counts that are far smaller than scales measured by typical probes.  Furthermore, future simulations are likely to follow the path of large-scale simulations in other fields and include hierarchically interacting components covering different scales and different biophysics.  Problem solving in this domain calls for petascale computing---computing with supercomputers capable of performing 1015 operations a second and holding datasets pf 1015 bytes in memory.  We present the structure of our simulation of epileptiform electrical activity in the neocortex, describe experiments and models of its scaling behavior in large cluster supercomputers, identify tight spots in this behavior, and project the performance onto a candidate next-generation computing platform.
J. P. Kenny, S .J. Benson, Y. Alexeev, J. Sarich, C. L. Janssen, L. C. McInnes, M. Krishnan, J. Nieplocha, E. Jurruss, C. Fahlstrom, T. L. Windus, "Component-Based Integration of Chemistry and Optimization Software,"  J. Comput. Chem. 25, no. 14 (2004), pp. 1717-1725.  Also Preprint ANL/MCS-P1148-0404, April 2004. Typical scientific software designs make rigid assumptions regarding programming language and data structures, frustrating software interoperability and scientific collaboration.  Component-based software engineering is an emerging approach to managing the increasing complexity of scientific software.  Component technology facilitates code interoperability and reuse.  Through the adoption of methodology and tools developed by the Common Component Architecture Forum, we have developed a component architecture for molecular structure optimization.  Using the NWChem and Massively Parallel Quantum Chemistry pages, we have produced chemistry components that provide capacity for energy and energy derivative evaluation.  We have constructed geometry optimization applications by integrating the Toolkit for Advanced Optimization, Portable Extensible Toolkit for Scientific Computation, and Global Arrays packages, which provide optimization and linear algebra capabilities.  We present a brief overview of the component development process and a description of abstract interfaces for chemical optimizations.  The components conforming to these abstract interfaces allow the construction of applications using different chemistry and mathematics packages interchangeably.  Initial numerical results for the component software demonstrate good performance and highlight potential research enabled by this platform.
N. Desai, R. Bradshaw, and E. Lusk, "Component-Based Cluster Systems Software Architecture: A Case Study," Preprint ANL/MCS-P1149-0404, April 2004. We describe the use of component architecture in an area to which this approach has not been classically applied, the area of cluster system software.  By "cluster system software," we mean the collection of programs used in configuring and maintaining individual nodes, together with the software involved in submission, scheduling, monitoring, and termination of jobs.  We describe how the component approach maps onto the cluster systems software problem, together with our experiences with the approach in implementing an all-new suite of systems software for a medium-sized cluster with unusually complex systems software requirements.
A.-V. Demiguel, M. P. Freidlander, F. J. Nogales, and S. Scholtes, "An Interior-Point Method for MPECs Based on Strictly Feasible Relaxations," SIAM J. Optimization 16(2) (2005), pp. 587-609.  Title in journal changed to "A Two-Sided Relaxation Scheme for Mathematical Programs with Equilibrium Constraints."  Also Preprint ANL/MCS-P1150-0404, April 2004. An interior-point method for solving mathematical programs with equilibrium constraints (MPECs) is proposed. At each iteration of the algorithm, a single primal-dual step is computed from each subproblem of a sequence. Each subproblem is defined as a relaxation of the MPEC with a nonempty strictly feasible region. In contrast to previous approaches, the proposed relaxation scheme preserves the nonempty strict feasibility of each subproblem even in the limit. Local and superlinear convergence of the algorithm is proved even with a less restrictive strict complementarity condition than the standard one. Moreover, mechanisms for inducing global convergence in practice are proposed. Numerical results on the MacMPEC test problem set demonstrate the fast-local convergence properties of the algorithm.
N. T. Karonis, M. E. Papka, J. Binns, J. Bresnahan, J. M. Link, "Effective Use of Dedicated Wide-Area Networks for High-Performance Distributed Computing," Preprint ANL/MCS-P1151-0404, April 2004. Recent advances in Grid technology have made it possible to build so-called computational Grids, or simply Grids, which couple unique or rare resources that are geographically separated and span multiple administrative domains.  Such Grids are invariably composed of heterogeneous networks in which, at the least, a high-performance switch accommodates intracluster messages and a separate, sometimes dedicated, high-bandwidth network serving intersite messages across the wide area.  While such wide-area networks provide unprecedented bandwidth capacity and reliability, the effective utilization of these networks remains an open challenge.  Most applications by default use the TCP/IP protocol for its ease of use and reliability, but the high bandwidth and high latency sometimes found on these networks induce enormous bandwidth delay products that result in extremely large TCP congestion window sizes.  This situation makes TCP a poor choice for data-intensive applications striving to achieve maximum bandwidth utilization on high-performance networks.  To address this bandwidth utilization challenge for Grids connected over dedicated networks, we present a solution based on the UDP protocol with added reliability and the Message Passing Interface (MPI) standard.  MPI provides an interface that allows application programmers to ignore network heterogeneity.  To study the efficacy of our approach, we implemented our implementation of the Reliable-Blast UDP protocol in MPICH-G2, our Grid-enabled MPI.  We demonstrated this implementation in an MPI data-intensive Grid visualization application on the NSF TeraGrid and its dedicated high-bandwidth fiber optic network.  We observed an improvement in aggregate bandwidth utilization from 58 Mbps with MPICH-G2 using TCP alone to 9 Gbps with our technique.
C. H. Bischof, P. D. Hovland, and B. Norris, "On the Implementation of Automatic Differentiation Tools," Preprint ANL/MCS-P1152-0404, April 2004. Automatic differentiation is a semantic transformation that applies the rules of differential calculus to source code.  It thus transforms a computer program that computes a mathematical function into a program that computes the function and its derivatives.  Derivatives play an important role in a wide variety of scientific computing applications, including numerical optimization, solution of nonlinear equations, sensitivity analysis, and nonlinear inverse problems.  We describe the forward and reverse modes of automatic differentiation and provide a survey of implementation strategies.  We describe some of the challenges in the implementation of automatic differentiation tools, with a focus on tools based on source transformation.  We conclude with an overview of current research and future opportunities.
S. L. Lee and P. D. Hovland, "Derivative Techniques for Forward Sensitivity Analysis in Time-Dependent Problems," Preprint ANL/MCS-P1153-0404, April 2004. The complex-step derivative approximation technique is a highly accurate and convenient method for computing directional derivatives within simulation codes.  The method is similar to finite-difference approximations and automatic differentiation techniques in terms of ease of implementation and accuracy, respectively.  We examine the performance and accuracy of finite-difference, automatic differentiation, and complex-step techniques in analyzing the solutions to physical systems modeled as initial-value problems in ordinary differential equations (ODEs).  In particular, we investigate the use of these derivative techniques in computing the sensitivity of the ODE solutions with respect to various model parameters.  In doing so, we identify the strengths and weaknesses of the derivative techniques and find that the derivify implementation of automatic differentiation can be a simple, accurate, and cost-effective means of computing directional derivatives for forward sensitivity analysis.
W. Gropp and E. Lusk, "Fault Tolerance in MPI Programs," Preprint ANL/MCS-P1154-0404, April 2004. This paper examines the topic of writing fault-tolerant MPI applications.  We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it.  We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI program, modifying MPI semantics, and extending the MPI specification.  We conclude that within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
E. D. Dolan, J. J. More', and T. S. Munson, "Optimality Measures for Performance Profiles," SIAM J. Optimization 16(3) (2006), pp. 891-909.  Also Preprint ANL/MCS-P1155-0504, May 2004. We examine the importance of optimality measures when benchmarking a set of solvers, and show that scaling requirements lead to a convergence test for nonlinearly constrained optimization solvers that uses a mixture of absolute and relative error measures. We demonstrate that this convergence test is well behaved at any point where the constraints satisfy the Mangasarian-Fromovitz constraint qualification and also avoids the explicit use of a complementarity measure.  Computational experiments explore the impact of this convergence test on the benchmarking process with performance profiles.
G. von Laszewski, G. W. Pieper, and P. Wagstrom, "Gestalt of the Grid," in Performance Evaluation and Characterization of Parallel and Distributed Computing Tools, eds. S. Hariri and M. Parashar, Chapter 6, John Wiley & Sons, Inc. (to appear).  Also Preprint ANL/MCS-P1156-0504, April 2004.  The Grid approach is an important development in the discipline of computer science and engineering.  Rapid progress is being made on several levels, including the definition of terminology, the design of an architecture and framework, the application in the scientific problem solving process, and the creation of physical instantiations of Grids on a production level.  This chapter provides an overview of important influences, developments, and technologies that are shaping state-of-the-art Grid computing.  In particular, we address the following questions:  What motives the Grid approach?  What is a Grid?  What is the architecture of a Grid?  Which Grid research activities are performed?  How do researchers use a Grid?  What will the future bring?
J. Peng, M. Anitescu, and S. Akella, "Optimal Control of Multiple Robot Systems with Friction using MPCC," Preprint ANL/MCS-P1157-0404, April 2004. This paper studies optimal control of multiple robot systems with frictional contact.  The robots have nonlinear dynamics, which may arise from the robot body dynamics, friction between robot and environment, and friction between robot and robot.  Nonpenetration constraints between robots are imposed, and the robots are assumed rigid.  The problem is modeled as a mathematical program with complementarity constraints (MPCC).  The MPCC model is solved using its elastic mode, which is a well-behaved nonlinear programming problem.  Preliminary results with this approach are illustrated on example problems.  The main contributions of this paper are: (1) a novel optimal control model that can deal with friction in the multiple robot system; (2) application of a new mathematical programming algorithm to solve the MPCC model effectively.  The coordination model has potential applications in robot systems with friction, such as multi-finger manipulation, manipulation with ropes, and multi-robot pushing coordination.
R. Thakur, W. Gropp, and B. Toonen, "Minimizing Synchronization Overhead in the Implementation of MPI One-Sided Communication," Preprint ANL/MCS-P1158-0504, May 2004. The one-sided communication operations in MPI are intended to provide the convenience of directly accessing remote memory and the potential for higher performance than regular point-to-point communication.  Our performance measurements with three MPI implementations (IBM MPI, Sun MPI, and LAM) indicate, however, that one-sided communication can perform much worse than point-to-point communication if the associated synchronization calls are not implemented efficiently.  In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH-2.  We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock.  Our performance results demonstrate that, for short messages, MPICH-2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
L. F. Diachin, P. Knupp, T. Munson, and S. Shontz, "A Comparison of Inexact Newton and Coordinate Descent Mesh Optimization Techniques," Preprint ANL/MCS-P1159-0504, May 2004. We compare inexact Newton and coordinate descent methods for optimizing the quality of a mesh by repositioning the vertices, where quality is measured by the harmonic mean of the mean-ratio metric.  The effects of problem size, element size heterogeneity, and various vertex displacement schemes on the performance of these algorithms are assessed for a series of tetrahedral meshes.
N. Desai, R. Bradshaw, A. Lusk, and E. Lusk, "MPI Cluster System Software," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proc. 2004 EuroPVM/PMP, Budapest, Sept. 14-15, 2004 (to appear).  Also Preprint ANL/MCS-P1160-0504, May 2004. We describe the use of MPI for writing system software and tools, an area where it has not been previously applied. By ``system software'' we mean collections of tools used for system management and operations. We describe the common methodologies used for system software development, together with our experiences in implementing three items of system software with MPI. We demonstrate that MPI can bring significant performance and other benefits to system software.
M. Anitescu, "Optimization-Based Simulation of Nonsmooth Rigid Multibody Dynamics," Preprint ANL/MCS-P1161-0504, May 2004. We present a time-stepping method to simulate rigid multibody dynamics with inelastic collision, contact, and friction.  The method progresses with fixed time step without backtracking for collision and solves at every step a strictly convex quadratic program.  We prove that a solution sequence of the method converges to the solution of a measure differential inclusion.  We present numerical results for a few examples, and we illustrate the difference between the results from our scheme and previous, linear-complementarity-based time-stepping schemes.
F. A. Potra, M. Anitescu, B. Gavrea, and J. Trinkle, "A Linearly Implicit Trapezoidal Method for Integrating Stiff Multibody Dynamics with Contact, Joints, and Friction," Preprint ANL/MCS-P1162-0504, May 2004. We present a hard constraint, linear complementarity based, method for the simulation of stiff multibody dynamics with contact, joints, and friction.  The approach uses a linearization of the modified trapezoidal method, incorporates a Poisson restitution model at collision, and solves only one linear complementarity problem per time step when no collisions are encountered.  We prove that the method has order two, a fact that is also demonstrated by our numerical simulations.  When we use a special approximation of the Jacobian matrix for the case where the stiff forces originate in springs and dampers attached to two points in the system, we prove that the linear complementarity problem can be solved for any value of the time step and that the method is stiffly stable.  In particular, the numerical solution converges to the numerical solution of the constraint problem, where the stiff springs and dampers have been replaced by rigid links, a fact that we also demonstrate by numerical simulations.  The method was implemented in UMBRA, an industrial-grade virtual prototyping software.
U. Naumann, J. Utke, and A. Lyons, "Control Flow Reversal for Adjoint Code Generation," Preprint ANL/MCS-P1163-0504, May 2004. We describe an approach to the reversal of the control flow of structured programs.  It is used to automatically generate adjoint code for numerical programs by semantic source transformation.  After a short introduction to applications and the implementation tool set, we describe the building blocks using a simple example.  We then illustrate the code reversal within basic blocks.  The main part of the paper covers the reversal of structured control flow graphs.  We show the algorithmic steps for simple branches and loops and give a detailed algorithm for the reversal of arbitrary combinations of loops and branches in a general control flow graph.
W. Jiang, J. Liu, H.-W. Jin, D. K. Panda, D. Buntinas, R. Thakur, W. Gropp, "Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters," Preprint ANL/MCS-P1164-0504, May 2004. In this paper we compare various design alternatives for synchronization in MPI-2 passive One-sided communication on InfiniBand clusters.  We discuss several requirements for synchronization in passive one-sided communication.  Based on these requirements, we present four design alternatives, which can be classified into two categories:  thread-based and atomic operation-based.  In thread-based designs, synchronization is achieved with the help of extra threads.  In atomic operation-based designs, we exploit InfiniBand atomic operations such as Compare-and-Swap and Fetch-and-Add.  Our performance evaluation results show that the atomic operation-based design can require less synchronization overhead, achieve better concurrency, and consume fewer computing resources compared with the thread-based design.
T. Bae, A. K. Banger, A. Wallace, E. M. Glass, F. Aslund, O. Schneewind, D. M. Missiakas, "Staphylococcus Aureus Virulence Genes Identified by Bursa Aurealis Mutagenesis and Nematode Killing," Proc. Natl. Acad. Sci. U.S.A., 101(33), 12312-317, August 17, 2004.  Also Preprint ANL/MCS-P1165-0604, June 2004. Staphlococcus aureus is the lading cause of wound and hospital-acquired infections world-wide.  The increased emergence of S. aureus straints with resistance to multiple antibiotics requires the identification of bacterial virulence genes and the development of novel therapeutic strategies.  Herein, bursa aurealis, a mariner-based transposon, was used for random mutagenesis and for the isolation of 10,325 S. aureus variants with defeind insertion sites.  By screening for loss of function mutants in a Caenorhabditis elegans killing assay, 71 S. aureus virulence genes were identified.  Some, but not all, of these genes are also required for S. aureus abscess formation in a murine infection model.

W. E. Allcock, I. Foster, R. Madduri, "Reliable Data Transport: A Critical Service for the Grid," Preprint ANL/MCS-P1166-0604, June 2004. Many Grid applications require secure, fast, efficient, and robust data transport mechanisms, for example when moving a large dataset to or from a remote compute server, or when fetching data from several locations for the purpose of integration.
M. M. Strout and P. D. Hovland, "Metrics and Models for Reordering Transformations," Preprint ANL/MCS-P1167-0604, June 2004. Irregular applications frequently exhibit poor performance on contemporary computer architectures, in large part because of their inefficient use of the memory hierarchy.  Run-time data- and iteration-reordering transformations have been shown to improve the locality and therefore the performance of irregular benchmarks.  This paper describes models for determining which combination of run-time data- and iteration-reordering transformations be viewed as approximating minimal linear arrangements on two separate hypergraphs: a spatial locality hypergraph on two separate hypergraphs: a spatial locality hypergraph and a temporal locality hypergraph.  Our results measure the efficacy of locality metrics based on these hypergraphs in guiding the selection of data- and iteration-reordering heuristics.  We also introduce new iteration- and data-reordering heuristics based on the hypergraph models that result in better performance than do previous heuristics.
R. Kettimuthu and W. Allcock, "Improved Selective Acknowledgment Scheme for TCP," Preprint ANL/MCS-P1168-0604, June 2004. A selective acknowledgment (SACK) mechanism, combined with a selective repeat retransmission policy, has been proposed to overcome the limitations with the cumulative acknowledgment scheme in TCP,  With the SACK mechanism, the receiver informs the sender about the noncontiguous blocks of data that have been received and queued.  However, for each such noncontiguous block, SACK requires 8 bytes to convey the information to the sender.  Since TCP options field has a fixed length, an acknowledgment packet, at the maximum, can carry information about only 4 noncontiguous blocks.  Under some error conditions, this limitation can cause the TCP sender to retransmit packets that have already been received successfully by the receiver.  In this paper, we propose an improved selective acknowledgment (ISACK) scheme to overcome the limitations of the current selective acknowledgment scheme.  Using examples, we demonstrate how the proposed scheme works.  We further propose an adaptive selective acknowledgment (ASACK) strategy that dynamically switches between SACK and ISACK to give optimal performance.
R. Overbeek, T. Disz, and R. Stevens, "A Peer-to-Peer Environment for Annotation of Genomes: The SEED," Preprint ANL/MCS-P1169-0604, June 2004. A genome may be thought of as a set of genes that encode protein sequences.  The function of each gene is determined by the activity of the protein it encodes.  Genome annotation is the process of assigning functions to genes.  Functions are assigned by any of several methods.  The most direct form of function assignment involves determining the function of a gene by experiment.  Since vastly more gene sequences are available in genome databases than the number with directly determined functions, most genes are assigned a function by indirect methods.  These methods include assigning a function to a gene based on sequence similarity to genes with known function, assigning a function to a gene based on its position in a conserved gene cluster through comparative analysis of many genomes, and inferring function via other techniques for detection of functional coupling.  Genome annotation is an iterative process that can exploit a variety of domain knowledge sources.  For genes that code for enzymes involved in core metabolism, much is known about the biochemical reaction networks in which the enzymes participate.  The existence of a known reaction pathway (such as those available in biochemistry databases) can provide valuable information that supports inference of function through processes of systematic elimination.  We believe this approach will be highly valuable, especially in comparison to simple similarity methods (which are often unreliable, particularly in the case of paralogous genes, genes that have common ancestry but have evolved divergent functions).
W. van Drongelen, H. C. Lee, M. Hereld, D. Jones, M. Cohoon, F. Elsen, M. E. Papka, and R. L. Stevens, Neurocomputing 58-60 (2004), pp. 1203-1209.  Also Preprint ANL/MCS-P1170-0604, June 2004. A scalable network model intended for study of neocortical epileptiform activity was built on the pGENESIS neural simulator.  The model included superficial and deep pyramidal cells plus four types of inhibitory neurons.  An electroencephalogram (EEG) simulator was attached to the model to validate model behavior and to determine the contributions of inhibitory and excitatory neuronal populations to the EEG signal.  We examined effects of overall excitation and inhibition on activity patterns in the network and found that the network-bursting patterns occur within a narrow range of the excitation-inhibition space.  Further, we evaluated synchronization effects produced by gap junctions during synchronous and asynchronous states.
W. van Drongelen, H. C. Lee, M. Hereld, Z. Chen, F. P. Elsen, and R. L. Stevens, "Emergent Epileptiform Activity in Neural Networks with Weak Excitatory Synapses," IEEE Trans. Neural Syst. Rehab. Eng. 13(2), 2005, pp. 236-241.  Also Preprint ANL/MCS-P1171-0604, June 2004. Brain electrical activity recorded during an epileptic seizure is frequently associated with rhythmic discharges in cortical networks.  Current opinion in clinical neurophysiology is that strongly coupled networks and cellular bursting are prerequisites for the generation of epileptiform activity.  Contrary to expectations, we found that weakly coupled cortical networks can create synchronized cellular activity and seizure-like bursting.  Evaluation of a range of synaptic parameters in a detailed computational model revealed that seizure-like activity occurs when the excitatory synapses are weakened.  Guided by this observation, we confirmed experimentally that, in mouse neocortical slices, a pharmacological reduction of excitatory synaptic transmission elicited sudden onset of repetitive network bursting.  Our finding provides powerful evidence that onset of seizures can be associated with a reduction in synaptic transmission.  These results open a new avenue to explore network synchrony and may ultimately lead to a rational approach to treatment of network pathology in epilepsy.
W. van Drongelen, H. C. Lee, H. Koch, F. Elsen, M. S. Carroll, M. Hereld, R. L. Stevens, "Interaction between Cellular Voltage-Sensitive Conductance and Network Parameters in a Model of the Neocortex Can Generate Epileptiform Bursting," Preprint ANL/MCS-P1172-0604, June 2004.

We examined the effects of both intrinsic neuronal membrane properties and network parameters on oscillatory activity in a model of the neocortex.  A scalable network model with six different cell types was built with the pGENESIS neural simulator.  The neocortical network consisted of two types of pyramidal cells and four types of inhibitory interneurons.  All cell types contained both fast sodium and delayed rectifier potassium channels for generation of action potentials.  A subset of the pyramidal neurons contained an additional slow-inactivating (persistent) sodium current (NaP).  The neurons with the NaP current showed spontaneous bursting activity in the absence of external stimulation.  The model also included a routine to calculate a simulated electroencephalogram trace from the population activity.  This revealed emergent network behavior, which ranged from desynchronized activity to different types of seisurelike bursting patterns.  At settings with weaker excitatory network effects, the propensity to generate seizurelike behavior increased.  Strong excitatory network connectivity destroyed oscillatory behavior, whereas weak connectivity enhanced the relative importance of the spontaneously bursting cells.  Our findings contradict the general opinion that strong excitatory synaptic or insufficient inhibition effects are associated with seizure initiation, but our results agree with previously reported behavior in the neocortex.

C. E. Frouzakis, A. G. Tomboulides, P. Papas, P. F. Fischer, R. M. Rais, P. A. Monkewitz, and K. Boulouchos, "Three-Dimensional Numerical Simulations of Cellular Jet Diffusion Flames," presented at Combustion Institute, Chicago, IL, June 25-30, 2004.  Also Preprint ANL/MCS-P1173-0604, June 2004. Recent experimental investigations have demonstrated that the appearance of particular cellular states in circular non-premixed jet flames depends significantly on a number of parameters, including the initial mixture strength, reactant Lewis numbers, and proximity to the extinction limit (Damköhler number).  For CO2-diluted H2/O2 jet diffusion flames, these studies have shown that a variety of different cellular patterns or states can form.  For given fuel and oxidizer compositions, several preferred states were found to coexist, and the particular state realized was determined by the initial conditions.  In order to elucidate the dynamics of cellular instabilities, circular non-premixed jet flames are modeled with a combination of three-dimensional numerical simulation and linear stability analysis (LSA).  In both formulations, chemistry is described by a single-step, finite-rate reaction, and different reactant Lewis numbers and molecular weights are specified.  The three-dimensional numerical simulations show that different cellular flames can be obtained close to extinction and that different states coexist for the same parameter values.  Similar to the experiments, the behavior of the cell structures is sensitive to (numerical) noise.  During the transient blow-off process, the flame undergoes transitions to structures with different numbers of cells, while the flame edge close to the nozzle oscillates in the streamwise direction.  For conditions similar to the experiments discussed, the LSA results reveal various cellular instabilities, typically with azimuthal wavenumber m = 1-6.  Consistent with previous theoretical work, the propensity for the cellular instabilities is shown to increase with decreasing reactant Lewis number and Damköhler number.
W. Ligon and R. Ross, "Parallel I/O and the Parallel Virtual File System," in Beowulf Cluster Computing with Linux, second edition, edited by W. Gropp, E. Lusk, and T. Sterling, Cambridge, MA (2003), pp. 593-534.  Also Preprint ANL/MCS-P1174-0504, May 2004. Ever more frequently users of clusters find themselves in an interesting situation: it isn't the processors, communication network, or memory that is limiting their application; it is the storage system.  This might force the users to checkpoint less frequently than they would like, might limit the resolution of output visualization data, or might prevent the use of out-of-core solutions needed for the largest of problems.  What's worse, the I/O hardware in the system may indeed be adequate for the user's needs but may be begin used inefficiently by one of the many software layers involved.
S. J. Benson, L. McInnes, J. J. Moré, and J. Sarich, "Scalable Algorithms in Optimization: Computational Experiments," Preprint ANL/.MCS-P1175-0604, June 2004. We survey techniques in the Toolkit for Advanced Optimization (TAO) for developing scalable algorithms for mesh-based optimization problems on distributed architectures.  We discuss the distribution of the mesh, the computation of the gradient and the Hessian Matrix, and the use of preconditioners.  We show that these techniques, together with mesh sequencing, can produce results that scale with mesh size.
J. Lee, X. Ma, R. Ross, R. Thakur, M. Winslett, "RFS: Efficient and Flexible Remote File Access for MPI-IO," Preprint ANl/MCS-P1176-0604, June 2004. Scientific applications often need to access remote file systems.  Because of slow networks and large data size, however, remote I/O can become an even more serious performance bottleneck than local I/O performance. In
this work, we present RFS, a high-performance remote I/O facility for ROMIO, which is a well-known MPI-IO implementation. Our simple, portable, and flexible design eliminates the shortcomings of previous remote I/O efforts. In particular, RFS improves the remote I/O performance by adopting active buffering with threads (ABT), which hides I/O cost by aggressively buffering the output data using available memory and performing background I/O using threads while computation is taking place. Our experimental results show that RFS with ABT can significantly reduce the remote I/O visible cost, achieving up to 92% of the theoretical peak throughput.  The computation slowdown caused by concurrent I/O activities was 0.2-6.2%, which is dwarfed by the overall performance improvement in application turnaround time.
W. Yu, D. Buntinas, and D. K. Panda, "Scalable and High Performance NIC-Based All-to-All Broadcast Over Myrinet/GM," Preprint ANL/MCS-P1177-0604, June 2004. All-to-all broadcast is one of the common collective operations that involve dense communication between all processes in a parallel program.  Previously, programmable Network Interface Cards (NICs) have been leveraged to provide efficiently support for collective operations, including barrier, broadcast, and reduce.  This paper explores the characteristics of all-to-all broadcast and proposes different algorithms to exploit the potential advantages from NIC programmability.  Along with these algorithms, salient strategies have been utilized to provide scalable topology management, global buffer management, efficient communication processing, as well as reliability.  The resulting algorithms have been incorporated into a NIC-based collective protocol over Myrinet/GM.  Over 16 nodes the NIC-based all-to-all broadcast operations improves all-to-all broadcast bandwidth about three times, compared to the host-based all-to-all broadcast operation.  Furthermore, the NIC-based operations have better scalability to large systems and very low host CPU utilization.  To the best of the authors' knowledge, this paper is the first in the literature to report efficient NIC-based all-to-all broadcast algorithms over Myrinet/GM.
W. Gropp, R. Ross, and N. Miller, "providing Efficient I/O Redundancy in MPI Environments," Preprint ANL/MCS-P1178-0604, June 2004. Highly parallel applications often make use of either highly parallel file systems or large numbers of independent disks.  Either of these approaches can provide high data rates necessary for parallel applications.  However, the failure of a single disk or server can render the data useless.  Conventional techniques, such as those based on applying erasure correcting codes to each file write, are prohibitively expensive for massively parallel scientific applications because of the granularity of access at which codes are applied.  In this paper we demonstrate a scalable method for recovering from single disk failures that is optimized for typical scientific data sets.  This approach exploits coarser-grain (but precise) semantics to reduce the overhead of constructing recovery data and makes use of parallel computation (proportional to the data size and independent of number of processors) to construct data.  Experiments showing the efficiency of this approach on a cluster with independent disks are present, and a technique for hiding the creation of redundant data within the MPI-IO implementation is described.
L. C. McInnes, B. A. Allan, R. Armstrong, S. J. Benson, D. E. Bernholdt, T. L. Dahlgren, L. F. Diachin, M. Krishnan, J. A. Kohl, J. W. Larson, S. Lefantzi, J. Nieplocha, B. Norris, S. G. parker, J. Ray, and S. Zhou, "Parallel PDE-Based Simulations Using the Common Component Architecture," Preprint ANL/MCS-P1179-0704, July 2004. The complexity of parallel PDE-based simulations continues to increase as multimodel, multiphysics, and multi-institutional projects become widespread.  A goal of component-based software engineering in such large-scale simulations is to help manage this complexity by enabling better interoperability among various codes that have been independently developed by different groups.  The Common Component Architecture (CCA) Forum is defining a component architecture specification to address the challenges of high-performance scientific computing.  In addition, several execution frameworks, supporting infrastructure, and general-purpose components are being developed.  Furthermore, this group is collaborating with others in the high-performance computing community to design suites of domain-specific component interface specifications and underlying implementations.  This chapter discusses recent work on leveraging these CCA efforts in parallel PDE-based simulations involving accelerator design, climate modeling, combustion, and accidental figures and explosions.  We explain how component technology helps to address the different challenges posed by each of these applications, and we highlight how component interfaces built on existing parallel toolkits facilitate the reuse of software for parallel mesh manipulation, discretization, linear algebra, integration, optimization, and parallel data redistribution.  We also present performance data to demonstrate the suitability of this approach, we we discuss strategies for applying component technologies to both new and existing applications.
H. Zhang, K. Keahey, and W. Allcock, "Providing Data Transfer with QoS as Agreement-Based Service," Preprint ANL/MCS-P1180-0704, July 2004. Over the last decade, Grids have become a successful tool for providing distributed environments for secure and coordinated execution of applications.  The successful deployment of many realistic applications in such environments on a large scale has motivated their use in experimental science, where Grid-based computations are used to assist in ongoing experiments.  In such scenarios, quality of service (QoS) guarantees on execution as well as data transfer are desirable.  The recently proposed WS-Agreement model provides an infrastructure within which such quality of service can be negotiated and obtained.  We have designed and implemented a data transfer service that exposes an interface based on this model and defines agreements which guarantee that, within a certain confidence level, file transfer can be completed under a specified time.  The data transfer service accepts a client's request for data transfer and makes an agreement with the client based on QoS metrics (such as the transfer time and confidence level with which the service can be provided).  In our approach we use prediction as a base for formulating an agreement with the client, and we combine prediction and rate limiting to adaptively ensure that the agreement is met.
R. Latham, R. Ross, and R. Thakur, "The Impact of File Systems on MPI-IO Scalability," Preprint ANL/MCS-P1182-0604, June 2004. As the number of nodes in cluster systems continues to grow, leveraging scalable algorithms in all aspects of such systems becomes key to maintaining performance.  While scalable algorithms have been applied successfully in some areas of parallel I/O, many operations are still performed in an uncoordinated manner.  In this work we consider, in three file system scenarios, the possibilities for applying scalable algorithms to the many operations that make up the MPI-IO interface.  From this evaluation we extract a set of file system characteristics that aid in the implementation of scalable MPI-IO implementations.
G. Almási, C. Archer, J. G. Castaños, J. Gunnels, C. C. Erway, P. Heidelberger, X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, B. Steinmacher-burow, W. Gropp, B. Toonen, "The Design and Implementation of Message Passing Services for the BlueGene/L Supercomputer," Preprint ANL/MCS-P1183-0604, June 2004. The BlueGene/L supercomputer, with 65,536 dual-processor compute nodes, was designed from the group up to support efficient execution of massively parallel message passing programs.  Part of this support is an optimized implementation of MPI that leverages the hardware features of BlueGene/L.  MPI for BlueGene/L is implemented on top of a more basic message-passing infrastructure called the message layer.  This message layer can be used both to implement other higher-level libraries and directly by applications.  MPI and the message layer are used in the two modes of operation of BlueGene/L: coprocessor mode and virtual node mode.  Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine.  They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
T. Araki, "Autonomic WWW Server Management with Distributed Resources," Preprint ANL/MCS-P1185-0704, July 2004. If many people access a Web server at one time, the server might not be able to respond within an acceptable time or even provide the service.  Therefore, enough servers should be assigned to a service to guarantee quality of service.  But reserving a lot of resources for peak access is not cost effective, because these resources are idle most of the time.  In order to solve this problem, technologies called utility computing or autonomic computing have been proposed and are under development.  However, these technologies utilize resources only within one organization.  In this paper, we present an autonomic system architecture that use distributed resources leveraged by Grid technology.  In our architecture, computing resources are rented from different organizations.  Our architecture supports the J2EE system; hence, existing Web applications can be used without any modification.  In addition, our architecture considers the location of the resources when redirecting a request to a server and allocating a new server, thereby leading to better performance.  We adopted WS-Agreement as an interface for negotiating service-level agreements.  We have implemented and evaluated this system and confirmed the effectiveness of this architecture.
J. N. Lyness, "A Prototype Four-Dimensional Galerkin-Type Integral," RIMS (Kyoto Univ.) 41 (2005), pp. 843-851.  Also Preprint ANL/MCS-P1186-0604, June 2004, An error functional expansion is constructed for a four-dimensional integral over [0,1]4 whose integrand function has a singularity structure of a type that occurs in integrals in the Galerkin boundary element method.  We show with an example how extrapolation may be used to evaluate this integral.
G. G. Maisuradze, A. Kawano, D. L. Thompson, A. F. Wagner, and M. Minkoff, "Interpolating Moving Least-Squares Methods for Fitting Potential Energy Surfaces: Analysis of an Application to a Six-Dimensional System," Preprint ANL/MCS-P1187-0704, July 2004. The basic formal and numerical aspects of different degree interpolated moving least-squares (IMLS) methods are applied to a six-dimensional potential energy surface (PES) of the HOOH molecule, for which an analytic (“exact”) potential is available in the literature. The results of systematic investigations of the effects of weight function parameters, the degree and partial degree of IMLS, the number of data points allowed, and the optimal automatic point selection of data points up to full third-degree IMLS (TD-IMLS) fits are reported. With partial reduction of cross terms and automatic point selection the full 6D HOOH PES can be fit over a range of 100 kcal/mol to an accuracy of less than 1 kcal/mol with ~1350 ab initio points.
Y. Guo, A. Kawano, D. L. Thompson, A. F. Wagner, and M. Minkoff, "Interpolating Moving Least-Squares Methods for Fitting Potential Energy Surfaces: Applications to Classical Dynamics Calculations," Preprint ANL/MCS-P1188-0704, July 2004. As a continuation of our efforts to develop efficient and accurate interpolating moving least-squares (IMLS) methods for generating potential energy surfaces, for the first time we carry out classical trajectories and compute kinetics properties on higher degree IMLS surfaces.  In this study, we have investigated the choice of coordinate system, the range of points (i.e., the cut-off radius) used in fitting, strategies for selections of data points and basic elements.  We illustrate and test the method by applying it to hydrogen peroxide (HOOH).  In particular, reaction rates for the O-O bond breaking in HOOH are calculated on fitted surfaces using the classical trajectory approach to test the accuracy of the IMLS method for providing potentials for dynamics calculations.
B. Addis and S. Leyffer, "A Trust-Region Algorithm for Global Optimization," Preprint ANL/MCS-P1190-0804, August 2004. We consider the global minimization of a bound-constrained function with a so-called funnel structure.  We develop a two-phase procedure that uses sampling, local optimization, and Gaussian smoothing to construct a smooth model of the underlying funnel.  The procedure is embedded in a trust-region framework that avoids the pitfalls of a fixed sampling radius.  We present a numerical comparison to three popular methods and show that the new algorithm is robust and uses up to 20 times fewer local minimizations steps.
Y. Chen, B. F. Hobbs, S. Leyffer, and T. S. Munson, "Leader-Follower Equilibria for Electric Power and NOx Allowances Markets," Preprint ANL/MCS-P1191-0804, August 2004. This paper investigates the ability of the largest producer in an electricity market to manipulate both the electricity and emission allowances markets to its advantage. A Stackelberg game to analyze this situation is constructed in which the largest firm plays the role  of the leader, while the medium-sized firms are treated as Cournot followers with price-taking fringes that behave competitively in both markets. Since there is no explicit representation of the best-reply function for each follower, this Stackelberg game is formulated as a large-scale mathematical program with equilibrium constraints. The best-reply functions are implicitly represented by a set of nonlinear complementarity conditions. Analysis of the computed solution for the Pennsylvania - New Jersey - Maryland electricity market shows that the leader can gain substantial profits by withholding allowances and driving up NOx allowance costs for rival producers. The allowances price is higher than the corresponding price in the Nash-Cournot case, although the electricity prices are essentially the same.
R. Padmanabhan, W. McCune, R. Veroff, "Levi's Commutator Theorems for Cancellative Semigroups," Semigroup Forum 71 (2005), pp. 152-157.  Also Preprint ANL/MCS-P1192-0804, August 2004.
Levi's commutator theorems in group theory are generalized to cancellative semigroups.  The theorems involve associativity, distributivity, and class two nilpotence of commutator expressions. The cancellative semigroup theorems are proved by automated deduction. The work is related to conjecture of Padmanabhan, that is, if a certain class of statement is provable in group theory, then it is also provable in cancellative semigroups.
W. D. Gropp, "Issues in Accurate and Reliable Use of Parallel Computing in Numerical Programs," Preprint ANL/MCS-P1193-0804, August 2004. Parallel computing, as used in technical computing, is fundamentally about achieving higher performance than is possible with a uniprocessor.  The unique issues for accurate and reliable computing are thus directly related to this desire for performance.  This paper briefly covers the major types of parallel computers, the programming models used with them, and issues in the choice and implementation of algorithms.
U. Naumann and J. Utke, "Optimality-Preserving Elimination of Linearities in Jacobian Accumulation," Preprint ANL/MCS-P1194-0904, September 2004. We consider a mathematical function that is implemented in high-level programming languages such as C or Fortran.  This function is assumed to be differentiable in some neighborhood of a set of input arguments.  For available local partial derivatives of the arithmetic