TAO Benchmarks

 

In our work we have emphasized the benchmarking of the codes to make sure that performance is adequate for applications. In the results mentioned below we discuss results for two algorithms, the limited-memory variable metric algorithm (LMVM) and the gradient projection/conjugate gradient (GPCG) algorithm.

The limited-memory variable-metric algorithm (LMVM) requires only function and gradient information. This algorithm is especially useful for large scale problems when the Hessian matrix is either unavailable or too expensive to compute. In this algorithm the approximation to the Hessian matrix is not stored explicitly; instead, it is defined implicitly in terms of a fixed number of vectors.

We have benchmarked the LMVM algorithm on a minimal surface problem, which is a variational problem defined over a two-dimensional rectangular grid. In order to test the effectiveness of LMVM to solve large problems, we present sample results for a fixed-size problem with 2,560,000 variables that is run on varying numbers of processors of an IBM SP with 120 MHz P2SC nodes with two 128 MB memory cards each and a TB3 switch.

Processors Iterations Time (sec) Efficiency
4 1896 6360 100
8 1887 3160 100
16 1934 1602 99
32 1958 865 92
64 1853 423 94

These results show that LMVM has excellent parallel efficiency for large problems, where we achieve efficiency for this fixed-size problem in excess of 92% as processors range from four through sixty-four. At first sight it is surprising that the number of iterations varies with the number of processors. This behavior is to be expected from optimization algorithms that require a large number of iterations since optimization algorithms tend to be discontinuous with respect to input data. We note that the dominant operations in the LMVM algorithm involve vectors (e.g., AXPYs, dot products, norms); in particular, no matrix manipulations are needed since the Hessian matrix is not defined explicitly.

We have also benchmarked the GPCG algorithm on the journal bearing problem. We present sample results for a fixed-size problem with 2,560,000 variables on an IBM SP. The GPCG uses a gradient projection algorithm to predict the face where the solution lies, and a conjugate gradient algorithm to explore this face. Thus, at each iteration the GPCG algorithm must use a conjugate gradient algorithm to solve (approximately) a system of linear equations whose coefficient matrix is the submatrix of the Hessian of the quadratic with respect to the free variables for the current iterate. Since the free variables can change arbitrarily from iterate to iterate, solving these systems of linear equations may create a potential load-balancing bottleneck that must be addressed in an efficient implementation. In view of this description, our results below for GPCG are noteworthy in several ways.

Processors Faces Time (sec) Efficiency
8 46 3341 100
16 45 1781 94
32 46 928 90
64 46 522 80

First of all, the number of faces visited by GPCG is remarkably small. Other strategies can lead to a large number of gradient projection iterates, but the GPCG algorithm is remarkably efficient. Another interesting aspect is that because of the low memory requirements of iterative solvers, we are able to solve problems with over 2.5 million variables with only 8 processors. Strategies that rely on direct solvers are likely to need significantly more storage, and thus more processors. Finally, these results show that the GPCG implementation has excellent efficiency. For example, the efficiency of GPCG with respect to 8 processors ranges between 80% and 90%. This sustained efficiency is remarkable because the GPCG algorithm is solving a sequence of linear problems with a coefficient matrix set to the submatrix of the Hessian of the quadratic with respect to the free variables for the current iterate. Thus, our implementation's repartitioning of submatrices effectively deals with the load-balancing problem that is inherent in the GPCG algorithm or, more generally, active set methods. Additional details on these results can be found in this extended abstract.