Next: Upper Triangular Matrix Multiplication Up: Compile-Time Minimisation of Load Previous: Generalized Loop Nests

Evaluation and Experimental Results

In order to evaluate the performance gains of the partitioning strategy described in the previous section, as compared to other compile-time strategies (as well as to those applied by parallelising compilers), a series of experiments has been conducted. Two routes have been followed for analysing the results: the first compares the values of load imbalance, computed as suggested in Section 2.1, when applying different mapping schemes; the second compares the resulting performance on a virtual shared memory computer, the KSR1. Our objective has been not only to evaluate the effectiveness of the schemes introduced in the previous section, but also to establish the appropriateness of the theoretical values for load imbalance as a means of justifying the selection of a particular mapping scheme.

Two benchmark programs are used. The schemes compared are denoted by the shorthands KAP, MARS, CYC, BCS, and CAN; KAP corresponds to the mapping strategy of the KAP auto-parallelising compiler, MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [3], CYC corresponds to a cyclic way of mapping the iterations onto processors (i.e., processor 0 executes iterations $1, p+1, 2p+1, \dots$ , processor 1 executes iterations $2, p+2, 2p+2, \dots$ , in general, processor , $0 \leq i \leq p-1$ , executes iterations $i+1+kp, k=0,1,2,\dots,n/p-1$ [11]), BCS corresponds to balanced chunk scheduling [10] (extended to support loop nests of depth ), and the general term CAN corresponds to the partitioning scheme described by (7). A suffix is added to CAN to distinguish between different values of and/or transformations applied; these are described below as appropriate.

Subsections

Next: Upper Triangular Matrix Multiplication Up: Compile-Time Minimisation of Load Previous: Generalized Loop Nests

Rizos Sakellariou 2000-07-31