To evaluate the performance gains of the partitioning strategy described in the previous section, as compared to other compile-time strategies (as well as to those applied by parallelising compilers), a series of experiments was conducted. The results are analysed in two ways: the first compares the values of load imbalance, computed as suggested in Section 2.1, under different mapping schemes; the second compares the resulting performance on a virtual shared memory computer, the KSR1. Our objective is not only to evaluate the effectiveness of the schemes introduced in the previous section, but also to establish whether the theoretical values for load imbalance are an appropriate means of justifying the selection of a particular mapping scheme.
Two benchmark programs are used. The schemes compared are denoted by the shorthands KAP, MARS, CYC, BCS, and CAN; KAP corresponds to the mapping strategy of the KAP auto-parallelising compiler, MARS corresponds to the mapping strategy of the MARS experimental parallelising compiler [3],
CYC corresponds to a cyclic mapping of the iterations onto processors (i.e., successive iterations are assigned to successive processors in round-robin order, wrapping around to processor 0 once every processor has received an iteration [11]),
BCS corresponds to balanced chunk scheduling [10] (extended to support loop nests of greater depth),
and the general term CAN corresponds to the partitioning scheme described by (7). A suffix is added to CAN to distinguish between different parameter values and/or transformations applied; these are described below as appropriate.
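
As an illustration of the cyclic mapping (CYC), the following minimal sketch lists the iterations each processor executes and contrasts the resulting work distribution with a contiguous block mapping. Several things here are assumptions made purely for illustration and are not taken from the text: iterations are numbered from 0, the workload is a hypothetical triangular loop nest in which outer iteration i costs i + 1 units, and the imbalance measure (maximum load minus mean load, divided by mean load) is a simple stand-in, not necessarily the definition of Section 2.1. All function names are hypothetical.

```python
def cyclic_iterations(proc, num_procs, num_iters):
    """Iterations (0-based, by assumption) that `proc` executes under a cyclic mapping."""
    return list(range(proc, num_iters, num_procs))


def block_iterations(proc, num_procs, num_iters):
    """Iterations `proc` executes under a contiguous block mapping (shown for contrast)."""
    chunk = -(-num_iters // num_procs)  # ceiling division
    start = proc * chunk
    return list(range(start, min(start + chunk, num_iters)))


def triangular_work(i):
    """Hypothetical cost of outer iteration i: an inner loop that runs i + 1 times."""
    return i + 1


def per_processor_work(mapping, num_procs, num_iters):
    """Total work assigned to each processor by the given mapping function."""
    return [
        sum(triangular_work(i) for i in mapping(p, num_procs, num_iters))
        for p in range(num_procs)
    ]


def relative_imbalance(work):
    """Illustrative measure only: (max load - mean load) / mean load."""
    mean = sum(work) / len(work)
    return (max(work) - mean) / mean


if __name__ == "__main__":
    procs, iters = 4, 16
    for name, mapping in (("cyclic", cyclic_iterations), ("block", block_iterations)):
        work = per_processor_work(mapping, procs, iters)
        print(name, work, round(relative_imbalance(work), 3))
```

Under these assumptions the cyclic mapping spreads the expensive late iterations across all processors, whereas the block mapping concentrates them on the last processor; this is the kind of difference that the load imbalance values are meant to expose.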