In order to evaluate the performance gains of the partitioning strategy
described in the previous section, we consider the parallelisation of the
code shown in figure 4, which corresponds to the multiplication
of two upper triangular matrices [6] (the two outer
loops have been interchanged in order to increase the number of
unit stride array references).
Clearly, the loop nest
is canonical (see Definition 3.1), and, given that it has a depth
of 3, a partitioning scheme based on the proof described in
Theorem 3.1 may lead to perfect load balance.
We compared this scheme (henceforth denoted by CAN)
with three other compile-time mapping
schemes, subsequently denoted by the shorthands KAP, MARS,
and CYC; KAP corresponds to the mapping
strategy of the KAP auto-parallelising compiler (which is based
on scheduling chunks of consecutive iterations having a fixed size),
MARS corresponds to the mapping
strategy of the MARS experimental parallelising compiler [2]
(which is equivalent to compile-time partitioning into a number of
chunks of consecutive iterations equal to the number of processors),
and CYC corresponds to a cyclic (or wrap [5])
way of mapping the iterations onto processors (i.e., processor 0 executes
iterations
, processor 1 executes iterations
, in general, processor
,
,
executes iterations
).
The parallelised programs were run on a KSR1, using two different
values for N, and
; their execution time is shown
in Table 1. In both cases, KAP and MARS perform
worst of all while CAN performs best. The performance of CYC
is comparable with that of CAN when using a relatively small number
of processors; for more than 12 processors, CYC causes a large
number of cache misses.
Mapping | Number of processors | ||||||
N | scheme | 1 | 2 | 4 | 8 | 12 | 16 |
256 | KAP | 4.03 | 3.64 | 2.78 | 1.71 | 1.12 | 0.97 |
MARS | 3.99 | 3.63 | 2.76 | 1.68 | 1.06 | 0.95 | |
CYC | 4.02 | 2.05 | 1.07 | 0.55 | 0.40 | 0.35 | |
CAN | 3.99 | 2.03 | 1.04 | 0.51 | 0.38 | 0.29 | |
1024 | KAP | 527 | 458 | 308 | 174 | 128 | 110 |
MARS | 524 | 457 | 310 | 179 | 130 | 113 | |
CYC | 525 | 267 | 154 | 69 | 55 | 49 | |
CAN | 524 | 265 | 152 | 66 | 46 | 36 |