The code for the first benchmark, shown below,
performs the multiplication of
two upper triangular
matrices.
DOALL J=1,N
DO I=1,J
DO K=I,J
A(I,J)=A(I,J)+B(I,K)*C(K,J)
ENDDO
ENDDO
ENDDO
Clearly, the loop nest
is canonical of depth 3 (see Definition 1), and a partitioning scheme
based on (7), for
, may lead
to perfect load balance; for comparison, the partitioning scheme
corresponding to
is also implemented. The two schemes
are referred to as CAN-3 and CAN-2, respectively.
The load imbalance,
, computed in terms of the
work associated with the assignment statement of the loop body,
and the corresponding relative load imbalance,
, for two different
values of N,
and
, are shown in Table 1.
It can be seen that MARS and KAP exhibit high
load imbalance, CAN-2 exhibits relatively smaller
load imbalance, while the remaining three mapping schemes exhibit
significantly smaller load imbalance; in all cases, CAN-3
exhibits the smallest value.
The partitioned programs were run on the KSR1, using the same two values of N; their performance is depicted in Figure 5 where the ideal line assumes linear speed-up. In both graphs, KAP and MARS perform worst of all while CAN-3 performs best; the performance of CAN-3 is comparable with that of CYC and BCS. These results appear consistent with the performance anticipated from the expected values of load imbalance shown in Table 1.