The second benchmark, banded symmetric rank-
update
(SYR2K), contains non-affine
bounds; the version we use is shown below.
DOALL I=1,MIN(N,2*BB-1)
DO J=MAX(1-BB,1-N),MIN(BB-I,N-I)
DO K=MAX(1,I+J),MIN(N+J,N)
C(-I-J+K+1,I)=C(-I-J+K+1,I)+A(K,-I-J+BB+1)*B(K,-J+BB)
& +A(K,-J+BB)*B(K,-I-J+BB+1)
ENDDO
ENDDO
ENDDO
Clearly, the loop nest is not canonical (see
Definition 1). However, converting the MIN and MAX
functions to IF statements, and removing the
latter by index set splitting (see Section
3.2), the code can be
transformed into four consecutive canonical loop nests assuming that
N
2*BB-1 [15]; this version
is denoted CAN-3t. For comparison, two additional mapping
schemes are also implemented; they are based on direct application
of the partitioning schemes described by (7) for
and
to the original loop nest, regardless of the fact that
the latter is not canonical. These two versions are henceforth
referred to as CAN-2 and CAN-3, respectively. No version
based on balanced chunk scheduling is implemented since loop nests
having bounds containing MIN and MAX functions do not
conform to its requirements.
The load imbalance,
, computed in terms of the work associated
with the assignment statement of the loop body, and the corresponding
relative load imbalance,
, for two different pairs of values
for N and BB,
, and
, are shown in
Table 2. It can be seen that MARS and KAP
exhibit high load imbalance, while the remaining
four schemes exhibit significantly smaller load imbalance;
CAN-3 exhibits, on average, the smallest value of load
imbalance.
The partitioned programs were run on the KSR1, using the same two pairs of values for N and BB; their performance is depicted in Figure 6. In the first case, in Figure 6.a, KAP and MARS perform worst of all, except when running on 16 processors where CYC performs worst of all. CAN-3t performs best of all when using less than 12 processors; equally good results are achieved by CAN-3 and, to some extent, CAN-2. CYC exhibits a somewhat odd behaviour; it performs nearly best of all when running on 12 processors, but performs worst of all when running on 16 processors, and nearly worst when running on 8 processors. This is due to a significant number of cache misses when the number of processors is a power of 2.
Similar remarks can be made about the results depicted in Figure 6.b. CAN-3t performs best of all; CAN-3 exhibits a comparable performance, but CAN-2 performs significantly worse. KAP and MARS perform worst of all except when using 8 or 16 processors; in these cases, CYC, whose performance also suffers from a high number of cache misses as in Figure 6.a, performs worst of all.
Another interesting observation, in connection with the load imbalance computed in Table 2 and the actual performance depicted in Figure 6, is that, although CAN-3t nearly always exhibits a higher load imbalance than CAN-3 (except when the number of processors is 2), its actual performance is generally better than that of CAN-3 (except when running on more than 12 processors, where the difference in the anticipated load imbalance between the two partitioning schemes becomes comparatively higher); the performance gains of CAN-3t are due to the elimination of MIN and MAX functions from the loop bounds (apart from those necessary for partitioning the outermost, parallel loop).