next up previous
Next: Banded SYR2K Up: Evaluation and Experimental Results Previous: Evaluation and Experimental Results

Upper Triangular Matrix Multiplication

The code for the first benchmark, shown below, performs the multiplication of two upper triangular $n \times n$ matrices.

       DOALL J=1,N
          DO I=1,J
             DO K=I,J
                A(I,J)=A(I,J)+B(I,K)*C(K,J)
             ENDDO
          ENDDO
       ENDDO

Clearly, the loop nest is canonical of depth 3 (see Definition 1), and a partitioning scheme based on (7), for $m=3$, may lead to perfect load balance; for comparison, the partitioning scheme corresponding to $m=2$ is also implemented. The two schemes are referred to as CAN-3 and CAN-2, respectively.

The load imbalance, $L$, computed in terms of the work associated with the assignment statement of the loop body, and the corresponding relative load imbalance, $L_R$, for two different values of N, $256$ and $1024$, are shown in Table 1. It can be seen that MARS and KAP exhibit high load imbalance, CAN-2 exhibits relatively smaller load imbalance, while the remaining three mapping schemes exhibit significantly smaller load imbalance; in all cases, CAN-3 exhibits the smallest value.

The partitioned programs were run on the KSR1, using the same two values of N; their performance is depicted in Figure 5 where the ideal line assumes linear speed-up. In both graphs, KAP and MARS perform worst of all while CAN-3 performs best; the performance of CAN-3 is comparable with that of CYC and BCS. These results appear consistent with the performance anticipated from the expected values of load imbalance shown in Table 1.


next up previous
Next: Banded SYR2K Up: Evaluation and Experimental Results Previous: Evaluation and Experimental Results
Rizos Sakellariou 2000-07-31