Banded SYR2K

**Figure 5:** Performance of mapping schemes on the KSR1 for upper triangular matrix multiplication.
$\begin{figure}\begin{center} \noindent\epsfig{figure=ex3/ex3a.ps, width=88truemm... ...ps, width=88truemm}\end{center}\par {\em ~~~b) {\tt N}~$=1024$} \par\end{figure}$

The second benchmark, banded symmetric rank-

update (SYR2K), contains non-affine bounds; the version we use is shown below.

Clearly, the loop nest is not canonical (see Definition 1). However, converting the MIN and MAX functions to IF statements, and removing the latter by index set splitting (see Section 3.2), the code can be transformed into four consecutive canonical loop nests assuming that N

2*BB-1 [15]; this version is denoted CAN-3t. For comparison, two additional mapping schemes are also implemented; they are based on direct application of the partitioning schemes described by (7) for

and

to the original loop nest, regardless of the fact that the latter is not canonical. These two versions are henceforth referred to as CAN-2 and CAN-3, respectively. No version based on balanced chunk scheduling is implemented since loop nests having bounds containing MIN and MAX functions do not conform to its requirements.

5.24pt

`N`,	Mapping	Number of processors
`BB`	scheme							12		16

512,	KAP/MARS	1004896	.350	764592	.450	447832	.490	331685	.516	240300	.507
64	CYC	15360	.008	23056	.024	26936	.055	28645	.084	28940	.110
	CAN-2	992	.001	19216	.020	17920	.037	12597	.039	9920	.041
	CAN-3	8192	.004	1024	.001	128	.000	1633	.005	560	.002
	CAN-3t	2080	.001	31248	.032	60360	.114	73227	.191	31668	.120
1024,	KAP/MARS	30758272	.367	23767744	.473	13981024	.513	9924928	.529	7514800	.531
256	CYC	114688	.002	172096	.006	200928	.015	211168	.023	215600	.031
	CAN-2	1851264	.034	1478464	.053	1146880	.079	692496	.073	537360	.075
	CAN-3	524288	.010	65536	.002	8192	.001	22392	.003	1024	.000
	CAN-3t	128	.000	244800	.009	252960	.019	510110	.054	452880	.064

**Figure 6:** Performance of mapping schemes on the KSR1 for banded SYR2K.
$\begin{figure}\begin{center} \noindent\epsfig{figure=ex4/ex4a.ps, width=88truemm... ...m}\end{center}\par {\em ~~~b) {\tt N}~$=1024$, {\tt BB}~$=256$} \par\end{figure}$

The load imbalance,

, computed in terms of the work associated with the assignment statement of the loop body, and the corresponding relative load imbalance,

, for two different pairs of values for N and BB, $\{512, 64\}$ , and $\{1024, 256\}$ , are shown in Table 2. It can be seen that MARS and KAP exhibit high load imbalance, while the remaining four schemes exhibit significantly smaller load imbalance; CAN-3 exhibits, on average, the smallest value of load imbalance.

The partitioned programs were run on the KSR1, using the same two pairs of values for N and BB; their performance is depicted in Figure 6. In the first case, in Figure 6.a, KAP and MARS perform worst of all, except when running on 16 processors where CYC performs worst of all. CAN-3t performs best of all when using less than 12 processors; equally good results are achieved by CAN-3 and, to some extent, CAN-2. CYC exhibits a somewhat odd behaviour; it performs nearly best of all when running on 12 processors, but performs worst of all when running on 16 processors, and nearly worst when running on 8 processors. This is due to a significant number of cache misses when the number of processors is a power of 2.

Similar remarks can be made about the results depicted in Figure 6.b. CAN-3t performs best of all; CAN-3 exhibits a comparable performance, but CAN-2 performs significantly worse. KAP and MARS perform worst of all except when using 8 or 16 processors; in these cases, CYC, whose performance also suffers from a high number of cache misses as in Figure 6.a, performs worst of all.

Another interesting observation, in connection with the load imbalance computed in Table 2 and the actual performance depicted in Figure 6, is that, although CAN-3t nearly always exhibits a higher load imbalance than CAN-3 (except when the number of processors is 2), its actual performance is generally better than that of CAN-3 (except when running on more than 12 processors, where the difference in the anticipated load imbalance between the two partitioning schemes becomes comparatively higher); the performance gains of CAN-3t are due to the elimination of MIN and MAX functions from the loop bounds (apart from those necessary for partitioning the outermost, parallel loop).