

## Chapter 8 Intra-Node Tuning

FUJITSU LIMITED April 2016

Copyright 2016 FUJITSU LIMITED

### Contents (1/2)



#### CPU Tuning

- What Is CPU Tuning (Intra-Node Tuning)?
- Positioning of CPU Tuning

#### How to Effectively Use PA Information and Tuning Flows

### Contents (2/2)



- Navigation from PA Information to Tuning Techniques
  - Tuning Map
  - Tuning Technique List
- Scalar Tuning
  - Improvement in Data Access Wait (Improvement in Thrashing)
  - Improvement in Data Access Wait (Increase in Data Locality)
  - Improvement in Data Access Wait (Latency Concealment)
  - Improvement in Data Access Wait (Reduced Amount of Access)
  - Improvement in Operation Wait (Instruction Scheduling Improvement)
- Thread Parallelization Processing Tuning
  - Thread Parallelization Ratio Improvement
  - Execution Efficiency Improvement of Thread Parallelization Processing



### **CPU** Tuning

- What Is CPU Tuning (Intra-Node Tuning)?
- Positioning of CPU Tuning

### What Is CPU Tuning (Intra-Node Tuning)?



CPU tuning (Intra-Node tuning) improves execution efficiency on a multi-core CPU.

The types of CPU tuning are scalar tuning and thread parallelization processing tuning.

Approaches to improvement for both types include source tuning, optimization control line tuning, and compiler option tuning.

#### Scalar tuning

This tuning improves execution efficiency on a multi-core CPU by focusing on the performance of the individual cores.

#### Thread parallelization processing tuning

This tuning improves the thread parallelization ratio and execution efficiency of thread parallelization processing on a multi-core CPU.

### Positioning of CPU Tuning







### How to Effectively Use PA Information and Tuning Flows

- How to Effectively Use PA Information
- Tuning Flow
  - 1. Hot Spot Detection
  - 2. PA Information Collection
  - 3. Breakdown to the Level of Hot Spots
  - 4. Analysis and Diagnosis: Hot Spot (1)
  - 5. Measures and Effects: Hot Spot (1)

### How to Effectively Use PA Information

#### Understanding bottlenecks

You can determine bottlenecks in the entire evaluation region under focus (except input/output and communication), from PA information for the entire evaluation region.



PA information must be broken down to the level of loops and analyzed.

### 1. Hot Spot Detection



First, detect hot spots in the evaluation region under focus. To detect hot spots, use the sampling region specification function of fipp.

#### What is the sampling region specification function?

You can collect cost information for the specified region by using the sampling region specification function. To specify a measurement section in the source code, insert C or C++ functions or Fortran subroutines at the start and end points of cost information measurement.

| Function name | Function          | Insertion point                                              |
|---------------|-------------------|--------------------------------------------------------------|
| fipp_start    | Measurement start | `call fipp_start()` immediately before the evaluation region |
| fipp_stop     | Measurement end   | `call fipp_stop()` immediately after the evaluation region   |

Insertion diagram: the "entire evaluation region" is enclosed by the sampling region specification functions.

- \* If the evaluation region under focus is the entire program, the sampling region specification function is not needed.
- \* For details on the sampling region specification function, see the tutorial in "Chapter 7 Tuning Tool."
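As a minimal sketch of this insertion pattern in C, the fragment below brackets an evaluation region with `fipp_start` and `fipp_stop`. The two functions are defined here as local stand-in stubs so that the fragment compiles without the profiler; in a real build they would come from the profiler library. The stub bodies, the counters, and the `evaluation_region` kernel are illustrative assumptions, not part of the tool.

```c
#include <stddef.h>

/* Stand-in stubs (assumption): in a real build these come from the
 * profiler library; here they only record that they were called. */
static int fipp_active = 0;   /* 1 while sampling is "on" */
static int fipp_regions = 0;  /* number of completed regions */

void fipp_start(void) { fipp_active = 1; }
void fipp_stop(void)
{
    if (fipp_active) {
        fipp_active = 0;
        fipp_regions++;
    }
}

/* Hypothetical kernel standing in for the evaluation region. */
static double evaluation_region(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * a[i];
    return sum;
}

/* Input/output and setup stay outside the sampled region. */
double run_with_sampling(const double *a, size_t n)
{
    fipp_start();                        /* measurement start */
    double r = evaluation_region(a, n);
    fipp_stop();                         /* measurement end */
    return r;
}
```

Keeping the start and end calls in one small wrapper makes it harder to leave a region open by accident.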

### 2. PA Information Collection



Here, collect PA information for detected hot spots. Use the advanced profiler routines of fapp (precision PA) because analysis requires highly precise PA information.

#### Advanced profiler routines (precision PA)

The routines are C and C++ functions and Fortran subroutines for specifying a measurement section for PA information. By specifying a collection section in the source code, you can collect highly precise information.

| Function name    | Function                      |
|------------------|-------------------------------|
| start_collection | Information measurement start |
| stop_collection  | Information measurement end   |

#### Entire evaluation region

Insertion diagram
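The collection section can be sketched in C as below. This is an illustration under stated assumptions: `start_collection` and `stop_collection` are defined as local stand-in stubs so that the fragment compiles without the profiler, the string argument that names the section is assumed rather than taken from this chapter, and `hot_spot_1` is a hypothetical loop.

```c
#include <string.h>

/* Stand-in stubs (assumption): the real routines are provided by the
 * profiler; the name argument used here is also an assumption. */
static char open_region[32] = "";
static int  closed_regions = 0;

void start_collection(const char *name)
{
    strncpy(open_region, name, sizeof open_region - 1);
    open_region[sizeof open_region - 1] = '\0';
}

void stop_collection(const char *name)
{
    if (strcmp(open_region, name) == 0) {   /* start and stop must pair up */
        open_region[0] = '\0';
        closed_regions++;
    }
}

/* Hypothetical hot-spot loop: y = y + alpha * x */
void hot_spot_1(double *y, const double *x, double alpha, int n)
{
    start_collection("hot_spot_1");   /* information measurement start */
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
    stop_collection("hot_spot_1");    /* information measurement end */
}
```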



### 3. Breakdown to the Level of Hot Spots





Chapter 8 How to Effectively Use PA Information and Tuning Flows

### 4. Analysis and Diagnosis: Hot Spot (1)




### 5. Measures and Effects: Hot Spot (1)






### Analysis and Tuning of Each Hot Spot

- (Duplicate) Hot Spot (1): IF Construct in the Innermost Loop (Analysis and Diagnosis)
- (Duplicate) Hot Spot (1): IF Construct in the Innermost Loop (Measures and Effects)
- Hot Spot (2): Stride Access (Analysis and Diagnosis)
- Hot Spot (2): Stride Access (Measures and Effects)
- Hot Spot (3): Ideal Operation (Analysis and Diagnosis)
- Hot Spot (4): Data Dependency (Analysis and Diagnosis)
- Entire Evaluation Region (Measures and Effects)
- Summary

#### Hot Spot (1): IF Construct in the Innermost Loop (Analysis and Diagnosis)





#### Hot Spot (1): IF Construct in the Innermost Loop (Measures and Effects)





Chapter 8 Analysis and Tuning of Each Hot Spot

#### Hot Spot (2): Stride Access (Analysis and Diagnosis)





#### Hot Spot (2): Stride Access (Measures and Effects)






#### Hot Spot (3): Ideal Operation (Analysis and Diagnosis)





#### Hot Spot (4): Data Dependency (Analysis and Diagnosis)





### Entire Evaluation Region (Measures and Effects)



#### PA graph



### Summary



- You can determine bottlenecks from the PA graph of an entire evaluation region.
- Bottleneck factors often differ from loop to loop. For this reason, a breakdown to the level of loops is necessary to analyze each loop, determine whether CPU tuning is possible, and decide what measures to take.



### Navigation from PA Information to Tuning Techniques

- Tuning Map
- Tuning Technique List

### Navigation from PA Information to Tuning Techniques



- The tuning map is useful for determining a specific tuning method from PA information.
- Tuning map
  - The tuning map is a list of tuning proposals organized by bottleneck type. For each bottleneck classification, it shows which PA information to check and the bottleneck factors (conditions) that occur, and it summarizes the measures (tuning proposals: what to improve) for resolving them.
    - $\Rightarrow$  1. Identify bottleneck factors from PA information.
    - $\Rightarrow$  2. Present measures (tuning proposals) for removing bottlenecks.
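The two steps above amount to a table lookup, which the following C sketch illustrates. The struct layout, the function name `lookup_proposal`, and the choice of rows are assumptions made for illustration; the row contents are taken from the tuning map, but the table here is deliberately incomplete.

```c
#include <stddef.h>
#include <string.h>

/* One row of the tuning map: a PA observation and the proposal for it. */
struct tuning_map_row {
    const char *classification;
    const char *observation;   /* high cost seen in PA graph or PA information */
    const char *proposal;
};

/* A few rows taken from the tuning map; the table is not complete. */
static const struct tuning_map_row tuning_map[] = {
    { "Memory bottleneck",   "High memory busy rate",
      "Improvement in data access wait" },
    { "L2 cache bottleneck", "High L2 busy rate",
      "Improvement in data access wait" },
    { "L1 cache bottleneck", "High L1 busy rate",
      "Improvement in data access wait" },
    { "TLB bottleneck",      "High percentage of uDTLB misses",
      "Improvement in the TLB bottleneck" },
};

/* Step 1: identify the bottleneck from the observation.
 * Step 2: return the matching row, or NULL if none is listed. */
const struct tuning_map_row *lookup_proposal(const char *observation)
{
    size_t n = sizeof tuning_map / sizeof tuning_map[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(tuning_map[i].observation, observation) == 0)
            return &tuning_map[i];
    return NULL;
}
```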

#### Tuning technique list

- This list summarizes various tuning techniques by tuning proposal.
   Select an effective tuning technique for improvement.
- For examples of actual measures, see the scalar tuning examples.

### Tuning Map (1/12)



| Bottleneck classification | High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|---|
| Memory bottleneck | No instruction commit due to memory access for a floating-point load instruction | - | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| Memory bottleneck | No instruction commit due to memory access for an integer load instruction | - | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| Memory bottleneck | No instruction commit because SP (store port) is full | - | The store instruction cost is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement<br>- High-speed store (XFILL) |
| Memory bottleneck | No instruction commit due to memory and cache busy | - | Memory throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- High-speed store (XFILL) |
| Memory bottleneck | - | High memory busy rate | Memory throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- High-speed store (XFILL) |
| Memory bottleneck | - | High percentage of L2 misses<br>High percentage of L2 misses due to dm | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- Prefetch-related improvement<br>- Thrashing |
| L2 cache bottleneck | No instruction commit due to L2 access for a floating-point load instruction | - | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| L2 cache bottleneck | No instruction commit due to L2 access for an integer load instruction | - | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| L2 cache bottleneck | - | High L2 busy rate | L2 cache throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking |
| L2 cache bottleneck | - | High percentage of L1D misses<br>High percentage of L1D misses due to dm | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Thrashing |
| L1 cache bottleneck | No instruction commit due to L1D access for a floating-point load instruction | - | L1 cache latency is a bottleneck. | Instruction scheduling improvement |
| L1 cache bottleneck | No instruction commit due to L1D access for an integer load instruction | - | L1 cache latency is a bottleneck. | Instruction scheduling improvement |
| L1 cache bottleneck | - | High L1 busy rate | L1 cache throughput is a bottleneck. | Improvement in data access wait<br>- Algorithm review |
| Scheduling bottleneck | No instruction commit waiting for a floating-point instruction to be completed | - | Operation instruction latency is a bottleneck. | Instruction scheduling improvement |
| Scheduling bottleneck | No instruction commit waiting for an integer instruction to be completed | - | Operation instruction latency is a bottleneck. | Instruction scheduling improvement |
| Scheduling bottleneck | No instruction commit waiting for a branch instruction to be completed | - | A branch instruction is a bottleneck. | Instruction scheduling improvement<br>- IF statement removal<br>- Masked SIMD |
| Parallelization bottleneck | Synchronous waiting time between threads | - | A part that is not thread parallelization is a bottleneck. | Thread parallelization ratio improvement |
| Load imbalance bottleneck | Synchronous waiting time between threads | Large difference in the instruction balance between max and min | A load imbalance between threads is a bottleneck. | Execution efficiency improvement of thread parallelization processing |
| TLB bottleneck | - | High percentage of mDTLB misses | TLB misses and TLB thrashing are a bottleneck. | Improvement in the TLB bottleneck<br>- Elimination of thrashing<br>- Change of areas used<br>- Optimization using large page options |
| TLB bottleneck | - | High percentage of uDTLB misses | TLB misses are a bottleneck. | Improvement in the TLB bottleneck<br>- Page size expansion |
| Instruction fetch | No instruction commit waiting for an instruction to be fetched | - | Instruction cache misses and thrashing are a bottleneck. | Improvement in instruction fetch<br>- Reduction in the loop body<br>- Algorithm review<br>- Elimination of thrashing |
| Instruction count bottleneck | Instruction commit<br>- Four instructions commit<br>- Two or three instructions commit<br>- One instruction commit | - | The number of instructions is a bottleneck. | Improvement in the instruction count bottleneck<br>- Facilitation of SIMD optimization<br>- Prefetch-related improvement<br>- Inline expansion |

Chapter 8 Navigation from PA Information to Tuning Techniques

### Tuning Map (2/12)

#### Bottleneck classifications



| Bottleneck classification | Cost as seen from PA graph |
|---|---|
| Memory bottleneck | No instruction commit due to memory access for a floating-point load instruction |
| Memory bottleneck | No instruction commit due to memory access for an integer load instruction |
| Memory bottleneck | No instruction commit because SP (store port) is full |
| Memory bottleneck | No instruction commit due to memory and cache busy |
| L2 cache bottleneck | No instruction commit due to L2 access for a floating-point load instruction |
| L2 cache bottleneck | No instruction commit due to L2 access for an integer load instruction |
| L1 cache bottleneck | No instruction commit due to L1D access for a floating-point load instruction |
| L1 cache bottleneck | No instruction commit due to L1D access for an integer load instruction |
| Scheduling bottleneck | No instruction commit waiting for a floating-point instruction to be completed |
| Scheduling bottleneck | No instruction commit waiting for an integer instruction to be completed |
| Scheduling bottleneck | No instruction commit waiting for a branch instruction to be completed |
| Parallelization bottleneck | Synchronous waiting time between threads |
| Load imbalance bottleneck | Synchronous waiting time between threads |
| TLB bottleneck | - |
| Instruction fetch | No instruction commit waiting for an instruction to be fetched |
| Instruction count bottleneck | Four instructions commit<br>Two or three instructions commit<br>One instruction commit |

Left: Bottleneck classifications Right: Costs as seen from PA graph


### Tuning Map (3/12)

#### Memory bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| No instruction commit due to memory access for a floating-point load instruction | - | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| No instruction commit due to memory access for an integer load instruction | - | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| No instruction commit because SP (store port) is full | - | The store instruction cost is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement<br>- High-speed store (XFILL) |
| No instruction commit due to memory and cache busy | - | Memory throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- High-speed store (XFILL) |
| - | High memory busy rate | Memory throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- High-speed store (XFILL) |
| - | High percentage of L2 misses<br>High percentage of L2 misses due to dm | Memory latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- Prefetch-related improvement<br>- Thrashing |
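Loop blocking, which the table above proposes when memory throughput is the bottleneck, can be sketched in C as follows. The example is a generic blocked matrix transpose, not code from this chapter; the function name and the tile size `BLOCK` are assumptions, and a suitable tile size depends on the cache of the target CPU.

```c
#include <stddef.h>

#define BLOCK 32  /* assumed tile edge; tune to the cache size */

/* Transpose an n x n row-major matrix tile by tile, so that the cache
 * lines of each src and dst tile are reused while still in cache. */
void transpose_blocked(double *dst, const double *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clip the tile at the matrix edge. */
            size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
            size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The result is identical to an unblocked transpose; only the order in which elements are visited changes.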

FUJITSU

### Tuning Map (4/12)

#### L2 cache bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| No instruction commit due to L2 access for a floating-point load instruction | - | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| No instruction commit due to L2 access for an integer load instruction | - | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Prefetch-related improvement |
| - | High L2 busy rate | L2 cache throughput is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking |
| - | High percentage of L1D misses<br>High percentage of L1D misses due to dm | L2 cache latency is a bottleneck. | Improvement in data access wait<br>- Dimensional displacement of an array<br>- Loop blocking<br>- Thrashing |
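The following C sketch illustrates one reading of "dimensional displacement of an array": reordering the array dimensions so that the innermost loop reads consecutive elements instead of striding through memory. That interpretation, the function names, and the flat row-major layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Before: element (i, k) is stored at a[i * m + k]; a loop over i with k
 * fixed touches memory with a stride of m elements. */
double sum_component_strided(const double *a, size_t n, size_t m, size_t k)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i * m + k];
    return s;
}

/* After: the dimensions are swapped, so element (i, k) is stored at
 * b[k * n + i] and the same loop reads consecutive elements. */
double sum_component_contiguous(const double *b, size_t n, size_t m, size_t k)
{
    (void)m;  /* kept only so that both functions share a signature */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += b[k * n + i];
    return s;
}
```

Both functions return the same sum when `b` holds the same data as `a` with the two dimensions exchanged; the second version uses every element of each cache line it loads.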


### Tuning Map (5/12)

#### L1 cache bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| No instruction commit due to L1D access for a floating-point load instruction | - | L1 cache latency is a bottleneck. | Instruction scheduling improvement |
| No instruction commit due to L1D access for an integer load instruction | - | L1 cache latency is a bottleneck. | Instruction scheduling improvement |
| - | High L1 busy rate | L1 cache throughput is a bottleneck. | Improvement in data access wait<br>- Algorithm review |



### Tuning Map (6/12)

#### Scheduling bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| No instruction commit waiting for a floating-point instruction to be completed | - | Operation instruction latency is a bottleneck. | Instruction scheduling improvement |
| No instruction commit waiting for an integer instruction to be completed | - | Operation instruction latency is a bottleneck. | Instruction scheduling improvement |
| No instruction commit waiting for a branch instruction to be completed | - | A branch instruction is a bottleneck. | Instruction scheduling improvement<br>- IF statement removal<br>- Masked SIMD |
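IF statement removal, proposed above for branch bottlenecks, can be illustrated with a small C sketch. The loop and the function names are hypothetical. The second version replaces the branch with a select expression, a form that a compiler may be able to turn into masked SIMD code, although whether it does so depends on the compiler and its options.

```c
#include <stddef.h>

/* Before: a data-dependent branch in the innermost loop. */
void add_positive_branch(double *y, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0)
            y[i] += x[i];
    }
}

/* After: the IF is replaced by a select, so every iteration executes
 * the same instruction sequence and always stores to y[i]. */
void add_positive_select(double *y, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double inc = x[i] > 0.0 ? x[i] : 0.0;
        y[i] += inc;
    }
}
```

Note that the select version writes every element of `y`, including those it leaves unchanged, which trades extra stores for the absence of a branch.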



### Tuning Map (7/12)

#### Parallelization bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| Synchronous waiting time between threads | - | A part that is not thread parallelization is a bottleneck. | Thread parallelization ratio improvement |
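Why the thread parallelization ratio matters can be seen from Amdahl's law, given here as a general illustration rather than something read from the PA output. With $p$ the fraction of execution time that is thread-parallelized and $n$ the number of threads (both symbols are introduced here for illustration), the ideal speedup is

$$S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}$$

For example, with $n = 16$ threads, $p = 0.9$ gives $S = 1 / (0.1 + 0.05625) = 6.4$, while $p = 0.99$ gives $S = 1 / (0.01 + 0.061875) \approx 13.9$. Even a small serial part therefore limits the benefit of adding threads.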



### Tuning Map (8/12)

#### Load imbalance bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| Synchronous waiting time between threads | Large difference in the instruction balance between max and min | A load imbalance between threads is a bottleneck. | Execution efficiency improvement of thread parallelization processing |
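One simple way to quantify the difference between max and min is the ratio of the largest to the smallest per-thread instruction count, sketched below in C. The function name and the choice of a ratio as the metric are assumptions for illustration; the PA tool may present this information differently.

```c
#include <stddef.h>

/* Ratio of the largest to the smallest per-thread instruction count.
 * 1.0 means perfectly balanced. Returns 0.0 if there are no threads or
 * if some thread executed no instructions. */
double instruction_imbalance(const unsigned long long *count, size_t nthreads)
{
    if (nthreads == 0)
        return 0.0;

    unsigned long long lo = count[0];
    unsigned long long hi = count[0];
    for (size_t t = 1; t < nthreads; t++) {
        if (count[t] < lo) lo = count[t];
        if (count[t] > hi) hi = count[t];
    }

    if (lo == 0)
        return 0.0;
    return (double)hi / (double)lo;
}
```

A value well above 1.0 suggests that some threads wait at synchronization points while others are still working.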



### Tuning Map (9/12)

#### TLB bottleneck

| High cost as seen from PA graph | High cost as seen from PA information | Condition | Tuning proposal |
|---|---|---|---|
| - | High percentage of mDTLB misses | TLB misses and TLB thrashing are a bottleneck. | Improvement in the TLB bottleneck<br>- Elimination of thrashing<br>- Change of areas used<br>- Optimization using large page options |
| - | High percentage of uDTLB misses | TLB misses are a bottleneck. | Improvement in the TLB bottleneck<br>- Page size expansion |


### Tuning Map (10/12)

#### Instruction fetch

| High cost as seen from<br>PA graph                                   | High cost as seen from PA information | Condition                                                   | Tuning proposal                                                                                                      |
|----------------------------------------------------------------------|---------------------------------------|-------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| No instruction commit<br>waiting for an instruction<br>to be fetched |                                       | Instruction cache misses and<br>thrashing are a bottleneck. | Improvement in instruction fetch<br>- Reduction in the loop body<br>- Algorithm review<br>- Elimination of thrashing |



### Tuning Map (11/12)

#### Instruction count bottleneck

| High cost as seen from PA graph |                                                                                                 | High cost as seen from PA information | Condition | Tuning proposal                                                                                                                                   |
|---------------------------------|-------------------------------------------------------------------------------------------------|---------------------------------------|-----------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Instruction<br>commit           | Four instructions<br>commit<br>Two or three<br>instructions commit<br>One instruction<br>commit |                                       |           | Improvement in the instruction<br>count bottleneck<br>- Facilitation of SIMD optimization<br>- Prefetch-related improvement<br>- Inline expansion |



### Tuning Map (12/12)

#### Other

| High cost as seen from<br>PA graph         | High cost as seen from<br>PA information | Condition                                 | Tuning proposal  |
|--------------------------------------------|------------------------------------------|-------------------------------------------|------------------|
| No instruction commit<br>for other reasons | _                                        | PA information may not have been collected correctly. | PA re-collection |



# Tuning Technique List (1/2)


## Major classifications

| Thread parallelization ratio<br>improvement | Execution efficiency<br>improvement of<br>thread parallelization<br>processing | | Instruction scheduling<br>improvement | Improvement in the<br>TLB bottleneck | Improvement in<br>instruction fetch | Improvement in the<br>instruction count<br>bottleneck |
|---|---|---|---|---|---|---|
| NORECURRENCE specifier<br>(Facilitation of automatic<br>parallelization) | | Dimensional<br>displacement of an array | Software pipelining | | Reduction in the loop<br>body | Facilitation of SIMD<br>optimization |
| NOALIAS specifier<br>(Facilitation of automatic<br>parallelization) | Parallelized<br>dimension change | Prefetch-related<br>improvement | Unrolling | | Algorithm review<br>(Expansion of the problem scale) | Prefetch-related<br>improvement |
| Peeling<br>(Facilitation of automatic<br>parallelization) | Division method<br>change (Cyclic) | | Facilitation of SIMD optimization | Change of areas used | Elimination of thrashing | Inline expansion |
| OpenMP parallelization | Division method<br>change (Dynamic) | Loop blocking | IF statement removal | Optimization using<br>large page options | | |
| | Parallelization<br>algorithm review | | Masked SIMD | | | |
| | | Algorithm review<br>(Reducing the memory<br>access instruction ratio) | Outer unrolling | | | |
| | | Array division | Suppression of software<br>pipelining & specification of the<br>number of unrollings (Loop with<br>a few iterations) | | | |
| | | Loop fission | Rerolling | | | |
| | | Strip mining | Peeling | | | |
| | | Sector cache | NORECURRENCE specifier | | | |
| | | Loop interchange | NOALIAS specifier | | | |
| | | Loop fusion | | | | |
| | | Array merging | | | | |

\* The colored items represent medium classifications. If applicable, go to the next page.

# Tuning Technique List (2/2)

## Medium classifications

| Prefetch-related improvement        | Facilitation of SIMD optimization                                   | Elimination of thrashing                | Reduction in the loop body         |
|-------------------------------------|---------------------------------------------------------------------|-----------------------------------------|------------------------------------|
| Addition of prefetching             | Changing arrays to simple variables                                 | Padding                                 | Suppression of software pipelining |
| Deletion of unnecessary prefetching | Loop unswitching                                                    | Dimensional<br>displacement of an array | Suppression of unrolling           |
| Prefetching toward the outer loop   | IF statement removal                                                | Array merging                           | Suppression of loop fusion         |
| Indirect access prefetching         | Rerolling                                                           | Reduction in the loop<br>body           | Loop fission                       |
|                                     | Inline expansion                                                    |                                         |                                    |
|                                     | Loop fission (separating dependent<br>accesses)                     |                                         |                                    |
|                                     | Loop fission (loop extraction) for a part<br>with a high true ratio |                                         |                                    |
|                                     | Cloning                                                             |                                         |                                    |
|                                     | NORECURRENCE specifier                                              |                                         |                                    |
|                                     | NOALIAS specifier                                                   |                                         |                                    |




# Scalar Tuning

- Improvement in Data Access Wait (Improvement in Thrashing)
- Improvement in Data Access Wait (Increase in Data Locality)
- Improvement in Data Access Wait (Latency Concealment)
- Improvement in Data Access Wait (Reduced Amount of Access)
- Improvement in Operation Wait (Instruction Scheduling Improvement)



## Improvement in Data Access Wait (Improvement in Thrashing)

- Improvement in Cache Thrashing
- Improvement in TLB Thrashing



# Improvement in Cache Thrashing

- What Is Cache Thrashing?
- Tuning Approach to Cache Thrashing (Basics)
- Tuning Approach to Cache Thrashing (Application)

# What Is Cache Thrashing?





# Tuning Approach to Cache Thrashing (Basics)

- Tuning Approach (Basics)
- Array Merging
- Dimensional Displacement of an Array
- Loop Fission
- Padding


# Tuning Approach (Basics)








# Array Merging

- What Is Array Merging?
- Array Merging (Before Improvement)
- Effects of Array Merging (Source Tuning)
- Array Merging (in C Language) (Before Improvement)
- Effects of Array Merging (in C Language) (Source Tuning)
- Effects of Array Merging (Compiler Options Tuning)

# What Is Array Merging?



Array merging is a tuning technique that merges multiple arrays into one array.

### Use conditions

Each array to be merged has the same number of elements.

### Purpose

- The purpose is to reduce the number of streams.
- Adverse effect
  - Load and store instructions become stride or indirect instructions.
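The idea can be sketched in C as follows. This is a minimal illustration, not the code from the measured sample: the function names, the choice of four arrays, and the flat interleaved layout are assumptions made for this example. Four arrays with the same number of elements are merged so that element `i` of every original array sits next to the others, which turns four access streams into one. It also shows the adverse effect: each original array is now reached with a stride of `NARRAYS` elements.

```c
#include <stddef.h>

#define NARRAYS 4 /* hypothetical: number of arrays being merged */

/* Before: four separate arrays, read as four independent streams. */
double sum_separate(const double *a, const double *b,
                    const double *c, const double *d, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] + b[i] + c[i] + d[i];
    return s;
}

/* Copy the four arrays into one merged array m of n * NARRAYS elements.
 * Element i of a, b, c, d lands in m[NARRAYS*i + 0..3]. */
void merge_arrays(double *m, const double *a, const double *b,
                  const double *c, const double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        m[NARRAYS * i + 0] = a[i];
        m[NARRAYS * i + 1] = b[i];
        m[NARRAYS * i + 2] = c[i];
        m[NARRAYS * i + 3] = d[i];
    }
}

/* After: the same computation reads one contiguous stream. */
double sum_merged(const double *m, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += m[NARRAYS * i + 0] + m[NARRAYS * i + 1]
           + m[NARRAYS * i + 2] + m[NARRAYS * i + 3];
    return s;
}
```

In a real program the data would normally be produced directly in the merged layout; the copy in `merge_arrays` is only there to make the correspondence between the two layouts explicit.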



# Array Merging (Before Improvement)



L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.
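Why a 16-KB spacing is harmful can be sketched with a small set-index calculation. The 16-KB way size is taken from the text above; the 128-byte line size and the 4-way associativity are assumptions for illustration, as are all function names. Addresses that differ by a multiple of the way size map to the same cache set, so once more streams share a set than the set has ways, their lines evict one another.

```c
#include <stddef.h>
#include <stdint.h>

/* Way size taken from the text (conflicts at 16-KB boundaries). */
#define WAY_SIZE  (16u * 1024u)
/* Assumptions for illustration only: line size and associativity. */
#define LINE_SIZE 128u
#define NUM_WAYS  4u

/* Index of the cache set that a byte address maps to. */
unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr % WAY_SIZE) / LINE_SIZE);
}

/* Count how many stream start addresses fall into the same set
 * as the first stream. */
size_t streams_sharing_first_set(const uintptr_t *start, size_t nstreams)
{
    size_t count = 0;
    for (size_t s = 0; s < nstreams; s++)
        if (cache_set_index(start[s]) == cache_set_index(start[0]))
            count++;
    return count;
}

/* Nonzero when more streams compete for one set than the set has ways. */
int may_thrash(const uintptr_t *start, size_t nstreams)
{
    return streams_sharing_first_set(start, nstreams) > NUM_WAYS;
}
```

For eight stream start addresses spaced exactly 16 KB apart, `streams_sharing_first_set` returns 8, which exceeds the assumed four ways. Because all streams advance in step, the conflict repeats at every iteration even though each array is accessed sequentially.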



#### Cache

|                       | L1I miss<br>rate(/Effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | Memory throughput<br>(GB/sec) | L2 throughput<br>(GB/sec) |
|-----------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|-------------------------------------------|-------------------------------|---------------------------|
| Before<br>improvement | 0.00%                                       | 23.21%                                        | 3.12E+09 | 91.66%                         | 8.34%                            | 0.00%                            | 0.00%                                     | 0.00                          | 261.73                    |


The percentage of L1D misses is high and the L1D miss dm percentage is high, despite the fact that the arrays are accessed sequentially. This indicates that L1D cache thrashing has occurred.

# Effects of Array Merging (Source Tuning)

Array merging reduced the number of streams from eight to two, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                       | L1I miss<br>rate(/Effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | Memory<br>throughput<br>(GB/sec) | L2 throughput (GB/sec) |
|-----------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|-------------------------------------------|----------------------------------|------------------------|
| Before<br>improvement | 0.00%                                       | 23.21%                                        | 3.12E+09 | 91.66%                         | 8.34%                            | 0.00%                            | 0.00%                                     | 0.00                             | 261.73                 |
| After<br>improvement  | 0.00%                                       | 3.19%                                         | 4.29E+08 | 25.52%                         | 74.48%                           | 0.00%                            | 0.00%                                     | 0.01                             | 335.65                 |

The percentage of L1D cache misses decreased from 23.21% to 3.19%, and the L1D miss dm percentage decreased too from 91.66% to 25.52%.

## Array Merging (in C Language) (Before Improvement)

L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                       | L1I miss<br>rate(/Effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | Memory throughput<br>(GB/sec) | L2 throughput<br>(GB/sec) |
|-----------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|-------------------------------------------|-------------------------------|---------------------------|
| Before<br>improvement | 0.00%                                       | 21.95%                                        | 2.95E+09 | 91.96%                         | 8.04%                            | 0.00%                            | 0.00%                                     | 0.00                          | 235.97                    |

The percentage of L1D misses is high and the L1D miss dm percentage is high, despite the fact that the arrays are accessed sequentially.

## Effects of Array Merging (in C Language) (Source Tuning)

Array merging reduced the number of streams from eight to two, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



The percentage of L1D cache misses decreased from 21.95% to 3.18%, and the L1D miss dm percentage decreased too from 91.96% to 27.49%.



# Effects of Array Merging (Compiler Options Tuning)

You can achieve effects similar to source tuning by specifying the following compiler options.

| Compiler options                         | Description of function                                                                                                                                                                   |
|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -Karray_merge_common<br>[=*name*]        | Gives an instruction to merge multiple arrays in a common block. You can specify a common block name for *name*. If *name* is omitted, the arrays in all the named common blocks are targets. |
| -Karray_merge_local                      | Gives an instruction to merge multiple local arrays.<br>-Karray_merge_local_size=1000000 is also valid at the same time.                                                                  |
| -Karray_merge                            | This option is equivalent to specifying the -Karray_merge_local and -Karray_merge_common options.                                                                                         |

## Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Karray\_merge\_common

## Notes

- Options must be specified for all source code that uses the target arrays.
- The effects of array merging vary depending on the program.
- Incorrect use may result in different computational results.
- These options cannot be used with debug options (-g and -Haesux).


# Dimensional Displacement of an Array

- What Is Dimensional Displacement of an Array?
- Dimensional Displacement of an Array (Before Improvement)
- Effects of Dimensional Displacement of an Array (Source Tuning)
- Dimensional Displacement of an Array (in C Language) (Before Improvement)
- Effects of Dimensional Displacement of an Array (in C Language) (Source Tuning)
- Effects of Dimensional Displacement of an Array (Compiler Options Tuning)

## What Is Dimensional Displacement of an Array?



Dimensional displacement of an array is a tuning method where multiple streams of the same array become one stream.

### Use conditions

Multiple streams exist in the same array. \* a(1,1,1) to a(1,1,8) are shown as multiple streams.

### Purpose

- The purpose is to reduce the number of streams.
- Adverse effect
  - SIMD optimization of load and store instructions is more difficult.
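A minimal C sketch of the transformation follows. The function names, the two-dimensional shape, and the flat indexing are assumptions made for this example and are not taken from the measured sample. In the "before" layout the eight-element dimension is the slowest varying one, so the loop over `i` touches eight widely separated streams of the same array. After displacement, that dimension becomes the fastest varying one and the eight values are contiguous, which leaves a single stream.

```c
#include <stddef.h>

#define NSTREAM 8 /* extent of the dimension being displaced */

/* Before: layout a[l][i] (l slowest). For each i, the inner loop
 * reads eight addresses that are n elements apart: eight streams. */
double sum_before(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t l = 0; l < NSTREAM; l++)
            s += a[l * n + i];
    return s;
}

/* Rearrange the data from a[l][i] into a[i][l]. */
void displace_dimension(double *dst, const double *src, size_t n)
{
    for (size_t l = 0; l < NSTREAM; l++)
        for (size_t i = 0; i < n; i++)
            dst[i * NSTREAM + l] = src[l * n + i];
}

/* After: layout a[i][l] (l fastest). The same loop nest now walks
 * memory contiguously: one stream. */
double sum_after(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t l = 0; l < NSTREAM; l++)
            s += a[i * NSTREAM + l];
    return s;
}
```

The loop nest and the order of the additions are unchanged; only the declaration and the subscript order differ. In real code the array would be declared in the displaced layout from the start rather than copied.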




## Dimensional Displacement of an Array (Before Improvement)

L1D cache thrashing occurs because each stream of array a is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.




## Effects of Dimensional Displacement of an Array (Source Tuning)

Dimensional displacement of an array reduced the number of streams from eight to one, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                    | L1I miss rate<br>(effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | L2 throughput<br>(GB/sec) | Memory throughput<br>(GB/sec) |
|--------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|-------------------------------------------|---------------------------|-------------------------------|
| Before improvement | 0.00%                                       | 23.26%                                        | 3.13E+09 | 91.57%                         | 8.43%                            | 0.00%                            | 0.00%                                     | 261.00                    | 0.00                          |
| After improvement  | 0.00%                                       | 3.16%                                         | 4.27E+08 | 20.03%                         | 79.97%                           | 0.00%                            | 0.00%                                     | 207.01                    | 0.00                          |

The percentage of L1D misses decreased from 23.26% to 3.16%, and the L1D miss dm percentage decreased too from 91.57% to 20.03%.

### Dimensional Displacement of an Array (in C Language) (Before Improvement)



L1D cache thrashing occurs because each stream of array a is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



### Effects of Dimensional Displacement of an Array (in C Language) (Source Tuning)

Dimensional displacement of an array reduced the number of streams from eight to one, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                    | L1I miss rate<br>(effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf rate(/L1D<br>miss) | L1D miss swpf rate(/L1D<br>miss) | L2 miss rate(/Load-<br>store instruction) | L2 throughput<br>(GB/sec) | Memory<br>throughput<br>(GB/sec) |
|--------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|-------|-------------------------------------------|---------------|----------------------------------|
| Before improvement | 0.00%                                       | 21.92%                                        | 2.95E+09 | 91.93%                         | 8.07%                            | 0.00% | 0.00%                                     | 235.07        | 0.00                             |
| After improvement  | 0.00%                                       | 3.16%                                         | 4.27E+08 | 22.84%                         | 77.16%                           | 0.00% | 0.00%                                     | 199.05        | 0.00                             |

The percentage of L1D misses decreased from 21.92% to 3.16%, and the L1D miss dm percentage decreased too from 91.93% to 22.84%.




## Effects of Dimensional Displacement of an Array (Compiler Options Tuning)

You can achieve effects similar to source tuning by specifying the following compiler options.

| Compiler options                                       | Description of function |
|--------------------------------------------------------|-------------------------|
| -Karray_subscript                                      | Gives an instruction for dimensional displacement of allocatable arrays with 4 or more dimensions and arrays with 4 or more dimensions containing 10 or fewer elements in the final dimension and 100 or more elements in the other dimensions.<br>-Karray_subscript_element=100, -Karray_subscript_elementlast=10, and -Karray_subscript_rank=4 are also valid at the same time. |
| -Karray_subscript_element=N<br>(2≦N≦2,147,483,647)     | Gives an instruction that the number of elements in a dimension other than the final dimension in an array subject to dimensional displacement be N or greater. This option has meaning in cases where the -Karray_subscript option is valid. However, the option has no meaning for an allocatable array. |
| -Karray_subscript_elementlast=N<br>(2≦N≦2,147,483,647) | Gives an instruction that the number of elements in the final dimension of an array subject to dimensional displacement be N or less. This option has meaning in cases where the -Karray_subscript option is valid. However, the option has no meaning for an allocatable array. |
| -Karray_subscript_rank=N<br>(2≦N≦30)                   | Gives an instruction that the number of dimensions of an array subject to dimensional displacement be N or greater. This option has meaning in cases where the -Karray_subscript option is valid. |

### Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Karray\_subscript,array\_subscript\_rank=2,array\_subscript\_element=2

### Notes

- Options must be specified for all source code that uses the target arrays.
- The effects of displacement vary depending on the program.
- Incorrect use may result in different computational results.



# Loop Fission

- Loop Fission (Before Improvement)
- Effects of Loop Fission (Source Tuning)
- Loop Fission (in C Language) (Before Improvement)
- Effects of Loop Fission (in C Language) (Source Tuning)
- Effects of Loop Fission (Optimization Control Line Tuning)

## Loop Fission (Before Improvement)



L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



# Effects of Loop Fission (Source Tuning)



Loop fission reduced the number of streams from eight to four, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
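The reduction from eight streams to four can be sketched in C as follows. The loop body and all names are hypothetical and are not the measured sample; the sketch assumes the two statements are independent and that the arrays do not overlap. The single loop touches eight arrays per iteration, while each of the two loops after fission touches only four.

```c
#include <stddef.h>

/* Before: one loop, eight arrays accessed per iteration (eight streams). */
void update_fused(double *a, const double *b, const double *c, const double *d,
                  double *e, const double *f, const double *g, const double *h,
                  size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i] + d[i];
        e[i] = f[i] + g[i] + h[i];
    }
}

/* After: the loop is split in two, so each loop has four streams.
 * This is only valid because the two statements are independent
 * (assumed here: the arrays do not overlap). */
void update_fissioned(double *a, const double *b, const double *c, const double *d,
                      double *e, const double *f, const double *g, const double *h,
                      size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i] + d[i];
    for (size_t i = 0; i < n; i++)
        e[i] = f[i] + g[i] + h[i];
}
```

Both versions perform the same arithmetic; fission only changes how many arrays compete for the same cache sets at a time, at the cost of running the loop control twice.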



#### Cache

|                    | L1I miss rate<br>(effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf rate(/L1D<br>miss) | L2 miss rate(/Load-store instruction) | L2 throughput<br>(GB/sec) | Memory throughput<br>(GB/sec) |
|--------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|---------------------------------------|---------------------------|-------------------------------|
| Before improvement | 0.00%                                       | 9.03%                                         | 4.74E+07 | 73.50%                         | 26.50%                           | 0.00%                            | 0.00%                                 | 222.75                    | 0.18                          |
| After improvement  | 0.00%                                       | 3.25%                                         | 1.71E+07 | 15.93%                         | 84.07%                           | 0.00%                            | 0.00%                                 | 341.98                    | 0.70                          |

## Loop Fission (in C Language) (Before Improvement)

L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



## Effects of Loop Fission (in C Language) (Source Tuning)

Loop fission reduced the number of streams from eight to four, so L1D cache thrashing was avoided. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                    | L1I miss rate<br>(effective<br>instruction) | L1D miss rate<br>(/Load-store<br>instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf rate(/L1D<br>miss) | L1D miss swpf rate(/L1D<br>miss) | L2 miss rate(/Load-store instruction) | L2 throughput<br>(GB/sec) | Memory throughput<br>(GB/sec) |
|--------------------|---------------------------------------------|-----------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|---------------------------------------|---------------------------|-------------------------------|
| Before improvement | 0.00%                                       | 8.76%                                         | 4.60E+07 | 73.69%                         | 26.31%                           | 0.00%                            | 0.00%                                 | 221.43                    | 0.03                          |
| After improvement  | 0.00%                                       | 3.23%                                         | 1.70E+07 | 20.84%                         | 79.16%                           | 0.00%                            | 0.00%                                 | 353.87                    | 0.02                          |

## Effects of Loop Fission (Optimization Control Line Tuning)



You can achieve effects similar to source tuning by specifying the following optimization control line. The last four columns show the units for which the optimization control line can be specified.

| Optimization control specifier | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|---------|--------------|--------------|----------------|---------------------------------|
| !OCL FISSION_POINT[(<i>n1</i>)]<br>(where <i>n1</i> is a decimal number from 1 to 6) | Gives an instruction for loop fission at the specified point inside a loop. The fission applies to the loop nest to a depth of <i>n1</i> levels, counting from the innermost loop. | No | No | Yes | No |

[Listing: source code after improvement (optimization control line tuning). Arrays a to h are declared real*8 with n elements each in common /com/. The loop information reports SPLIT and SOFTWARE PIPELINING, with a standard iteration count of 381.]



# Padding

- What Is Padding?
- Padding That Increases the Number of Array Elements in the First Dimension
- Padding That Increases the Number of Array Elements in the Second Dimension
- Padding Using a Dummy Array
- Padding Using a Dummy Array (for Arrays of Different Sizes)

# What Is Padding?



Padding inserts a dummy area between arrays or inside an array.

### Use conditions

Multiple streams exist in the same array, or multiple arrays exist.

### Purpose

The purpose is to create a temporary area to shift addresses.

### Adverse effect

The amount of padding must be changed every time that the problem scale changes.

#### Example where multiple streams exist in the same array





# Padding That Increases the Number of Array Elements in the First Dimension

- Padding That Increases the Number of Array Elements in the First Dimension (Before Improvement)
- Effects of Padding That Increases the Number of Array Elements in the First Dimension (Source Tuning)
- Padding That Increases the Number of Array Elements in the First Dimension (in C Language) (Before Improvement)
- Effects of Padding That Increases the Number of Array Elements in the First Dimension (in C Language) (Source Tuning)
- Effects of Padding That Increases the Number of Array Elements in the First Dimension (Compiler Options Tuning)

L1D cache thrashing occurs because each stream of array a is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

|                    | L1I miss rate(/effective instruction) | L1D miss rate(/Load-store instruction) | L1D miss | L1D miss dm rate(/L1D miss) | L1D miss hwpf rate(/L1D miss) | L1D miss swpf rate(/L1D miss) | L2 miss rate(/Load-store instruction) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|--------------------|---------------------------------------|----------------------------------------|----------|-----------------------------|-------------------------------|-------------------------------|---------------------------------------|------------------------|----------------------------|
| Before improvement | 0.00%                                 | 33.19%                                 | 4.47E+09 | 95.12%                      | 4.88%                         | 0.00%                         | 0.00%                                 | 247.59                 | 0.00                       |

Both the L1D cache miss rate and the demand share of L1D cache misses are high, even though the array is accessed sequentially.



L1D cache thrashing was avoided because of padding (+1) of the first dimension of each stream of array a. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

The percentage of L1D misses decreased from 33.19% to 3.27%, and the L1D miss dm percentage also decreased, from 95.12% to 9.46%.

|                    | L1I miss rate(/effective instruction) | L1D miss rate(/Load-store instruction) | L1D miss | L1D miss dm rate(/L1D miss) | L1D miss hwpf rate(/L1D miss) | L1D miss swpf rate(/L1D miss) | L2 miss rate(/Load-store instruction) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|--------------------|---------------------------------------|----------------------------------------|----------|-----------------------------|-------------------------------|-------------------------------|---------------------------------------|------------------------|----------------------------|
| Before improvement | 0.00%                                 | 33.19%                                 | 4.47E+09 | 95.12%                      | 4.88%                         | 0.00%                         | 0.00%                                 | 247.59                 | 0.00                       |
| After improvement  | 0.00%                                 | 3.27%                                  | 4.39E+08 | 9.46%                       | 90.54%                        | 0.00%                         | 0.00%                                 | 421.35                 | 0.01                       |

# Padding That Increases the Number of Array Elements in the First Dimension (in C Language) (Before Improvement)



L1D cache thrashing occurs because each stream of array a is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



# Effects of Padding That Increases the Number of Array Elements in the First Dimension (in C Language) (Source Tuning)



L1D cache thrashing was avoided because of padding (+1) of the first dimension of each stream of array a. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
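A minimal sketch of this padding in C, under assumptions that the slides do not confirm: the array shape (`NK`, `NJ`, `NI`), the helper `window_offset`, and the choice of padded dimension are all hypothetical. C stores arrays in row-major order, so the sketch pads the last (contiguous) index, which plays the role of the first dimension in Fortran. Each plane `a[k]` is treated as one stream.

```c
#include <assert.h>
#include <stddef.h>

#define WAY_BYTES 16384u   /* the 16-KB boundary named in the text */
#define NK 4               /* hypothetical number of streams (planes) */
#define NJ 256             /* hypothetical middle dimension */
#define NI 256             /* hypothetical contiguous dimension */

static double a_before[NK][NJ][NI];      /* planes start 512 KB apart */
static double a_after[NK][NJ][NI + 1];   /* contiguous dimension padded by 1 */

/* Byte distance from plane 0 to plane k, folded into one 16-KB window.
   Equal results for different k mean that the streams compete for the
   same cache positions. */
size_t window_offset(const void *plane0, const void *planek)
{
    const char *p0 = plane0;
    const char *pk = planek;
    assert(pk >= p0);
    return (size_t)(pk - p0) % WAY_BYTES;
}
```

With these sizes, `window_offset(a_before[0], a_before[k])` is 0 for every plane, while the padded planes of `a_after` start 2,048 bytes apart within the window. A different problem size changes these values, which is why the amount of padding has to be revisited when the problem scale changes.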






You can achieve effects similar to source tuning by specifying the following compiler options.

| Compiler options                            | Description of function                                                                                                                                                                                                                                                                       |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -Karraypad_const[=N]<br>(1≦N≦2,147,483,647) | Pads <i>N</i> elements onto each array whose first dimension has an explicit shape specification and whose shape specification expression is a constant expression. If <i>N</i> is omitted, the compiler determines the amount of padding for each target array. The padding creates a gap in the array. |
| -Karraypad_expr=N<br>(1≦N≦2,147,483,647)    | Pads <i>N</i> elements onto each array whose first dimension has an explicit shape specification, regardless of whether its shape specification expression is a constant expression.                                                                                                          |

## Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Karraypad\_expr=1

#### Notes

- Options must be specified for all source code that uses the target arrays.
- The effects of padding vary depending on the program.
- Incorrect use may result in different computational results.
- The -Karraypad\_const[=N] option and the -Karraypad\_expr=N option cannot be specified at the same time.



# Padding That Increases the Number of Array Elements in the Second Dimension

- Case of No Improvement from Padding That Increases the Number of Array Elements in the First Dimension
- Padding That Increases the Number of Array Elements in the Second Dimension
- Padding That Increases the Number of Array Elements in the Second Dimension (Before Improvement)
- Effects of Padding That Increases the Number of Array Elements in the Second Dimension (Source Tuning)





Depending on the array size, there may be no improvement even with padding (+1) of the array elements of the first dimension.




L1D cache thrashing is avoided because padding (+1) of the second dimension causes a shift from the 16-KB boundary.



Thrashing is avoided because the streams are no longer aligned on the 16-KB boundary.
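The arithmetic behind the two padding choices can be sketched as follows. The slides do not give the array shape, so the sketch assumes a hypothetical double-precision Fortran array `a(n1,n2,*)` whose streams are the planes `a(:,:,k)`. The function name `stream_shift` and the sizes in the usage note are also assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define WAY_BYTES 16384u   /* the 16-KB boundary named in the text */

/* Distance in bytes between consecutive streams a(:,:,k) and a(:,:,k+1),
   folded into one 16-KB window. A result of 0 means that every stream
   starts on the same boundary. */
size_t stream_shift(size_t n1, size_t n2)
{
    assert(n1 > 0 && n2 > 0);
    return (n1 * n2 * sizeof(double)) % WAY_BYTES;
}
```

For the hypothetical shape `n1 = 256`, `n2 = 2048`, `stream_shift(256, 2048)` is 0, and padding the first dimension gives `stream_shift(257, 2048)`, which is still 0 because the second dimension alone spans a multiple of 16 KB. Padding the second dimension instead gives `stream_shift(256, 2049)`, which is 2048, so the streams move off the boundary.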

## Padding That Increases the Number of Array Elements in the Second Dimension (Before Improvement)

L1D cache thrashing occurs because each stream of array a is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.






L1D cache thrashing was avoided because of padding (+1) of the second dimension of each stream of array a. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.



#### Cache

The percentage of L1D misses decreased from 26.31% to 6.12%, and the L1D miss dm percentage also decreased, from 92.60% to 52.53%.

|                    | L1I miss rate(/effective instruction) | L1D miss rate(/Load-store instruction) | L1D miss | L1D miss dm rate(/L1D miss) | L1D miss hwpf rate(/L1D miss) | L1D miss swpf rate(/L1D miss) | L2 miss rate(/Load-store instruction) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|--------------------|---------------------------------------|----------------------------------------|----------|-----------------------------|-------------------------------|-------------------------------|---------------------------------------|------------------------|----------------------------|
| Before improvement | 0.00%                                 | 26.31%                                 | 4.42E+08 | 92.60%                      | 7.40%                         | 0.00%                         | 0.00%                                 | 246.39                 | 0.01                       |
| After improvement  | 0.00%                                 | 6.12%                                  | 1.03E+08 | 52.53%                      | 47.47%                        | 0.00%                         | 0.00%                                 | 456.13                 | 0.02                       |




# Padding Using a Dummy Array

- Padding Using a Dummy Array (Before Improvement)
- Effects of Padding Using a Dummy Array (Source Tuning)
- Effects of Padding Using a Dummy Array (Compiler Options Tuning)

### Padding Using a Dummy Array (Before Improvement)



L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



## Effects of Padding Using a Dummy Array (Source Tuning)

L1D cache thrashing was avoided because a dummy array was inserted between arrays to cause a shift from the 16-KB boundary. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
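The effect of a dummy array on the layout can be sketched in C, with a `struct` standing in for a Fortran common block. The array names, the size `N`, the dummy size `PAD`, and the helper `member_window` are all assumptions made for illustration. They are not taken from the original source code.

```c
#include <assert.h>
#include <stddef.h>

#define WAY_BYTES 16384u   /* the 16-KB boundary named in the text */
#define N   2048           /* hypothetical: 2048 doubles = 16 KB per array */
#define PAD 64             /* hypothetical dummy size: 64 doubles = 512 bytes */

/* Before: every array starts on a 16-KB boundary. */
struct before {
    double a[N];
    double b[N];
    double c[N];
};

/* After: a dummy array between the arrays shifts each following array. */
struct after {
    double a[N];
    double dummy1[PAD];
    double b[N];
    double dummy2[PAD];
    double c[N];
};

/* Position of a member inside a 16-KB window, given its byte offset. */
size_t member_window(size_t byte_offset, size_t struct_size)
{
    assert(byte_offset < struct_size);
    return byte_offset % WAY_BYTES;
}
```

Passing `offsetof(struct before, b)` or `offsetof(struct before, c)` yields 0, so all three arrays share the same window position. In `struct after`, `b` lands at 512 and `c` at 1024, so the three streams no longer coincide.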





You can achieve effects similar to source tuning by specifying the following compiler options.

| Compiler options                       | Description of function                                                                                                                                                                                   |
|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -Kcommonpad[=N]<br>(4≦N≦2,147,483,644) | Specifies that a gap be created between areas for variables in a common block to increase the data cache use efficiency. If <i>N</i> is omitted, the compiler automatically determines the optimal value. |

Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Kcommonpad=512


#### Notes

- When a file containing a common block is compiled separately with the -Kcommonpad compiler option, the option must also be specified for every other file that contains a common block of the same name.
- When -Kcommonpad=N is specified for multiple files, the value of N must be the same for all of them.
- If programs compiled with -Kcommonpad use the same common block name but declare different elements in it, the programs may not run correctly.



# Padding Using a Dummy Array (for Arrays of Different Sizes)

- Conflict between Arrays of Different Sizes
- Padding Using a Dummy Array (for Arrays of Different Sizes: Before Improvement)
- Effects of Padding Using a Dummy Array (for Arrays of Different Sizes: Source Tuning)

## Conflict between Arrays of Different Sizes (1/2)



Generally, stationary cache thrashing does not occur for arrays of different sizes.






## Conflict between Arrays of Different Sizes (2/2)


Stationary cache thrashing occurs because an array remains on a 16-KB boundary. This happens even in cases with arrays of different sizes, depending on the array size.




#### Padding Using a Dummy Array (for Arrays of Different Sizes: Before Improvement)



L1D cache thrashing occurs because each array is located on a 16-KB boundary. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.
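The following sketch checks where arrays of different sizes start when they are laid out back to back, as in a common block. The sizes in the usage note come from the declarations in this example's source listing (a to g are 256 × 256, h is 2304 × 256, all `real*8`). The function name `count_on_boundary` and the 8-byte element size written as a constant are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define WAY_BYTES 16384u   /* the 16-KB boundary named in the text */

/* Fills start[i] with the byte offset of array i when arrays with the
   given element counts (8 bytes per element) are laid out contiguously.
   Returns how many of the arrays start on a 16-KB boundary. */
size_t count_on_boundary(const size_t *elems, size_t count, size_t *start)
{
    size_t offset = 0;
    size_t on_boundary = 0;
    assert(elems != NULL && start != NULL);
    for (size_t i = 0; i < count; i++) {
        start[i] = offset;
        if (offset % WAY_BYTES == 0)
            on_boundary++;
        offset += elems[i] * 8u;
    }
    return on_boundary;
}
```

With seven arrays of 65,536 elements followed by one array of 589,824 elements, the function returns 8. Array h has a different size from the others, yet its size is also a multiple of 16 KB, so every array still starts on a 16-KB boundary.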

| Source code before improvement                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                |                                                                                                                                                                                                                                              |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 50subroutine sub()51integer k,l,n,m5216-KB boundary even53parameter(n=256,m=256)when second dimension                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                | [sec]<br>3.5E+00                                                                                                                                                                                                                             |
| 53parameter (n=256,m=256)when second dimension54parameter (k=2304,l=256)incremented5556real*8 a (n,m), b (n,m), c (n,m), d (n,m), e (n,m), f (n,m), g (n,m), h (k,l)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                | 3.0E+00                                                                                                                                                                                                                                      |
| 57<br>58<br>59 common /test/a,b,c,d,e,f,g,h                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                | 2.5E+00                                                                                                                                                                                                                                      |
| <<< Loop-information Start >>><br><<< [PARALLELIZATION]<br><<< Standard iteration count: 2<br><<< Loop-information End >>><br>60 1 pp do j = 1 , m<br><<< Loop-information Start >>><br><<< [OPTIMIZATION]<br><<< SIMD(VL: 4)<br><<< SOFTWARE PIPELINING<br><<< Loop-information End >>><br>61 2 p 4v do i = 1, n<br>62 2 p 4v a(i, j) = b(i, j) + c(i, j) + d(i, j) + e(i, j) + f(i, j) + g(i, j) + h(i, j)<br>63 2 p 4v enddo<br>64 1 p enddo | Chart (axis 5.0E-01 to 2.0E+00): No instruction commit due to L2 access for a floating-point load |

#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.00% | 19.14% | 2.57E+09 | 90.12% | 9.87% | 0.00% | 0.00% | 221.82 | 0.0 |

#### Padding Using a Dummy Array (for Arrays of Different Sizes: Source Tuning)


L1D cache thrashing was avoided because a dummy array was inserted between arrays to cause a shift from the 16-KB boundary. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
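The effect of the dummy array can be checked with simple address arithmetic. The following is a minimal C sketch, not the code from the slide: the 16-KB window comes from the text above, while the array sizes, the 256-byte pad, and the function names `way_offset`, `layout`, and `colliding` are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WAY_BYTES ((size_t)16 * 1024) /* the 16-KB boundary named in the text */

/* Position of a byte address within a 16-KB window. Arrays whose start
   addresses share this value map to the same cache sets. */
size_t way_offset(size_t addr) { return addr % WAY_BYTES; }

/* Lay n arrays out back to back, each followed by pad_bytes of dummy
   space, and record where each one starts (relative to the first). */
void layout(const size_t *array_bytes, size_t n, size_t pad_bytes, size_t *start) {
    size_t pos = 0;
    for (size_t k = 0; k < n; k++) {
        start[k] = pos;
        pos += array_bytes[k] + pad_bytes;
    }
}

/* Count the arrays whose start shares the first array's window offset. */
size_t colliding(const size_t *start, size_t n) {
    assert(n > 0);
    size_t count = 0;
    for (size_t k = 1; k < n; k++)
        if (way_offset(start[k]) == way_offset(start[0]))
            count++;
    return count;
}
```

With three arrays of 16 KB, 32 KB, and 48 KB and no padding, every array starts at the same offset within the 16-KB window. A small pad after each array (256 bytes in this assumed setup) gives each array a different offset.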






# Tuning Approach to Cache Thrashing (Application)




#### (array division +) array merging + dimensional displacement of an array






# Improvement in TLB Thrashing

- What Is TLB Thrashing?
- Padding (Before Improvement)
- Effects of Padding (Source Tuning)
- Effects of Page Size Expansion (lpgparm Command)

## What Is TLB Thrashing?





# Padding (Before Improvement)



TLB thrashing occurs because the page size is 4 MB and each array is located on a 512-MB boundary. Consequently, data access wait is a frequent event.




# Effects of Padding (Source Tuning)



TLB thrashing was avoided because the address of each stream was shifted through padding. As a result, there was improvement in data access wait.



| | L2 throughput (GB/sec) | Memory throughput (GB/sec) | µDTLB miss rate (/Load-store instruction) | mDTLB miss rate (/Load-store instruction) |
|---|---|---|---|---|
| Before improvement | 163.32 | 20.79 | 28.39728% | 12.12630% |
| After improvement | 272.45 | 42.11 | 0.01764% | 0.00023% |


## Effects of Page Size Expansion (lpgparm Command)

TLB thrashing was avoided because the page size was expanded to 32 MB by the lpgparm command. As a result, there was improvement in data access wait.
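One way to see why a larger page size helps is to count the pages, and therefore the address translations, needed to cover a region. The C sketch below is only arithmetic under stated assumptions: the 4-MB and 32-MB page sizes come from the text, the 512-MB region is a hypothetical size borrowed from the boundary mentioned earlier, and `pages_needed` is an illustrative name. It does not model the actual TLB structure, which the text does not describe.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages (and so address translations) needed to cover
   region_bytes, rounding up to whole pages. */
size_t pages_needed(size_t region_bytes, size_t page_bytes) {
    assert(page_bytes > 0);
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

For the assumed 512-MB region, 4-MB pages require 128 translations, while 32-MB pages require 16.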






# Improvement in Data Access Wait (Increase in Data Locality)

- What Is Data Locality?
- Strip Mining
- Loop Blocking
- Sector Cache
- Loop Interchange
- Loop Fusion
- Array Merging (Indirect Access)


## What Is Data Locality?



Data locality means repeated access to data that is already loaded in the cache. Higher data locality reduces memory access load, resulting in an improvement in data access wait.






# Strip Mining

- What Is Strip Mining?
- Strip Mining (Before Improvement)
- Effects of Strip Mining (Source Tuning)

# What Is Strip Mining?



Strip mining is a technique for increasing cache efficiency. It splits two loops at the same nesting level into blocks (strips) and executes the two loops alternately, one block at a time.
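The transformation can be sketched in C as follows. This is a minimal illustration, not the code from the slides: the loop bodies, the function names `two_passes` and `two_passes_strip`, and the `strip` parameter are assumptions. The rewrite is valid here because iteration i of the second loop depends only on `a[i]`.

```c
#include <assert.h>
#include <stddef.h>

/* Before: two loops at the same level. When n is large, the start of a[]
   may already have left the cache by the time loop 2 reads it. */
void two_passes(double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i] * 2.0; /* loop 1 */
    for (size_t i = 0; i < n; i++) c[i] = a[i] + 1.0; /* loop 2 */
}

/* After: the two loops alternate strip by strip, so loop 2 reads each
   strip of a[] right after loop 1 has written it. */
void two_passes_strip(double *a, const double *b, double *c, size_t n, size_t strip) {
    assert(strip > 0);
    for (size_t s = 0; s < n; s += strip) {
        size_t end = (s + strip < n) ? s + strip : n;
        for (size_t i = s; i < end; i++) a[i] = b[i] * 2.0; /* loop 1, one strip */
        for (size_t i = s; i < end; i++) c[i] = a[i] + 1.0; /* loop 2, same strip */
    }
}
```

The strip length would be chosen so that one strip of the arrays involved fits in the cache; a suitable value depends on the cache size and is not given in the text.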



# Strip Mining (Before Improvement)



Not all array data can be loaded in cache because loop 1 has many iterations, so loop 2 cannot reuse the data. Consequently the following is a frequent event: No instruction commit due to memory and cache busy.



## Effects of Strip Mining (Source Tuning)



Strip mining increases cache efficiency, which improves the following event: No instruction commit due to memory and cache busy.





# Loop Blocking

- What Is Loop Blocking?

- Loop Blocking (Before Improvement)
- Effects of Loop Blocking (Source Tuning)

# Loop Blocking (1/3)



Loop blocking is a technique for increasing cache efficiency. This technique divides the loops in the source code into blocks of the specified size before execution.



# Loop Blocking (2/3)

#### Array access (before improvement)

Memory is accessed every time i is updated because of the stride access of array a. As a result, the data loaded in the cache by the access to a(1,1) is forced out before a(2,1) is accessed.





# Loop Blocking (3/3)

#### Array access (after improvement) with a block size of 96 x 16

Loop blocking causes block-by-block access.

This results in a cache hit during access of a(2,1) and increased cache efficiency.
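Block-by-block access can be sketched in C as below. This is an illustration under assumptions, not the loop from the slides: the loop body, the function names, and the parameters `bi` and `bj` are hypothetical. C stores arrays in row-major order, so the strided operand here is the one indexed `[j * n + i]`, the reverse of the Fortran column-major case described above.

```c
#include <assert.h>
#include <stddef.h>

/* Unblocked: b is read with a stride of n elements in the inner loop,
   so each access touches a different cache line. */
void add_transposed(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = a[i * n + j] + b[j * n + i];
}

/* Blocked: the i and j ranges are cut into bi x bj tiles. Within a tile,
   the lines of b fetched for one i are reused for the next i. */
void add_transposed_blocked(double *c, const double *a, const double *b,
                            size_t n, size_t bi, size_t bj) {
    assert(bi > 0 && bj > 0);
    for (size_t ii = 0; ii < n; ii += bi) {
        size_t imax = (ii + bi < n) ? ii + bi : n;
        for (size_t jj = 0; jj < n; jj += bj) {
            size_t jmax = (jj + bj < n) ? jj + bj : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    c[i * n + j] = a[i * n + j] + b[j * n + i];
        }
    }
}
```

The slide uses a block size of 96 x 16; in this sketch the tile dimensions are left as parameters because the best values depend on the cache and on the array layout.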



# Loop Blocking (Before Improvement)



Cache use efficiency decreases because of stride access of array a, and the following event occurs: No instruction commit due to memory access for a floating-point load instruction.





iteration, but the data is already forced out by the time of the next j iteration. Consequently, a cache miss occurs.



#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.09% | 51.31% | 1.28E+09 | 96.98% | 3.02% | 0.00% | 51.34% | 1.28E+09 | 106.04 | 109.34 |

The percentages of L1D misses and L2 misses are high at 51%.

# Effects of Loop Blocking (Source Tuning)

Reuse of data in array a through loop blocking increases cache efficiency, which improves the following event: No instruction commit due to memory access for a floating-point load instruction.



#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.09% | 51.31% | 1.28E+09 | 96.98% | 3.02% | 0.00% | 51.34% | 1.28E+09 | 106.04 | 109.34 |
| After improvement | 0.01% | 6.69% | 1.70E+08 | 95.65% | 4.35% | 0.00% | 6.12% | 1.56E+08 | 97.61 | 111.78 |

The percentages of L1D misses and L2 misses decreased significantly.



# Sector Cache

- What Is a Sector Cache?
- Overview of Sector Cache Capacity Control
- Conceptual Diagram of Actual Operation
- How to Use a Sector Cache
- Sector Cache Improvement Example
- Sector Cache: Case Example 1
- Sector Cache: Case Example 2

# What Is a Sector Cache?



A sector cache is a cache mechanism that can prevent reusable data from being forced out of the cache by non-reusable data. This mechanism enables applications to divide the cache into two parts (sector 0 and sector 1) and use them. (Reused arrays use sector 1, and the others use sector 0.)



## Overview of Sector Cache Capacity Control (1/2)

- A cache miss is an opportunity to adjust the capacity. The capacity is not forcibly adjusted.
  - If the capacity of a sector is less than the capacity specified in the control register, the number of ways is increased until the capacity is reached.



Even if the sector of a cache miss has a greater capacity than specified, the capacity does not decrease.



## Overview of Sector Cache Capacity Control (2/2)

- The number of ways of a sector may exceed the specified capacity.

#### 1. Case with a free way



#### 2. Case of a hit in access of another sector



#### Conceptual Diagram of Actual Operation

(Sector 0: 7 Ways; Sector 1: 3 Ways)



## How to Use a Sector Cache (1/2)



#### Sector cache: Pseudo local memory

Software can use sectors effectively according to the reusability of data.

- Reused arrays: sector 1 is used
- Other data: sector 0 is used
- Data on sector 1 is not forced out by other data.
- The user can specify in a directive line that the array be in sector 1.



Example of using compiler directive lines for sector cache specification

#### <Purpose>

The purpose is to prevent array a, which has reusability, from being forced out of the cache by access to arrays b and c in a loop.

## How to Use a Sector Cache (2/2)



To use a sector cache, specify the following optimization control lines. The last four columns show the units in which each optimization control line can be specified.

| Optimization control specifiers | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|---|---|---|---|---|---|
| CACHE_SECTOR_SIZE<br>(l1_n1,l1_n2,l2_n1,l2_n2)<br>END_CACHE_SECTOR_SIZE | Specifies the maximum numbers of ways of sector 0 and ways of sector 1 in the L1 cache and L2 cache. | Yes | No | Yes | No |
| CACHE_SECTOR_SIZE<br>(l2_n1,l2_n2)<br>END_CACHE_SECTOR_SIZE | Specifies the maximum number of ways of sector 0 and the maximum number of ways of sector 1 in the L2 cache. | Yes | No | Yes | No |
| CACHE_SUBSECTOR_ASSIGN(array1[,array2])<br>END_CACHE_SUBSECTOR | Specifies the array to place in sector 1 of the cache. | Yes | No | Yes | No |

## Sector Cache Improvement Example (1/2)

In this example, reusable data in array b is forced out of the cache, resulting in a cache miss. The descriptions assume a 6-MB/12-way L2 cache.




## Sector Cache Improvement Example (2/2)

The following example shows a way to prevent reusable data in array b from being forced out of the cache.

Conceptual diagrams of L2 cache (6-MB/12-way) states




#### Sector Cache: Case Example 1 (Before Improvement)

Data in array b cannot be reused because it has been forced out of the cache. Consequently, the following is a frequent event: No instruction commit due to memory and cache busy.



#### Sector Cache: Case Example 1 (After Improvement)


Placing array b in sector 1 increases cache efficiency, which improves the following event: No instruction commit due to memory and cache busy.



#### Sector Cache: Case Example 2 (Before Improvement)



Data in array u cannot be reused because it has been forced out of the cache. Consequently, the following is a frequent event: No instruction commit due to memory and cache busy.



#### Sector Cache: Case Example 2 (After Improvement)



Placing part of the k dimension of array u in sector 1 increases cache efficiency, which improves the following event: No instruction commit due to memory and cache busy.



#### Sector Cache: Case Example 2 (Cyclic Distribution)

In this case example, the schedule(static,1) specification divides the array into smaller parts to which cache memory is cyclically allocated. Then, parallel execution is performed. The effect of this technique can be equivalent to a sector cache.
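As a rough illustration of the cyclic distribution, the sketch below computes which thread executes a given iteration under an OpenMP `schedule(static, chunk)` clause, where chunks are handed out round-robin in thread-number order. The function name `owner_thread` and the zero-based iteration numbering are assumptions made for this sketch; they do not come from the case example.

```c
#include <stddef.h>

/* Thread that executes zero-based iteration `i` under schedule(static, chunk):
 * consecutive chunks of `chunk` iterations are dealt out round-robin
 * in thread-number order. (Hypothetical helper for illustration.) */
int owner_thread(size_t i, size_t chunk, int nthreads)
{
    return (int)((i / chunk) % (size_t)nthreads);
}
```

With `chunk` set to 1, neighboring iterations go to different threads, so each thread works on many small, interleaved parts of the array rather than on one large contiguous block.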





# Loop Interchange

- Loop Interchange (Before Improvement)
- Contents of Loop Interchange Tuning
- Effects of Loop Interchange (Source Tuning)

## Loop Interchange (Before Improvement)

Cache use efficiency decreases because of stride access of arrays b, c, and d. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



## Contents of Loop Interchange Tuning





## Effects of Loop Interchange (Source Tuning)

Cache efficiency increases because of sequential access of an array through loop interchange. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
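A minimal C sketch of the idea follows. It is not the source from the slide; the array names, the size `N`, and the function names are assumptions. Note that C stores arrays in row-major order, so the sequential direction is the last index, which is the reverse of Fortran.

```c
#include <stddef.h>

enum { N = 256 };  /* hypothetical problem size */

/* Before: the inner loop walks the first index, so consecutive
 * accesses are N doubles apart (stride access in row-major C). */
void add_strided(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = b[i][j] + c[i][j];
}

/* After loop interchange: the inner loop walks the last index,
 * so each cache line is fully used before the next one is loaded. */
void add_sequential(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = b[i][j] + c[i][j];
}
```

Both functions compute the same result; only the order in which memory is touched differs. The interchange is valid here because no iteration depends on another.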





# Loop Fusion

- Loop Fusion (Before Improvement)
- Effects of Loop Fusion (Source Tuning)
- Loop Fusion (in C Language) (Before Improvement)
- Effects of Loop Fusion (in C Language) (Source Tuning)

## Loop Fusion (Before Improvement)


Not all array data can be loaded in cache because loop 1 has many iterations, so loop 2 cannot reuse the data. Consequently, the following is a frequent event: No instruction commit due to memory and cache busy.



## Effects of Loop Fusion (Source Tuning)

Loop fusion increases cache efficiency, which improves the following event: No instruction commit due to memory and cache busy.



### Loop Fusion (in C Language) (Before Improvement)

Not all array data can be loaded in cache because loop 1 has many iterations, so loop 2 cannot reuse the data.

Consequently, the following is a frequent event: No instruction commit due to memory and cache busy.



### Effects of Loop Fusion (in C Language) (Source Tuning)

Loop fusion increases cache efficiency, which improves the following event: No instruction commit due to memory and cache busy.
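The transformation can be sketched as follows. This is an illustrative example rather than the code measured on the slide; the function names, the arrays, and the assumption that `a`, `b`, and `c` do not overlap are all hypothetical.

```c
#include <stddef.h>

/* Before: two passes. When n is large, elements of a[] and b[] loaded
 * in loop 1 have already left the cache by the time loop 2 needs them. */
void two_loops(size_t n, const double *a, double *b, double *c, double s)
{
    for (size_t i = 0; i < n; i++)      /* loop 1 */
        b[i] = a[i] * s;
    for (size_t i = 0; i < n; i++)      /* loop 2 */
        c[i] = a[i] + b[i];
}

/* After loop fusion: a[i] and b[i] are reused while still in cache
 * (or in registers), so each element is brought in only once. */
void fused_loop(size_t n, const double *a, double *b, double *c, double s)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * s;
        c[i] = a[i] + b[i];
    }
}
```

Fusion is legal here because iteration i of loop 2 depends only on iteration i of loop 1.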





# Array Merging (Indirect Access)

- What Is Array Merging?
- Array Merging (Before Improvement)
- Effects of Array Merging (Source Tuning)
- Array Merging (in C Language) (Before Improvement)
- Effects of Array Merging (in C Language) (Source Tuning)

## What Is Array Merging?



Array merging is the merging of multiple arrays into one array. These multiple arrays are processed in the same loop and have a common access pattern. This technique realizes sequential data access and increases cache efficiency.

| Source code before improvement | After improvement (appearance after compiler optimization) |
|---|---|
| parameter(n=1000000) | parameter(n=100000) |
| real*8 a(n), b(n), c(n) | real*8 abc(3, n) |
| integer d(n+10) | integer d(n+10) |
| : | : |
| do iter = 1, 100 | do iter = 1, 100 |
| do i = 1 , n | do i = 1 , n |
| a(d(i)) = b(d(i)) + scalar * c(d(i)) | abc(1, d(i)) = abc(2, d(i)) + scalar * abc(3, d(i)) |
| enddo | enddo |
| enddo | enddo |
| : | : |

| | Before improvement | After improvement |
|---|---|---|
| Data in the L1D cache | a(d(i)), a(d(i+1)), b(d(i)), b(d(i+1)), c(d(i)), c(d(i+1)) | abc(1, d(i)), abc(2, d(i)), abc(3, d(i)), abc(1, d(i+1)), abc(2, d(i+1)) |
| Access pattern | Access of different cache lines | Access of same cache line |
|                                      |             |                                                            |                                    |  |  |  |

## Array Merging (Before Improvement)



Cache use efficiency decreases because of a high percentage of L1D misses (list access). Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.

| Source code before improvement |
|---|
| `parameter(n=2*1000*1000/8)`<br>`real*8 a(n),b(n),c(n),e(n),f(n),s`<br>`integer d(n)`<br>`:`<br>`call sub(a,b,c,d,e,f,s,n)`<br>`:`<br>`subroutine sub(a,b,c,d,e,f,s,n)`<br>`real*8 a(n),b(n),c(n),e(n),f(n),s`<br>`integer d(n), ii`<br><br>`!$omp parallel do schedule (static,96)`<br>`do i = 1 , n`<br>`ii = d(i)`<br>`a(ii) = s / (s + f(ii) / (s + e(ii) / (b(ii) + s / c(ii))))`<br>`enddo`<br>`!$omp end parallel do`<br>`:` |

Compiler loop information for the DO loop: SOFTWARE PIPELINING.

Arrays a, f, e, b, and c are accessed indirectly through the list d (ii = d(i)).

[Figure: execution time in seconds before improvement, with a callout on the time for no instruction commit due to L2 access for a floating-point load instruction]

#### Cache

The percentage of L1D misses is high at 78.45%.

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) |
|---|---|---|---|---|
| Before improvement | 0.00% | 78.45% | 1.27E+09 | 100.00% |

## Effects of Array Merging (Source Tuning)

Array merging for list access increases cache efficiency, which improves the following event: No instruction commit due to L2 access for a floating-point load instruction.



The percentage of L1D misses decreased significantly.

#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.00% | 78.45% | 1.27E+09 | 100.00% | 0.00% | 0.00% | 0.00% | 6.70E+04 | 649.65 | 0.06 |
| After improvement | 0.00% | 18.01% | 2.90E+08 | 99.99% | 0.01% | 0.00% | 0.00% | 1.52E+04 | 194.49 | 0.03 |


### Array Merging (in C Language) (Before Improvement)



Cache use efficiency decreases because of a high percentage of L1D misses (list access). Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.




### Effects of Array Merging (in C Language) (Source Tuning)

Array merging for list access increases cache efficiency, which improves the following event: No instruction commit due to L2 access for a floating-point load instruction.



The percentage of L1D misses decreased significantly.

#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.00% | 70.79% | 1.28E+08 | 100.00% | 0.00% | 0.00% | 0.02% | 4.12E+04 | 552.20 | 0.34 |
| After improvement | 0.00% | 12.41% | 2.90E+07 | 99.99% | 0.01% | 0.00% | 0.01% | 1.36E+04 | 144.88 | 0.13 |
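In C, array merging is commonly written as an array of structures. The sketch below mirrors the Fortran statement `a(d(i)) = b(d(i)) + scalar * c(d(i))` shown earlier; it is not the C source measured on the slide, and the type name `abc_t` and the function names are assumptions.

```c
#include <stddef.h>

/* Before: three separate arrays. For one list index k = d[i],
 * a[k], b[k], and c[k] lie in three different cache lines. */
void update_separate(size_t n, const int *d,
                     double *a, const double *b, const double *c, double s)
{
    for (size_t i = 0; i < n; i++) {
        int k = d[i];
        a[k] = b[k] + s * c[k];
    }
}

/* After array merging: the three values for one index are adjacent
 * in memory (24 bytes), so one list access usually touches one line. */
typedef struct { double a, b, c; } abc_t;

void update_merged(size_t n, const int *d, abc_t *abc, double s)
{
    for (size_t i = 0; i < n; i++) {
        int k = d[i];
        abc[k].a = abc[k].b + s * abc[k].c;
    }
}
```

The merged layout pays off only when the arrays are always accessed together with the same index, as in this loop.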




## Improvement in Data Access Wait (Latency Concealment)

- What Is Latency Concealment?
- Indirect Access Prefetching
- Prefetching for an Outer Loop

## What Is Latency Concealment?



Latency concealment means concealing the latency of data access (the period of time from a data transfer request to its acknowledgement) by prefetching data. There are three types of data access: L1D cache access, L2 cache access, and memory access. For L2 cache access and memory access among these types, this section discusses only the latency visible as execution time.

For the latency time of each data access type, see the LMbench results below.



#### Results of data access latency measurement with LMbench



# Indirect Access Prefetching

- Indirect Access Prefetching (Before Improvement)
- Effects of Indirect Access Prefetching (Optimization Control Line Tuning)
- Effects of Indirect Access Prefetching (Optimization Control Line)
- Effects of Indirect Access Prefetching (Compiler Options Tuning)

### Indirect Access Prefetching (Before Improvement)

Indirect access (list access) with the recommended options does not generate a prefetch. Also, memory access latency is visible. Consequently, the following is a frequent event: No instruction commit due to memory access for a floating-point load instruction.



#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 miss dm rate (/L2 miss) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.01% | 42.97% | 1.94E+09 | 94.20% | 0.00% | 5.80% | 12.48% | 5.63E+08 | 48.76% | 111.02 | 35.47 |

The L1D miss dm percentage and L2 miss dm percentage are high, and prefetching is not effective. Performance may increase because [...] and L2 throughput [...].

#### Effects of Indirect Access Prefetching (Optimization Control Line Tuning)



Specification of the **prefetch specifier** generates a prefetch for indirect access (list access). This results in improvement of the following event: No instruction commit due to memory access for a floating-point load instruction.



#### Cache

| | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss | L2 miss dm rate (/L2 miss) | L2 throughput (GB/sec) | Memory throughput (GB/sec) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Before improvement | 0.01% | 42.97% | 1.94E+09 | 94.20% | 0.00% | 5.80% | 12.48% | 5.63E+08 | 48.76% | 111.02 | 35.47 |
| After improvement | 0.00% | 38.12% | 3.09E+09 | 52.34% | 0.00% | 47.66% | 8.84% | 7.16E+08 | 5.34% | 404.14 | 101.08 |

The generation of prefetch instructions for indirect access (arrays b and c) reduced the L1D miss dm percentage and L2 miss dm percentage.
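To make the mechanism concrete, here is a hand-written C sketch of what a prefetch for indirect access does. It uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in; it is not the code that the Fujitsu compiler generates, and the function name, the arrays `b` and `c` as used here, and the prefetch distance are assumptions.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in iterations. A suitable value
 * depends on the loop body and on the memory latency. */
enum { PREFETCH_DISTANCE = 16 };

/* Sum of b[idx[i]] * c[idx[i]] with software prefetch of the
 * indirectly accessed elements a fixed number of iterations ahead. */
double sum_indirect(size_t n, const int *idx, const double *b, const double *c)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* idx[] itself is accessed sequentially, so reading it
             * ahead is cheap; the targets in b[] and c[] are not. */
            int ahead = idx[i + PREFETCH_DISTANCE];
            __builtin_prefetch(&b[ahead], 0, 3);  /* 0 = prefetch for read */
            __builtin_prefetch(&c[ahead], 0, 3);
        }
        sum += b[idx[i]] * c[idx[i]];
    }
    return sum;
}
```

The prefetches do not change the result. They only request the cache lines early so that the later demand loads are more likely to hit.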

#### Indirect Access Prefetching (Optimization Control Line)



Here, specify the following optimization control line. The last four columns show the units in which the optimization control line can be specified.

| Optimization control specifiers | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|---|---|---|---|---|---|
| prefetch | Enables the automatic prefetch function of the compiler. | Yes | Yes | No | No |

The prefetch optimization control specifier is equivalent to specifying the following compiler options:

-Kprefetch\_sequential,prefetch\_stride,prefetch\_indirect,prefetch\_conditional,prefetch\_cache\_level=all

#### Notes

Depending on the cache efficiency of loops, whether branching exists, and the complexity of subscripts, prefetching with the compiler options -Kprefetch\_sequential, -Kprefetch\_stride, -Kprefetch\_indirect, or -Kprefetch\_conditional enabled may degrade execution performance.

#### Effects of Indirect Access Prefetching (Compiler Options Tuning)

You can achieve effects similar to optimization control line tuning by specifying the following compiler options.

| Compiler options    | Description of function                                                                                                                                                                                                                                                            |
|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -Kprefetch_indirect | Gives an instruction on whether to generate an object that uses a prefetch<br>instruction for indirectly accessed (list access) array data used inside a loop.<br>This option has meaning in cases where -O1 or a higher option is valid.<br>The default is -Kprefetch_noindirect. |

#### Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Kprefetch\_indirect

#### Notes

Depending on the cache efficiency of loops, whether IF constructs are used, and the complexity of subscripts, prefetching may not have the intended effect.



# Prefetching for an Outer Loop

- Prefetching for an Outer Loop (Before Improvement)
- Effects of Prefetching for an Outer Loop (Optimization Control Line Tuning)
- Use of software prefetch

### Prefetching for an Outer Loop (Before Improvement)

The innermost loop has only a few iterations, and the array size is greater than the number of iterations. For this reason, the cost at the prefetching rise time (the startup period before prefetched data arrives) is visible with normal prefetching. Consequently, the following is a frequent event: No instruction commit due to L2 access for a floating-point load instruction.



#### Effects of Prefetching for an Outer Loop (Optimization Control Line Tuning)

To conceal the cost at the prefetching rise time, the PREFETCH\_READ and PREFETCH\_WRITE specifiers were used to generate a prefetch for the arrays in an outer loop. This results in improvement of the following event: No instruction commit due to L2 access for a floating-point load instruction.
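The following C sketch shows the general shape of prefetching from an outer loop. It uses the GCC/Clang builtin `__builtin_prefetch`, which is only loosely analogous to the PREFETCH\_READ and PREFETCH\_WRITE specifiers; the array shapes, the constants `LD`, `INNER`, and `ROWS_AHEAD`, and the function name are assumptions, not the source from the slide.

```c
#include <stddef.h>

enum {
    LD = 64,         /* hypothetical row length of the arrays     */
    INNER = 8,       /* hypothetical (short) inner trip count     */
    ROWS_AHEAD = 4   /* hypothetical prefetch distance, in rows   */
};

/* The inner loop touches only the first INNER elements of each row,
 * so an inner-loop prefetch never gets far enough ahead. Instead,
 * the outer loop requests the rows that will be needed later. */
void scale_rows(size_t nrows, double a[][LD], double b[][LD], double s)
{
    for (size_t j = 0; j < nrows; j++) {
        if (j + ROWS_AHEAD < nrows) {
            __builtin_prefetch(&b[j + ROWS_AHEAD][0], 0, 3);  /* for read  */
            __builtin_prefetch(&a[j + ROWS_AHEAD][0], 1, 3);  /* for write */
        }
        for (size_t i = 0; i < INNER; i++)
            a[j][i] = s * b[j][i];
    }
}
```

Because the prefetch is issued several outer iterations early, the data has time to arrive before the short inner loop for that row starts.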



## Use of software prefetch



With sequential access, hardware prefetching may not be effective even if the compiler option -Kprefetch\_sequential=auto is in effect.

When the L1D miss dm rate or L2 miss dm rate is high, performance may improve with -Kprefetch\_sequential=soft specified (software prefetch is then used).



| Compiler options | Description of function |
|---|---|
| -Kprefetch_sequential=auto | The compiler automatically selects whether to use hardware prefetch or to create prefetch instructions for array data that is accessed sequentially within a loop.<br>-Kprefetch_sequential=auto is effective only when the -O1 option or higher is set.<br>The default when the -O2 option or higher is set is -Kprefetch_sequential=auto. |
| -Kprefetch_sequential=soft | The compiler does not use hardware prefetch, but rather creates prefetch instructions for array data that is accessed sequentially within a loop.<br>-Kprefetch_sequential=soft is effective only when the -O1 option or higher is set. |
| -Kprefetch_nosequential | Prefetch instructions are not generated for array data that is accessed sequentially within a loop.<br>The default when the -O0 or -O1 option is set is -Kprefetch_nosequential. |



## Improvement in Data Access Wait (Reduced Amount of Access)

- Memory Throughput and Amount of Memory Access
- High-speed Store (XFILL)



## Memory Throughput and Amount of Memory Access

### Memory Throughput and Amount of Memory Access

Amount of memory access = (number of L2 cache misses + number of L2 writebacks) × 256 bytes (cache line size)

The performance of a program with a memory throughput bottleneck does not improve unless the program is tuned to reduce the amount of memory access, for example by reducing the number of L2 cache misses.





# High-speed Store (XFILL)

- What Is High-speed Store (XFILL)?
- XFILL (Before Improvement)
- Effects of XFILL (Optimization Control Line Tuning)
- Effects of XFILL (Compiler Options Tuning)


## What Is High-speed Store (XFILL)?

XFILL reserves a cache line for store operations without reading the line from memory, so the reserved line initially holds undefined contents. This reduces the number of cache lines read from memory and improves the performance of a program with a memory throughput bottleneck.

- Operating conditions
  - The array that is the store target has no dependency between iterations.
  - The array being defined is not referenced in the loop.
  - Memory is accessed contiguously.








### XFILL (Before Improvement)



Memory throughput is a bottleneck because the program places a heavy load on memory access. Consequently, data access wait occurs frequently.



#### Cache

|                    | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss  | L2 throughput | Memory throughput (GB/sec) |
|--------------------|----------------------------------------|-----------------------------------------|----------|------------------------------|--------------------------------|--------------------------------|----------------------------------------|----------|---------------|----------------------------|
| Before improvement | 0.01%                                  | 3.13%                                   | 9.38E+07 | 0.79%                        | 99.21%                         | 0.00%                          | 3.13%                                  | 9.39E+07 | 106.37        | 141.89                     |

Memory throughput is a bottleneck.

### Effects of XFILL (Optimization Control Line Tuning)

Specifying the XFILL specifier eliminated the reads of cache lines from memory that store instructions would otherwise cause. This reduced the number of L2 misses. As a result, data access wait improved.



#### Cache

|                    | L1I miss rate (/effective instruction) | L1D miss rate (/Load-store instruction) | L1D miss | L1D miss dm rate (/L1D miss) | L1D miss hwpf rate (/L1D miss) | L1D miss swpf rate (/L1D miss) | L2 miss rate (/Load-store instruction) | L2 miss  | L2 throughput | Memory throughput (GB/sec) |
|--------------------|----------------------------------------|-----------------------------------------|----------|------------------------------|--------------------------------|--------------------------------|----------------------------------------|----------|---------------|----------------------------|
| Before improvement | 0.01%                                  | 3.13%                                   | 9.38E+07 | 0.79%                        | 99.21%                         | 0.00%                          | 3.13%                                  | 9.39E+07 | 106.37        | 141.89                     |
| After improvement  | 0.01%                                  | 3.09%                                   | 9.38E+07 | 29.49%                       | 65.74%                         | 4.77%                          | 2.06%                                  | 6.26E+07 | 158.73        | 158.85                     |
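
As a rough check, the L2 miss counts in the table can be converted into bytes read from memory with the formula for the amount of memory access (L2 misses × 256-byte line size). The table does not report L2 writebacks, so that term is left out here; this is an assumption, and the results are therefore lower bounds on the actual memory traffic.

```latex
\begin{aligned}
\text{Before improvement: } & 9.39\times10^{7} \times 256\ \text{bytes} \approx 2.40\times10^{10}\ \text{bytes} \\
\text{After improvement: }  & 6.26\times10^{7} \times 256\ \text{bytes} \approx 1.60\times10^{10}\ \text{bytes} \\
\text{Ratio (after/before): } & \frac{6.26\times10^{7}}{9.39\times10^{7}} \approx 0.67
\end{aligned}
```

Under this assumption, the data read from memory because of L2 misses falls by about one third.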

### XFILL (Optimization Control Line Tuning)



Here, specify the following optimization control line. The last four columns show the units in which the optimization control line can be specified.

| Optimization control specifier | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|---------|--------------|--------------|----------------|---------------------------------|
| XFILL[(m1)] | Gives an instruction to generate an XFILL instruction. m1 is a decimal number in the range 1 to 100 that indicates the number of cache lines. | No | Yes | No | Yes |
| NOXFILL | Gives an instruction not to generate an XFILL instruction. | No | Yes | No | Yes |

#### Notes

- The XFILL instruction is output for array data that is stored in a loop. However, it is not output for arrays referenced in the same loop, arrays accessed non-sequentially, or arrays stored inside an IF construct.
- No prefetch instruction is output to the L2 cache when the XFILL instruction is output.
- The following optimization methods cannot be applied because loops are transformed to always store the cache lines reserved by the XFILL instructions. For this reason, execution performance may deteriorate.
  - Loop unrolling
  - Loop striping
- Execution performance may also deteriorate in the following case:
  - Loops with few iterations
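
The following is a minimal sketch of a loop that meets the operating conditions and is marked with the XFILL specifier. The program name, array names, and sizes are hypothetical, and the `!OCL` prefix is assumed to be the optimization control line syntax; a compiler that does not recognize it treats the line as an ordinary comment.

```fortran
! Minimal sketch (hypothetical names): a store-only loop marked with XFILL.
program xfill_sketch
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: a(:), b(:), c(:)
  integer :: i

  allocate(a(n), b(n), c(n))
  b = 1.0d0
  c = 2.0d0

  ! a is only stored in this loop (never read), has no dependency
  ! between iterations, and is accessed contiguously.
  ! Assumption: the optimization control line is written as "!OCL ...".
!OCL XFILL
  do i = 1, n
     a(i) = b(i) + c(i)
  end do

  print *, a(1), a(n)
  deallocate(a, b, c)
end program xfill_sketch
```

Because the loop is long and store-only for `a`, it avoids the cases listed in the notes (arrays referenced in the same loop, stores inside an IF construct, loops with few iterations).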



You can achieve effects similar to optimization control line tuning by specifying the following compiler options.

| Compiler options                         | Description of function                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -K{ XFILL[=N] \| NOXFILL }<br>1 ≤ N ≤ 100 | For array data that is only written in a loop, gives an instruction to generate an instruction (XFILL instruction) that reserves a cache line for cache writing without loading data from memory.<br>N specifies the data that is N cache lines away as the target of the XFILL instruction.<br>You can specify a value in the range 1 to 100 for N. If N is omitted, the compiler automatically determines a value.<br>This option takes effect only when -O2 or higher is in effect. The default is -KNOXFILL. |

#### Usage example (applied to the source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -KXFILL



## Improvement in Operation Wait (Instruction Scheduling Improvement)

- Factors Hindering Instruction Scheduling
- Hindering Factor: Improvement of a Loop Containing an IF Construct
- Hindering Factor: Improvement in Data Dependency
- Hindering Factor: Improvement of a Loop with a Few Iterations

### Factors Hindering Instruction Scheduling


The following factors hinder instruction scheduling.

- Loop containing an IF construct
- Data dependency between iteration cycles
  - Loop that has data dependency
  - Loop that has an unclear definition reference relationship
  - Loop containing pointer variables
- Loop with a few iterations



## Hindering Factor: Improvement of a Loop Containing an IF Construct

- SIMD Extensions with the Mask (Basics)
- SIMD Extensions with the Mask (Application)



# SIMD Extensions with the Mask (Basics)

- SIMD Extensions with the Mask (Before Improvement)
- Effects of SIMD Extensions with the Mask (Optimization Control Line Tuning)
- SIMD Extensions with the Mask (Optimization Control Line)
- Effects of SIMD Extensions with the Mask (Compiler Options Tuning)

### SIMD Extensions with the Mask (Before Improvement)

Neither SIMD optimization nor software pipelining is applied because the loop contains an IF construct. Consequently, the following event occurs frequently: No instruction commit waiting for a floating-point instruction to be completed.



#### SIMD

|                       | SIMD instruction rate (effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate<br>(/SIMD target load-store instruction) |
|-----------------------|-----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------|
| Before improvement    | 0.00%                                         | 0.00%                                                                          | 0.00%                                                            | 0.00%                                                                     |

There is no SIMD optimization.

### Effects of SIMD Extensions with the Mask (Optimization Control Line Tuning)

Specifying the SIMD specifier enables SIMD extensions with the mask, which in turn facilitates software pipelining. The result is fewer effective instructions, fewer instruction commits, better instruction scheduling, and a significant improvement in the following event: No instruction commit waiting for a floating-point instruction to be completed.



### SIMD Extensions with the Mask (Optimization Control Line)

Here, specify the following optimization control line. The last four columns show the units in which the optimization control line can be specified.

| Optimization control specifier | Meaning                    | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|----------------------------|--------------|--------------|----------------|---------------------------------|
| SIMD                           | Enables SIMD optimization. | Yes          | Yes          | No             | Yes                             |



- SIMD optimization may not be realized depending on the operation type and loop structure.

You can achieve effects similar to optimization control line tuning by specifying the following compiler option.

| Compiler options | Description of function |
|------------------|-------------------------|
| -Ksimd=2         | Gives an instruction to generate an object that uses SIMD extension instructions for loops containing an IF construct and similar loops, in addition to the function of -Ksimd=1. |

### Usage example (applied to the source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Ksimd=2

#### Notes

- Execution performance may deteriorate depending on the true ratio of the IF construct (how often its condition is true).
- Because expressions inside the IF construct are executed speculatively, an instruction that should not be executed according to the program logic may be executed and cause an error.
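
As an illustration of the notes above, the following minimal sketch marks a loop containing an IF construct with the SIMD specifier. The program and variable names are hypothetical, and the `!OCL` prefix is assumed to be the optimization control line syntax (other compilers read it as a comment). The guarded expression is one that is valid only when the condition is true, which is the kind of code where speculative execution needs attention.

```fortran
! Minimal sketch (hypothetical names): SIMD specifier on a loop with an IF.
program simd_mask_sketch
  implicit none
  integer, parameter :: n = 1024
  real(8) :: a(n), b(n)
  integer :: i

  ! Roughly half of the elements of b are negative or zero.
  do i = 1, n
     b(i) = real(i - n/2, 8)
     a(i) = 0.0d0
  end do

  ! Assumption: the optimization control line is written as "!OCL ...".
  ! With masked SIMD, both outcomes of the IF may be evaluated and the
  ! mask selects which results are stored. sqrt(b(i)) is only meaningful
  ! when b(i) > 0, so speculative evaluation for other elements is the
  ! situation the second note warns about.
!OCL SIMD
  do i = 1, n
     if (b(i) > 0.0d0) then
        a(i) = sqrt(b(i))
     end if
  end do

  print *, 'elements updated:', count(a > 0.0d0)
end program simd_mask_sketch
```

Here the condition is true for about half of the iterations; per the first note, the benefit of masked SIMD depends on this ratio.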



# SIMD Extensions with the Mask (Application)

- SIMD Extensions with the Mask (Before Improvement)
- Effects of SIMD Extensions with the Mask: Process 1 (Optimization Control Line Tuning)
- Effects of SIMD Extensions with the Mask: Process 2 (Optimization Control Line Tuning + Source Tuning)
- Effects of SIMD Optimization through Loop Unswitching (Before Improvement)
- Effects of SIMD Optimization through Loop Unswitching (After Improvement)
- Appearance of Code Optimized by Loop Unswitching
- Array Division (Before Improvement)
- Effects of Array Division (Source Tuning)

### SIMD Extensions with the Mask (Before Improvement)

Neither SIMD optimization nor software pipelining is applied because the loop contains an IF construct. Consequently, the following event occurs frequently: No instruction commit waiting for a floating-point instruction to be completed.





#### SIMD

|                       | SIMD instruction rate (effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate<br>(/SIMD target load-store instruction) |
|-----------------------|-----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------|
| Before improvement    | 0.00%                                         | 0.00%                                                                          | 0.00%                                                            | 0.00%                                                                     |

There is no SIMD optimization.

### Effects of SIMD Extensions with the Mask: Process 1 (Optimization Control Line Tuning)

Specifying the SIMD specifier enables SIMD extensions with the mask, which in turn facilitates software pipelining. This improves the following event: No instruction commit waiting for a floating-point instruction to be completed. However, as an adverse effect of SIMD extensions with the mask, the number of effective instructions increased, which caused an increase in instruction commits.



### Effects of SIMD Extensions with the Mask: Process 2 (Optimization Control Line Tuning + Source Tuning)

The adverse effect of SIMD extensions with the mask can be reduced in the next step: divide the loop and apply SIMD optimization only to the IF constructs that have a high true ratio. This decreases the number of effective instructions and improves execution performance.



### Effects of SIMD Optimization through Loop Unswitching (Before Improvement)

Neither SIMD optimization nor effective software pipelining is applied because the innermost loop contains an IF construct. Consequently, the following event occurs frequently: No instruction commit waiting for a floating-point instruction to be completed.

Before optimization (compiler listing):

         97   1            !$omp do
                           <<< Loop-information Start >>>
                           <<<  [OPTIMIZATION]
                           <<<    UNSWITCHING
                           <<< Loop-information End >>>
         98   2  p   4s    do i=1,n1
         99   2
        100   3  p   4v      if (n1 >= q) then
        101   3  p   4v        a(i) = c0+b(i)*(c1+b(i)*(c2+b(i)*(c3+b(i)*c4)))
        102   3  p   4v      endif
        103   2
        104   3  p   4v      if (n1 > r) then
        105   3  p   4v        a(i) = c0*b(i)/(c1*b(i)/(c2*b(i)/(c3*b(i)/c4)))
        106   3  p   4s      endif
        107   2
        108   3  p   4s      if (n1 < s) then
        109   3  p   4s        a(i) = c0+b(i)/(c1+b(i)/(c2+b(i)/(c3+b(i)/c4)))
        110   3  p   4v      endif
        111   2  p   4v    enddo
        112   1            !$omp enddo

[Figure: execution time breakdown in seconds before improvement, including the event "No instruction commit waiting for a floating-point instruction to be completed".]

|                       | SIMD instruction rate (effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate<br>(/SIMD target load-store instruction) |
|-----------------------|-----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------|
| Before improvement    | 0.00%                                         | 0.00%                                                                          | 0.00%                                                            | 0.00%                                                                     |

There is no SIMD optimization.

### Effects of SIMD Optimization through Loop Unswitching (After Improvement)



Specification of loop unswitching for the IF construct improves instruction scheduling and facilitates SIMD optimization and software pipelining. The result is a significant improvement in the following event: No instruction commit waiting for a floating-point instruction to be completed.
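
To illustrate the idea, the following minimal sketch applies loop unswitching by hand. It assumes, as in the example above, that the IF conditions do not change inside the loop. For brevity it uses two conditions and shortened expressions rather than the three in the listing; all names and values are hypothetical, and this is not the code the compiler generates.

```fortran
! Minimal sketch (hypothetical names): hand-written loop unswitching.
program unswitching_sketch
  implicit none
  integer, parameter :: n1 = 8
  real(8), parameter :: c0 = 1.0d0, c1 = 2.0d0, c2 = 3.0d0
  real(8) :: a(n1), a_ref(n1), b(n1)
  integer :: i, q, r
  logical :: cond1, cond2

  q = 4        ! hypothetical thresholds
  r = 16
  do i = 1, n1
     b(i) = real(i, 8)
  end do
  a = 0.0d0
  a_ref = 0.0d0

  ! Original form: loop-invariant IF constructs inside the loop.
  do i = 1, n1
     if (n1 >= q) then
        a_ref(i) = c0 + b(i)*(c1 + b(i)*c2)
     end if
     if (n1 > r) then
        a_ref(i) = c0*b(i)/(c1*b(i)/c2)
     end if
  end do

  ! Unswitched form: the conditions are tested once outside the loop,
  ! and each combination gets its own loop without any IF construct.
  cond1 = (n1 >= q)
  cond2 = (n1 > r)
  if (cond1 .and. cond2) then
     do i = 1, n1
        a(i) = c0 + b(i)*(c1 + b(i)*c2)
        a(i) = c0*b(i)/(c1*b(i)/c2)
     end do
  else if (cond1) then
     do i = 1, n1
        a(i) = c0 + b(i)*(c1 + b(i)*c2)
     end do
  else if (cond2) then
     do i = 1, n1
        a(i) = c0*b(i)/(c1*b(i)/c2)
     end do
  end if
  ! When both conditions are false, no loop body needs to run.

  print *, 'max difference:', maxval(abs(a - a_ref))
end program unswitching_sketch
```

Each loop in the unswitched form has a straight-line body, which is what allows SIMD optimization and software pipelining to be applied. The cost is code size: k invariant conditions can produce up to 2^k loop versions.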



### Appearance of Code Optimized by Loop Unswitching



| <<< [OPTIMIZATION]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                  | optimized code                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Source code                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>&lt;&lt;&lt; UNSWITCHING<br>do i=1,n1<br>!ocl unswitching    ! Condition (1)<br>  if (n1>=q) then<br>    a(i) = c0+b(i)*(c1+b(i)*(c2+b(i)*(c3+b(i)*c4)))    ! Process (1)<br>  endif<br>!ocl unswitching    ! Condition (2)<br>  if (condition (2)) then<br>    a(i) = c0*b(i)/(c1*b(i)/(c2*b(i)/(c3*b(i)/c4)))    ! Process (2)<br>  endif<br>!ocl unswitching    ! Condition (3)<br>  if (condition (3)) then<br>    a(i) = c0+b(i)/(c1+b(i)/(c2+b(i)/(c3+b(i)/c4)))    ! Process (3)<br>  endif<br>enddo</pre> |  | <pre>!Pattern (5)<br>if((condition (1) false).and.(condition (2) true).and.(condition (3) true))then<br>  do i=1,n1<br>    Process (2)<br>    Process (3)<br>  enddo<br>endif<br>!Pattern (6)<br>if((condition (1) false).and.(condition (2) true).and.(condition (3) false))then<br>  do i=1,n1<br>    Process (2)<br>  enddo<br>endif<br>!Pattern (7)<br>if((condition (1) false).and.(condition (2) false).and.(condition (3) true))then<br>  do i=1,n1<br>    Process (3)<br>  enddo<br>endif<br>!Pattern (8)<br>if((condition (1) false).and.(condition (2) false).and.(condition (3) false))then<br>  do i=1,n1<br>  enddo<br>endif</pre> | <pre>!Pattern (1)<br>if((condition (1) true).and.(condition (2) true).and.(condition (3) true))then<br>  do i=1,n1<br>    Process (1)<br>    Process (2)<br>    Process (3)<br>  enddo<br>endif<br>!Pattern (2)<br>if((condition (1) true).and.(condition (2) true).and.(condition (3) false))then<br>  do i=1,n1<br>    Process (1)<br>    Process (2)<br>  enddo<br>endif<br>!Pattern (3)<br>if((condition (1) true).and.(condition (2) false).and.(condition (3) true))then<br>  do i=1,n1<br>    Process (1)<br>    Process (3)<br>  enddo<br>endif<br>!Pattern (4)<br>if((condition (1) true).and.(condition (2) false).and.(condition (3) false))then<br>  do i=1,n1<br>    Process (1)<br>  enddo<br>endif</pre> |  |
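
The unswitching idea can be sketched at source level with two loop-invariant conditions instead of three. This is a minimal illustration, not the slide's code: the conditions `cond1` and `cond2`, the data, and the two processes are hypothetical, and the program only checks that the unswitched loops produce the same result as the original loop.

```fortran
program unswitching_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n), a_ref(n), b(n)
  logical :: cond1, cond2
  integer :: i

  ! Hypothetical loop-invariant conditions and input data.
  cond1 = .true.
  cond2 = .false.
  do i = 1, n
     b(i) = real(i, 8)
  end do

  ! Original form: both invariant IF statements are evaluated every iteration.
  a_ref = 0.0d0
  do i = 1, n
     if (cond1) a_ref(i) = a_ref(i) + b(i)        ! Process (1)
     if (cond2) a_ref(i) = a_ref(i) + 2.0d0*b(i)  ! Process (2)
  end do

  ! Unswitched form: the conditions are tested once, outside the loop,
  ! and each pattern gets its own branch-free loop.
  a = 0.0d0
  if (cond1 .and. cond2) then
     do i = 1, n
        a(i) = a(i) + b(i)
        a(i) = a(i) + 2.0d0*b(i)
     end do
  else if (cond1) then
     do i = 1, n
        a(i) = a(i) + b(i)
     end do
  else if (cond2) then
     do i = 1, n
        a(i) = a(i) + 2.0d0*b(i)
     end do
  end if

  if (any(a /= a_ref)) error stop "unswitched loops differ from original"
  print *, "unswitched loops match original"
end program unswitching_sketch
```

With k invariant conditions the number of generated loops grows as 2^k, which is why the three-condition case above expands into eight patterns.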

### Array Division (Before Improvement)

The L1 busy rate is high because indirect load and store are used. Consequently, the following is a frequent event: No instruction commit due to L1D access for a floating-point load instruction.



## Effects of Array Division (Source Tuning)

Stride load and store are used since the loop is divided in such a way that the arrays are accessed with a stride of seven or fewer elements. This results in improvement of the following event: No instruction commit due to L1 access for a floating-point load instruction.





# Hindering Factor: Improvement in Data Dependency

- Loop That Has Data Dependency
- Loop That Has an Unclear Definition Reference Relationship
- Loop Containing Pointer Variables



# Loop That Has Data Dependency

- Loop That Has Data Dependency (Before Improvement)
- Loop That Has Data Dependency (Source Tuning)

### Loop That Has Data Dependency (Before Improvement)

Neither SIMD optimization nor effective software pipelining is applied because array a has a data dependency: the iterations for i = 2 or greater reference the element that is defined when i = 1.

Consequently, the following is a frequent event: No instruction commit waiting for a floating-point instruction to be completed.



#### SIMD

|                       | SIMD instruction rate (effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate<br>(/SIMD target load-store instruction) |
|-----------------------|-----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------|---------------------------------------------------------------------------|
| Before<br>improvement | 0.00%                                         | 0.00%                                                                          | 0.00%                                                            | 0.00%                                                                     |
|                       |                                               |                                                                                |                                                                  |                                                                           |

There is no SIMD optimization.


### Loop That Has Data Dependency (Source Tuning)

SIMD optimization and software pipelining were facilitated through peeling. The result is a reduction in the number of effective instructions, a decrease in instruction commits, facilitation of instruction scheduling, and a significant improvement in the following event: No instruction commit waiting for a floating-point instruction to be completed.
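
Peeling for this kind of dependency can be sketched as follows. The loop body and the data are assumptions chosen to match the description (every iteration references a(1), which the first iteration defines); the slide's actual source is not reproduced here.

```fortran
program peeling_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n), a_ref(n), b(n)
  integer :: i

  do i = 1, n
     b(i) = real(i, 8)
  end do

  ! Original form: every iteration reads a(1), which iteration 1 defines.
  a_ref = 1.0d0
  do i = 1, n
     a_ref(i) = a_ref(1) + b(i)
  end do

  ! Peeled form: iteration 1 is executed outside the loop, so the
  ! remaining loop only reads a(1) and never redefines it.
  a = 1.0d0
  a(1) = a(1) + b(1)
  do i = 2, n
     a(i) = a(1) + b(i)
  end do

  if (any(a /= a_ref)) error stop "peeled loop differs from original"
  print *, "peeled loop matches original"
end program peeling_sketch
```

After peeling, a(1) is loop-invariant inside the remaining loop, so no iteration depends on a value defined by another iteration of that loop.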





## Loop That Has an Unclear Definition Reference Relationship

- Loop That Has an Unclear Definition Reference Relationship (Before Improvement)
- Loop That Has an Unclear Definition Reference Relationship (Optimization Control Line Tuning)
- Loop That Has an Unclear Definition Reference Relationship (Optimization Control Line)

There is neither SIMD optimization nor effective software pipelining because of unclear data dependency regarding array a. Consequently, the following is a frequent event: No instruction commit waiting for a floating-point instruction to be completed.



#### SIMD

|                       | SIMD instruction rate (effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate (/SIMD target load-store instruction) |
|-----------------------|-----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------------|------------------------------------------------------------------------|
| Before<br>improvement | 0.00%                                         | 0.00%                                                                          | 0.00%                                                            | 0.00%                                                                  |
|                       |                                               |                                                                                |                                                                  |                                                                        |

There is no SIMD optimization.


### Loop That Has an Unclear Definition Reference Relationship (Optimization Control Line Tuning)

With the absence of a data dependency made explicit by the NORECURRENCE specifier, SIMD optimization and software pipelining were facilitated. The result is a reduction in the total number of effective instructions, a decrease in instruction commits, facilitation of instruction scheduling, and a significant improvement in the following event: No instruction commit waiting for a floating-point instruction to be completed.






Here, specify the following optimization control line. The four rightmost columns show the units in which the optimization control line can be specified.

| Optimization control specifier | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|---------|--------------|--------------|----------------|---------------------------------|
| NORECURRENCE[(array1[,array2])] | Tells the compiler that, for the elements of an array that is an operation target in a DO loop, an element defined in one iteration is not referenced in a different iteration. (The instruction marks the arrays as ones for which loop slicing is possible.) array1 and array2 are array names. | Yes | Yes | No | Yes |
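
A minimal sketch of where such a control line would sit, assuming an indirectly indexed update whose index list has no duplicates. The index array `idx` and the data are hypothetical, and the `!ocl` line is written in the form seen in the compiler listings of this chapter; compilers that do not recognize it treat it as an ordinary comment.

```fortran
program norecurrence_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n), b(n)
  integer :: idx(n), i

  ! Assumed index list: a permutation (here a reversal), so no element
  ! of a is touched by more than one iteration.
  do i = 1, n
     idx(i) = n - i + 1
     a(i) = 0.0d0
     b(i) = real(i, 8)
  end do

  ! The compiler cannot prove that idx has no duplicates, so the
  ! programmer asserts that a carries no cross-iteration dependency.
!ocl norecurrence(a)
  do i = 1, n
     a(idx(i)) = a(idx(i)) + b(i)
  end do

  ! a(n-i+1) received b(i) = i, so a(j) must equal n-j+1.
  do i = 1, n
     if (a(i) /= real(n - i + 1, 8)) error stop "unexpected result"
  end do
  print *, "indirect update completed"
end program norecurrence_sketch
```

The assertion is the programmer's responsibility: if `idx` did contain duplicates, the specifier would permit transformations that change the result.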



# Loop Containing Pointer Variables

- Loop Containing Pointer Variables (Before Improvement)
- Loop Containing Pointer Variables (Optimization Control Line Tuning)
- Making Data Dependency Explicit Regarding Array Subscripts (Optimization Control Line)
- Speed-up by CONTIGUOUS attribute

#### Loop Containing Pointer Variables (Before Improvement)

SIMD optimization and software pipelining are not facilitated because the pointer variables of arrays a and b point to unknown memory areas. Consequently, the following is a frequent event: No instruction commit waiting for a floating-point instruction to be completed.



#### Loop Containing Pointer Variables (Optimization Control Line Tuning)


With the NOALIAS specifier making it explicit that the pointer variables do not share memory areas with other variables, SIMD optimization and software pipelining were facilitated. This results in significant improvement of the following event: No instruction commit waiting for a floating-point instruction to be completed.




Here, specify the following optimization control line. The four rightmost columns show the units in which the optimization control line can be specified.

| Optimization control specifier | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|---------|--------------|--------------|----------------|---------------------------------|
| NOALIAS | Gives an instruction that pointer variables are not to share memory areas with other variables. | Yes | Yes | No | No |
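
As an illustration, the sketch below places the control line before a loop over two pointer arrays that are known to point to distinct targets. The array names, sizes, and loop body are assumptions; the `!ocl` line is an ordinary comment to compilers that do not recognize it.

```fortran
program noalias_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8), target  :: x(n), y(n)
  real(8), pointer :: a(:), b(:)
  integer :: i

  x = 0.0d0
  do i = 1, n
     y(i) = real(i, 8)
  end do

  ! a and b point to distinct targets, so they never overlap.
  a => x
  b => y

  ! Without the assertion the compiler must assume a and b may overlap.
!ocl noalias
  do i = 1, n
     a(i) = 2.0d0 * b(i) + 1.0d0
  end do

  if (any(x /= 2.0d0 * y + 1.0d0)) error stop "unexpected result"
  print *, "pointer loop completed"
end program noalias_sketch
```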

## Speed-up by CONTIGUOUS attribute



Optimization of arrays with the POINTER attribute may be promoted by adding the CONTIGUOUS attribute to them.



In the above case, specifying the CONTIGUOUS attribute allowed XFILL optimization to be applied, which improved performance.

#### CONTIGUOUS attribute (Specification introduced by Fortran 2008)

The CONTIGUOUS attribute can be specified for assumed-shape arrays or pointer arrays:

- An assumed-shape array should be associated with an actual argument that has the CONTIGUOUS attribute.
- A pointer array should be associated with a target that has the CONTIGUOUS attribute.
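
The two cases can be sketched in standard Fortran 2008 as follows. The module, procedure, and variable names are hypothetical, and whether XFILL or any other optimization is actually applied depends on the compiler and its options.

```fortran
module contiguous_sketch
  implicit none
contains
  ! Assumed-shape dummy argument declared CONTIGUOUS: the compiler may
  ! assume unit-stride storage inside the procedure.
  subroutine fill_zero(v)
    real(8), contiguous, intent(out) :: v(:)
    v = 0.0d0
  end subroutine fill_zero
end module contiguous_sketch

program contiguous_demo
  use contiguous_sketch
  implicit none
  real(8), target :: buf(1000)
  real(8), pointer, contiguous :: p(:)

  buf = 1.0d0
  p => buf              ! a whole array is a contiguous target
  call fill_zero(p)

  if (.not. is_contiguous(p)) error stop "pointer is not contiguous"
  if (any(buf /= 0.0d0)) error stop "fill did not reach the target"
  print *, "contiguous pointer filled"
end program contiguous_demo
```

Associating a CONTIGUOUS pointer with a strided section such as `buf(1:1000:2)` is not permitted, which is what lets the compiler rely on the attribute.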



# Hindering Factor: Improvement of a Loop with a Few Iterations

- Loop with a Few Iterations
- Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining
- Cloning
  - Before Improvement by Cloning
  - Cloning Source Tuning

## Loop with a Few Iterations



Even for a loop for which software pipelining was facilitated, instruction scheduling may have little effect if the number of loop iterations is small.





## Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining

- Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining (Before Improvement)
- Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining (After Improvement)
- Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining (Optimization Control Line)
- Specification of an Appropriate Number of Unrollings and Suppression of Software Pipelining (Compiler Options)



Unrolling and software pipelining do not function effectively because the number of iterations is small. Consequently, the following is a frequent event: No instruction commit waiting for a floating-point instruction to be completed.

| Source code before improvement |
|--------------------------------|
| <pre>&lt;&lt;&lt; Loop-information Start &gt;&gt;&gt;<br>&lt;&lt;&lt;  [PARALLELIZATION]<br>&lt;&lt;&lt;    Standard iteration count: 534<br>&lt;&lt;&lt;  [OPTIMIZATION]<br>&lt;&lt;&lt;    SIMD(VL: 4)<br>&lt;&lt;&lt;    SOFTWARE PIPELINING<br>&lt;&lt;&lt; Loop-information End &gt;&gt;&gt;<br>42 1 pp 6v  do i = 1 , n<br>43 1 p  6v    b(i) = c0 + a(i)*(c1 + a(i)*(c2 + a(i)*(c3 + a(i)*<br>44 1         &amp;  (c4 + a(i)*(c5 + a(i)*(c6 + a(i)*(c7 + a(i)*<br>45 1         &amp;  (c8 + a(i)*c9))))))))<br>46 1 p  6v  enddo<br>47 End</pre> |

Annotations on the listing: the number of unrollings is 6 (6v), SIMD is applied with VL 4, and the loop iteration count is n = 16.

[Figure: execution time [sec] before improvement, including the component "No instruction commit waiting for a floating-point instruction to be completed".]

Problem: The loop iteration count is small.

- Software pipelining loops are not executed.
  - Although "SOFTWARE PIPELINING" is displayed as optimization information, the software-pipelined loop is not executed because the number of iterations is small.
  - jwd8205o-i "sample.f", line 42: At a loop iteration count of 120 or more, a loop to which software pipelining is applied is selected at the execution time.
- The number of unrollings is not appropriate.
  - 6 unrollings x 4 (SIMD) = 24, which exceeds the 16 iterations, so the unrolled loop is not executed even once.
  - The original loop is executed for all 16 iterations.
  - The execution of the original loop has a significant effect on performance.



Appropriate instruction scheduling was done with a specified number of unrollings appropriate to the number of iterations and with software pipelining suppressed. This resulted in the following event being significantly reduced: No instruction commit waiting for a floating-point instruction to be completed.





With 4 specified as the number of unrollings, 4 unrollings x 4 (SIMD) = 16, so the unrolled loop is executed for all 16 iterations.

Here, specify the following optimization control line. The four rightmost columns show the units in which the optimization control line can be specified.

| Optimization control specifier | Meaning | Program unit | DO loop unit | Statement unit | Array assignment statement unit |
|--------------------------------|---------|--------------|--------------|----------------|---------------------------------|
| UNROLL(m1) | Unrolls a DO loop. m1 is a decimal number in a range of 2 to 100 that indicates the number of unrollings (multiplicity). | No | Yes | No | No |
| NOSWP | Disables the software pipelining function. | Yes | Yes | No | Yes |
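
Applied to the polynomial loop from the listing, the two specifiers might be placed as in the sketch below. The coefficient values and input data are hypothetical, and writing the specifiers on two separate `!ocl` lines directly before the loop is an assumption about placement; other compilers see them as comments.

```fortran
program unroll_noswp_sketch
  implicit none
  integer, parameter :: n = 16          ! small iteration count, as in the example
  real(8), parameter :: c0 = 1.0d0, c1 = 1.0d0, c2 = 1.0d0, c3 = 1.0d0, &
                        c4 = 1.0d0, c5 = 1.0d0, c6 = 1.0d0, c7 = 1.0d0, &
                        c8 = 1.0d0, c9 = 1.0d0   ! hypothetical coefficients
  real(8) :: a(n), b(n)
  integer :: i

  a = 1.0d0

  ! 4 unrollings x 4 (SIMD) = 16 iterations per pass of the unrolled loop,
  ! which matches n; software pipelining is suppressed for this short loop.
!ocl unroll(4)
!ocl noswp
  do i = 1, n
     b(i) = c0 + a(i)*(c1 + a(i)*(c2 + a(i)*(c3 + a(i)*  &
            (c4 + a(i)*(c5 + a(i)*(c6 + a(i)*(c7 + a(i)*  &
            (c8 + a(i)*c9))))))))
  end do

  ! With a(i) = 1 and every coefficient 1, the polynomial sums ten ones.
  if (any(b /= 10.0d0)) error stop "unexpected polynomial value"
  print *, "polynomial loop completed"
end program unroll_noswp_sketch
```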

You can achieve effects similar to source tuning by specifying the following compiler options.

| Compiler options | Description of function |
|------------------|-------------------------|
| -Kunroll[=N] | Gives an instruction to optimize loop unrolling. Specify an upper limit in N for the number of loop unrollings. You can specify a value in a range of 2 to 100 for N. If the specification of N is omitted, the compiler automatically determines the optimal value. If the -O0 or -O1 option is valid, the default is -Knounroll. If -O2 or a higher option is valid, the default is -Kunroll. |
| -Knoswp | Gives an instruction not to perform software pipelining. |

#### Use example (source code before improvement)

\$ frtpx -Kfast,parallel sample.f90 -Kunroll=8,noswp

### Before Improvement by Cloning



Although the number of iterations of the innermost loop is heavily skewed toward the value taken under a specific condition, the compiler sees it only as a variable. For this reason, optimizations such as full unrolling are hindered, the number of instructions increases, and the following is a frequent event: No instruction commit waiting for a floating-point instruction to be completed.



## **Cloning Source Tuning**



A clone optimization control line is specified so that a conditional branch on the variable's value is created for the innermost loop, which facilitates full unrolling and other optimizations. This results in significant improvement of the following event: No instruction commit waiting for a floating-point instruction to be completed.
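
The effect of cloning can be sketched by hand at source level. In this illustration the inner iteration count `k`, its dominant value 4, and the summation body are all hypothetical; the exact syntax of the clone control line is not shown in this section, so the sketch writes out the branch that cloning is described as creating.

```fortran
program cloning_sketch
  implicit none
  integer, parameter :: m = 100
  real(8) :: x(8, m), s_generic(m), s_cloned(m)
  integer :: k, i, j

  k = 4   ! hypothetical: the inner trip count is almost always 4 at run time
  do j = 1, m
     do i = 1, 8
        x(i, j) = real(i + j, 8)
     end do
  end do

  ! Generic form: the inner trip count is only known as the variable k.
  s_generic = 0.0d0
  do j = 1, m
     do i = 1, k
        s_generic(j) = s_generic(j) + x(i, j)
     end do
  end do

  ! Cloned form: a branch on k selects a copy of the loop nest whose inner
  ! trip count is the constant 4, which the compiler can fully unroll.
  s_cloned = 0.0d0
  if (k == 4) then
     do j = 1, m
        do i = 1, 4
           s_cloned(j) = s_cloned(j) + x(i, j)
        end do
     end do
  else
     do j = 1, m
        do i = 1, k
           s_cloned(j) = s_cloned(j) + x(i, j)
        end do
     end do
  end if

  if (any(s_cloned /= s_generic)) error stop "cloned loop differs from generic loop"
  print *, "cloned loop matches generic loop"
end program cloning_sketch
```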



|                    | Effective instruction |
|--------------------|-----------------------|
| Before improvement | 1.34E+08              |
| After improvement  | 6.71E+07              |



## Various Optimizations

- Rerolling
  - Rerolling (Before Improvement)
  - Effects of Rerolling (Source Tuning)
- Facilitation of SIMD Optimization through Changes to Simple Variables
- Inline expansion: procedures using associated allocatable variable

## Rerolling (Before Improvement)



There are many instruction commits because SIMD optimization is unavailable due to the dependency in the innermost loop.

| Source code after improvemen                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 
it (source tuning)                                                                | [sec]                                                            |                                                                          |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-
----------------------------------------------------------------------------------|------------------------------------------------------------------|--------------------------------------------------------------------------|
| <<< Loop-information Start<br><<< [PARALLELIZATION]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 
>>>                                                                               | 2.2E-01                                                          |                                                                          |
| <<< Standard iteration count: 2 |
| <<< Loop-information End >>> |
| 42 1 pp DO K=1,M |
| <<< Loop-information Start >>> |
| <<< [OPTIMIZATION] |
| <<< SOFTWARE PIPELINING |
| <<< Loop-information End >>> |
| 43 2 p 4s DO J=1,N |
| 44 2 p 4s a(1,J,K) = b(1,J,K) + ... |
| 45 2 p 4s a(2,J,K) = b(2,J,K) + ... |
| 46 2 p 4s a(3,J,K) = b(3,J,K) + ... |
| 47 2 p 4s a(4,J,K) = b(4,J,K) + a(4,J-1,K) |
| 48 2 p 4s ENDDO |
| 49 1 p ENDDO |
| 50 end subroutine rerolling |

[Figure: bar chart, before improvement. Vertical axis: 0.0E+00 to 2.0E-01. Legend: four instructions commit; two or three instructions commit; one instruction commit; no instruction commit waiting for a floating-point instruction to be completed.]

SIMD: There is no SIMD optimization.

| | SIMD instruction rate (/effective instruction) | SIMD floating point instruction rate (/SIMD target floating point instruction) | SIMD integer instruction rate (/SIMD target integer instruction) | SIMD load-store instruction rate (/SIMD target load-store instruction) |
|---|---|---|---|---|
| Before | 0.00% | 0.00% | 0.00% | 0.00% |

## Effects of Rerolling (Source Tuning)

Rewriting unrolled statements as a loop (rerolling) facilitates SIMD optimization. This reduces the number of effective instructions and improves performance.
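As a minimal sketch of rerolling, the module below places an unrolled loop body next to its rerolled equivalent. The module name `rerolling_sketch`, the subroutine names, the array bounds, and the loop body are assumptions modeled on the listing above, not the original program. In the rerolled form, the iterations of the innermost `i` loop are independent of one another (the dependence on `a(i,j-1,k)` is carried by the `j` loop), which makes that loop a candidate for SIMD optimization. Whether a given compiler actually applies SIMD depends on the compiler and its options.

```fortran
module rerolling_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Unrolled form: one explicit statement per value of the first subscript.
  subroutine unrolled(a, b, n, m)
    integer, intent(in) :: n, m
    real(dp), intent(inout) :: a(4, 0:n, m)   ! column j = 0 holds initial values
    real(dp), intent(in) :: b(4, n, m)
    integer :: j, k
    do k = 1, m
      do j = 1, n
        a(1,j,k) = b(1,j,k) + a(1,j-1,k)
        a(2,j,k) = b(2,j,k) + a(2,j-1,k)
        a(3,j,k) = b(3,j,k) + a(3,j-1,k)
        a(4,j,k) = b(4,j,k) + a(4,j-1,k)
      end do
    end do
  end subroutine unrolled

  ! Rerolled form: the four statements become an innermost loop over i.
  subroutine rerolled(a, b, n, m)
    integer, intent(in) :: n, m
    real(dp), intent(inout) :: a(4, 0:n, m)
    real(dp), intent(in) :: b(4, n, m)
    integer :: i, j, k
    do k = 1, m
      do j = 1, n
        do i = 1, 4          ! iterations over i do not depend on each other
          a(i,j,k) = b(i,j,k) + a(i,j-1,k)
        end do
      end do
    end do
  end subroutine rerolled

end module rerolling_sketch
```

Calling both subroutines on identical input arrays should produce identical results, which is a quick way to confirm that the rewrite preserves the computation before comparing PA information.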





## Changing arrays with constant subscripts to simple variables may facilitate SIMD optimization.

Example: Changing a program to use simple variables

| Before correction | After correction |
|-------------------|------------------|
| do i = 1, N       | do i = 1, N      |
| :                 | :                |
| a(1) = b(1,i)     | a1 = b(1,i)      |
| a(2) = b(2,i)     | a2 = b(2,i)      |
| a(3) = b(3,i)     | a3 = b(3,i)      |
| a(4) = b(4,i)                 | a4 = b(4,i)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| a(5) = b(5,i)                 | a5 = b(5,i)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| a(6) = b(6,i)                 | :                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| :                             | x = u(2) * a1 + u(3) * a2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| x = u(2) * a(1) + u(3) * a(2) | y = u(2) * a3 + u(3) * a4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| y = u(2) * a(3) + u(3) * a(4) | :                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| z = u(2) * a(4) + u(3) * a(6) | end do                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| :                             | end do                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| end do                        |  |



Procedures that contain a use-association of an allocatable variable are not targets of inline expansion. If the allocatable variable is not actually referenced in the procedure, copy the module and modify the copy to delete the use-association; the procedure then becomes a target of inline expansion.





## Thread Parallelization Processing Tuning

- Thread Parallelization Ratio Improvement
- Execution Efficiency Improvement of Thread Parallelization Processing



## Thread Parallelization Ratio Improvement

- What Is the Thread Parallelization Ratio?
- Thread Parallelization Ratio Improvement

## What Is the Thread Parallelization Ratio?



The thread parallelization ratio is the percentage of the processing, measured in execution with one thread, that can be executed in parallel.



Amdahl's law shows the relationship between the thread parallelization ratio and scalability when executing with n threads.
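
For reference, Amdahl's law in its usual form is given below. The symbols are notation introduced here, not taken from the slides: $p$ is the thread parallelization ratio expressed as a fraction, $n$ is the number of threads, and $S(n)$ is the speedup over one-thread execution.

$$
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
$$

As an illustrative calculation with assumed values $p = 0.95$ and $n = 16$:

$$
S(16) = \frac{1}{0.05 + 0.95/16} = \frac{1}{0.109375} \approx 9.14
$$

Even with unlimited threads, the speedup in this example cannot exceed $1/0.05 = 20$, which is why raising the parallelization ratio matters before adding threads.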





## Thread Parallelization Ratio Improvement

- Loop That Has an Unclear Definition Reference Relationship
- Loop Containing Pointer Variables
- Loop That Has Data Dependency

## Loop That Has an Unclear Definition Reference Relationship

#### !OCL NORECURRENCE

In the following program, the compiler cannot determine whether applying loop slicing to array a will cause a problem, because the subscript expression of array a is an element of another array, y(i). If the programmer knows that loop slicing of array a will not cause a problem, specifying the NORECURRENCE specifier enables parallelization.

| Source code before improvement | Source code after improvement |
|--------------------------------|-------------------------------|
| 5 | 5 !ocl norecurrence(a) |
| 6 1 s 8s do i=1,1000 | <<< Loop-information Start >>> |
| 7 1 m 8m a(y(i))=a(y(i))+b(i) | <<< [PARALLELIZATION] |
| 8 1 p 8v end do | |
| 9 END | |
| | |
| jwd5228p-i "a.f90", line 7: This DO loop cannot be parallelized because the definition reference order of data differs from the order of sequential execution. | |
| jwd6228s-i "a.f90", line 7: This DO loop cannot be SIMD-optimized because the definition reference order of data differs from the order of sequential execution. | |

#### !Note!

- If the NORECURRENCE specifier is specified for an array for which loop slicing is not possible, the compiler may apply incorrect loop slicing.
- If the array name is omitted, the specification is valid for all arrays in the target section.
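
The following is a minimal, self-contained sketch of the indirect-subscript pattern. It assumes that y holds a permutation of the indices (a simple reversal, chosen only for illustration), which is the kind of knowledge that makes NORECURRENCE safe to specify. The array size and initial values are also illustrative. A compiler that does not recognize the optimization control line treats it as an ordinary comment.

```fortran
program norecurrence_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: i, y(n)
  real :: a(n), b(n), expected(n)

  ! Assumption: y is a permutation of 1..n (here a reversal), so each
  ! iteration updates a different element of a.
  do i = 1, n
    y(i) = n + 1 - i
    a(i) = real(i)
    b(i) = 1.0
  end do
  expected = a + b   ! every element of a is incremented exactly once

  ! The programmer asserts that slicing this loop is safe for array a.
  !ocl norecurrence(a)
  do i = 1, n
    a(y(i)) = a(y(i)) + b(i)
  end do

  if (any(a /= expected)) error stop "unexpected result"
  print *, "ok"
end program norecurrence_sketch
```

If y contained duplicate values, two iterations would update the same element of a, and the assertion made by the directive would no longer hold.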

## Loop Containing Pointer Variables

#### !OCL NOALIAS

The loop is not parallelized because the data dependency is unclear: the memory areas that the pointer variables point to are determined only at execution time. If the programmer knows that the pointer variables do not point to the same memory area, specifying the NOALIAS specifier enables parallelization.

| Source code before improvement                                                                               | Source code after improvement      |
|--------------------------------------------------------------------------------------------------------------|------------------------------------|
| 1 real,dimension(100000),target::x                                                                           | 1 real,dimension(100000),target::x |
| 2 real,dimension(:),pointer::a,b                                                                             | 2 real,dimension(:),pointer::a,b   |
| 3 a=>x(1:10000)                                                                                              | 3 a=>x(1:10000)                    |
| 4 b=>x(10001:20000)                                                                                          | 4 b=>x(10001:20000)                |
| 5                                                                                                            | 5 !ocl noalias                     |
| 6 1 s s do i=1,100000                                                                                        | <<< Loop-information Start >>>     |
| 7 1 s s b(i) = a(i)+1.0                                                                                      | <<< [PARALLELIZATION]              |
| 8 1 s s end do                                                                                               | <<< Standard iteration count: 1143 |
| 9 end                                                                                                        | <<< [OPTIMIZATION]                 |
|                                                                                                              | <<< SIMD(VL: 8)                    |
| jwd5228p-i "a.f90", line 7: This DO loop cannot be                                                           | <<< SOFTWARE PIPELINING            |
| parallelized because the definition reference order of                                                       | <<< Loop-information End >>>       |
| data differs from the order of sequential execution.                                                         | 6 1 pp 8v do i=1,100000            |
| jwd6228s-i "a.f90", line 7: This DO loop cannot be SIMD-optimized because the definition reference order | 7 1 p 8v b(i) = a(i)+1.0           |
| of data differs from the order of sequential execution.                                                      | 8 1 p 8v end do                    |
|                                                                                                              | 9 end                              |
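
A runnable sketch of the same idea is shown below, assuming two pointers associated with disjoint sections of one target array. The array size, the initial value, and the use of size(a) as the loop bound (so that the loop stays inside the 10000-element sections) are choices made for this sketch. A compiler that does not recognize the optimization control line treats it as a comment.

```fortran
program noalias_sketch
  implicit none
  real, dimension(20000), target :: x
  real, dimension(:), pointer :: a, b
  integer :: i

  x = 2.0
  a => x(1:10000)       ! first half of x
  b => x(10001:20000)   ! second half of x, no overlap with a

  ! The programmer asserts that a and b never refer to the same memory.
  !ocl noalias
  do i = 1, size(a)
    b(i) = a(i) + 1.0
  end do

  if (any(b /= 3.0)) error stop "unexpected result"
  print *, "ok"
end program noalias_sketch
```

The assertion would be wrong if, for example, b were associated with x(2:10001), because b(i) would then overwrite a(i+1) before it is read.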

## Loop That Has Data Dependency



#### Parallelization through peeling

The following loop is not parallelized because it has a dependency on array a at i = 1 and i = n. To enable parallelization, eliminate the dependency by moving the first and last iterations outside the loop.

| Source code before improvement | Source code after improvement |
|--------------------------------|-------------------------------|
| 4 1 s 8s do i=1,n | |
| 5 1 s 8m a(i)=a(1)+b(i)+a(n) | |
| 6 1 s 8v end do | |
| | |
| jwd5202p-i "a.f90", line 5: This DO loop cannot be parallelized because the definition reference order of data differs from the order of sequential execution. (Name: a) | |
| jwd5208p-i "a.f90", line 5: The definition reference order is unknown and the reference order may differ from the order of sequential execution, so this DO loop cannot be parallelized. (Name: a) | |
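
Peeling for the loop `a(i)=a(1)+b(i)+a(n)` can be sketched as follows. This is an illustration written for this section, not the original listing. The array size and initial values are assumptions, and the peeled form as written requires n >= 2. The program keeps the original loop as a sequential reference and checks that the peeled form produces the same result.

```fortran
program peeling_sketch
  implicit none
  integer, parameter :: n = 8      ! illustrative size; must be >= 2
  real :: a(n), b(n), ref(n)
  integer :: i

  do i = 1, n
    b(i) = real(i)
  end do
  a = 1.0
  ref = a

  ! Original form, kept as a sequential reference.
  do i = 1, n
    ref(i) = ref(1) + b(i) + ref(n)
  end do

  ! Peeled form: the first and last iterations are moved outside the loop.
  a(1) = a(1) + b(1) + a(n)
  do i = 2, n - 1
    a(i) = a(1) + b(i) + a(n)   ! a(1) and a(n) are only read here
  end do
  a(n) = a(1) + b(n) + a(n)

  if (any(a /= ref)) error stop "peeled loop differs from original"
  print *, "ok"
end program peeling_sketch
```

Inside the remaining loop, a(1) and a(n) are never written, so the iterations for i = 2 to n-1 are independent of each other.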



# Execution Efficiency Improvement of Thread Parallelization Processing

- Improvement in False Sharing
- Improvement in Load Imbalance



## Improvement in False Sharing

- What Is False Sharing?
- False Sharing (Before Improvement)
- False Sharing (Source Tuning)

## What Is False Sharing?

False sharing is a phenomenon in which cache lines shared between threads are frequently invalidated or copied back.

Figure: thread 0 executes an instruction to update s(1).



## False Sharing (Before Improvement)



False sharing occurs because the number of iterations of j, the parallelized dimension, is small (16), so data in array a shares cache lines between threads. Consequently, data access waits occur frequently.



#### Cache

|                       | L1I miss rate<br>(/effective<br>instruction) | L1D miss rate(/Load-<br>store instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | L2 miss | L2 throughput | Memory<br>throughput<br>(GB/sec) |
|-----------------------|------------|--------------------------------------------|----------|-------|--------|----------------------------------|-------------------------------------------|-----------|---------------|----------------------------------|
| Before<br>improvement | 0.00%      | 24.98%                                     | 7.21E+08 | 0.21% | 99.79% | 0.00%                            | 0.00%                                     | 1.05E+04  | 355.62        | 0.01                             |

The L1D miss rate is high (24.98%), which indicates that false sharing has occurred.

## False Sharing (Source Tuning)



False sharing can be avoided through loop interchange and parallelization of the outer loop. This reduces the number of L1 cache misses and improves the data access wait.
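
The transformation can be sketched as follows, under stated assumptions. The array name a and the 16 iterations of j come from the text. The array shape a(16, ni), the second array c, the loop body, and the use of OpenMP directives in place of automatic parallelization are assumptions made for illustration. Without OpenMP enabled, the directives are comments and the program runs sequentially.

```fortran
program false_sharing_sketch
  implicit none
  integer, parameter :: nj = 16, ni = 10000   ! ni is an illustrative size
  real(8) :: a(nj, ni), c(nj, ni)
  integer :: i, j

  a = 0.0d0
  c = 0.0d0

  ! Before: j (only 16 iterations) is the parallelized dimension.
  ! Elements a(j,i) with adjacent j are contiguous in memory, so
  ! different threads repeatedly write to the same cache line.
  !$omp parallel do private(i)
  do j = 1, nj
    do i = 1, ni
      a(j, i) = a(j, i) + 1.0d0
    end do
  end do
  !$omp end parallel do

  ! After: the loops are interchanged and the outer i loop is
  ! parallelized, so each thread writes whole columns c(:,i).
  !$omp parallel do private(j)
  do i = 1, ni
    do j = 1, nj
      c(j, i) = c(j, i) + 1.0d0
    end do
  end do
  !$omp end parallel do

  if (any(a /= c)) error stop "results differ"
  print *, "ok"
end program false_sharing_sketch
```

Both loop nests compute the same values; only the assignment of memory to threads changes. In the second form, threads can share a cache line only at the boundaries between their blocks of columns.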



#### Cache

|                    | L1I miss rate<br>(/effective<br>instruction) | L1D miss rate(/Load-<br>store instruction) | L1D miss | L1D miss dm<br>rate(/L1D miss) | L1D miss hwpf<br>rate(/L1D miss) | L1D miss swpf<br>rate(/L1D miss) | L2 miss rate(/Load-<br>store instruction) | L2 miss  | L2 throughput | Memory<br>throughput<br>(GB/sec) |
|--------------------|---------------------------------------------|--------------------------------------------|----------|--------------------------------|----------------------------------|----------------------------------|-------------------------------------------|----------|---------------|----------------------------------|
| Before improvement | 0.00%                                       | 24.98%                                     | 7.21E+08 | 0.21%                          | 99.79%                           | 0.00%                            | 0.00%                                     | 1.05E+04 | 355.62        | 0.01                             |
| After improvement  | 0.00%                                       | 1.59%                                      | 4.59E+07 | 2.65%                          | 97.35%                           | 0.00%                            | 0.00%                                     | 1.09E+04 | 36.28         | 0.02                             |

Avoiding false sharing reduced the L1D miss rate from 24.98% to 1.59% and improved performance.



## Improvement in Load Imbalance

- Triangular Loop
- Loops with Irregular Amount of Calculation
- Small Loop Iteration Count of a Parallelized Dimension



## Triangular Loop

- What Is a Triangular Loop?
- Triangular Loop (Before Improvement)
- Triangular Loop (OpenMP Tuning)

## What Is a Triangular Loop?


A triangular loop is a loop nest in which the initial value or end value of an inner loop is determined by the control variable of an outer loop. If the outer loop is divided into contiguous blocks that are executed in parallel, a load imbalance occurs.



## Triangular Loop (Before Improvement)

A load imbalance occurs because the amount of calculation differs from thread to thread. Consequently, synchronous waiting time between threads increases.







Poor load balance between different threads!

## Triangular Loop (OpenMP Tuning)



When the iterations are divided into small units that are allocated to the threads cyclically, the amount of calculation becomes nearly uniform across threads. This improves the load imbalance and reduces synchronous waiting time between threads.
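
Cyclic allocation in small units can be sketched with the OpenMP `schedule(static,1)` clause. The chunk size of 1, the loop body, and the array size are assumptions made for this sketch; the original listing may use different values. Without OpenMP enabled, the directives are comments and the program runs sequentially.

```fortran
program triangular_sketch
  implicit none
  integer, parameter :: n = 1000   ! illustrative size
  real(8) :: x(n), s
  integer :: i, j

  ! With the default block division, the thread that owns the smallest
  ! values of i executes the longest inner loops. schedule(static,1)
  ! hands out iterations of i one at a time in round-robin order, so
  ! every thread receives a mix of long and short inner loops.
  !$omp parallel do schedule(static,1) private(j, s)
  do i = 1, n
    s = 0.0d0
    do j = i, n                    ! triangular: start value depends on i
      s = s + 1.0d0
    end do
    x(i) = s
  end do
  !$omp end parallel do

  ! Total inner iterations of a triangular nest: n*(n+1)/2.
  if (sum(x) /= dble(n) * dble(n + 1) / 2.0d0) error stop "unexpected result"
  print *, "ok"
end program triangular_sketch
```

A larger chunk size lowers scheduling granularity at the cost of a less even distribution, so the best value depends on the loop.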








## Loops with Irregular Amount of Calculation

- Loop Containing an IF Construct (Before Improvement)
- Loop Containing an IF Construct (OpenMP Tuning)
- Loop with an Irregular Amount of Calculation (Before Improvement)
- Loop with an Irregular Amount of Calculation (OpenMP Tuning)

#### Loop Containing an IF Construct (Before Improvement)


The loop contains an IF construct, so the amount of calculation varies between iterations and therefore between threads. In this case, cyclic division with static specified as the scheduling method may not resolve the load imbalance. Consequently, synchronous waiting time between threads occurs frequently.
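
A small Python model shows one way cyclic static division can fail. The cost pattern here is a hypothetical worst case chosen for illustration: the IF condition is assumed to be true for every fourth iteration, which lines up exactly with a cyclic distribution over 4 threads.

```python
def static_cyclic(costs, threads):
    """Per-thread work when iteration i is statically assigned to thread i % threads."""
    work = [0] * threads
    for i, cost in enumerate(costs):
        work[i % threads] += cost
    return work

# Hypothetical cost model: the IF branch costs 10 units, the other path 1 unit.
costs = [10 if i % 4 == 0 else 1 for i in range(16)]
work = static_cyclic(costs, threads=4)
print(work)  # [40, 4, 4, 4]
```

Because the assignment is fixed before execution, every expensive iteration lands on the same thread in this model.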




#### Loop Containing an IF Construct (OpenMP Tuning)

Specifying dynamic as the scheduling method improves the load balance, because a thread that finishes its work early goes on to execute the next unit of work.
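
Dynamic scheduling can be modeled as handing each iteration to whichever thread becomes idle first. The following Python sketch simulates that rule with a priority queue. The cost pattern (10 units when the IF branch is taken on every fourth iteration, 1 unit otherwise) is a hypothetical example, and the model ignores the scheduling overhead that dynamic scheduling adds in a real run.

```python
import heapq

def dynamic_schedule(costs, threads):
    """Per-thread busy time when each iteration goes to the thread that is idle first."""
    free_at = [(0, t) for t in range(threads)]  # (time the thread becomes idle, thread id)
    heapq.heapify(free_at)
    busy = [0] * threads
    for cost in costs:
        time, t = heapq.heappop(free_at)
        busy[t] += cost
        heapq.heappush(free_at, (time + cost, t))
    return busy

costs = [10 if i % 4 == 0 else 1 for i in range(16)]
busy = dynamic_schedule(costs, threads=4)
print(busy)       # [12, 12, 16, 12]
print(max(busy))  # 16, versus 40 if iteration i were fixed to thread i % 4
```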




#### Loop with an Irregular Amount of Calculation (Before Improvement)

When the amount of calculation fluctuates irregularly between iterations, cyclic division with static specified as the scheduling method may not resolve the load imbalance. Consequently, synchronous waiting time between threads occurs frequently.



#### Loop with an Irregular Amount of Calculation (OpenMP Tuning)

Specifying dynamic as the scheduling method improves the load balance, because a thread that finishes its work early goes on to execute the next unit of work.
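
For an irregular cost pattern, the two scheduling methods can be compared by the time at which the last thread finishes. The Python sketch below is a simplified model with a hand-picked, hypothetical cost list and 2 threads. It assumes that static cyclic division fixes iteration `i` to thread `i % threads` and that dynamic scheduling gives the next iteration to the first idle thread, with no scheduling overhead.

```python
import heapq

def static_cyclic_makespan(costs, threads):
    """Finish time of the slowest thread under a fixed cyclic assignment."""
    loads = [0] * threads
    for i, cost in enumerate(costs):
        loads[i % threads] += cost
    return max(loads)

def dynamic_makespan(costs, threads):
    """Finish time of the slowest thread when idle threads take the next iteration."""
    free_at = [0] * threads  # min-heap of times at which each thread becomes idle
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

costs = [8, 1, 6, 1, 7, 1, 1, 1]  # hypothetical irregular iteration costs
print(static_cyclic_makespan(costs, 2))  # 22
print(dynamic_makespan(costs, 2))        # 15
```

The total work is 26 units, so 13 is the lower bound for 2 threads. In this example the dynamic model comes much closer to that bound than the static one.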






## Small Loop Iteration Count of a Parallelized Dimension

- Parallelization in an Appropriate Parallelized Dimension (Before Improvement)
- Parallelization in an Appropriate Parallelized Dimension (Optimization Control Line Tuning)
- Parallelization in an Appropriate Parallelized Dimension (Compiler Options Tuning)

#### Parallelization in an Appropriate Parallelized Dimension (Before Improvement)

If the loop iteration count of the parallelized dimension is small and unknown at compile time, a load imbalance occurs when the iteration count is smaller than the number of threads (16 threads in this example), because some threads receive no work. Consequently, synchronous waiting time between threads occurs frequently.
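
The cost of a short parallelized dimension can be estimated as the fraction of thread time that does useful work. The following Python sketch is a simplified model that assumes every iteration costs the same and that iterations are spread as evenly as possible over the threads. The function name is illustrative.

```python
import math

def utilization(iterations, threads=16):
    """Fraction of thread time spent working when equal-cost iterations
    are spread as evenly as possible over the threads."""
    if iterations == 0:
        return 0.0
    rounds = math.ceil(iterations / threads)  # iterations done by the busiest thread
    return iterations / (rounds * threads)

print(utilization(4))   # 0.25: 12 of the 16 threads have nothing to do
print(utilization(16))  # 1.0
print(utilization(20))  # 0.625: 4 threads run a second iteration while 12 wait
```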



#### Before improvement


#### Parallelization in an Appropriate Parallelized Dimension (Optimization Control Line Tuning)

Specifying the SERIAL and PARALLEL specifiers in optimization control lines parallelized the loop nest in an appropriate dimension and improved the load balance.




#### Parallelization in an Appropriate Parallelized Dimension (Compiler Options Tuning)

With the compiler option -Kdynamic\_iteration specified, an appropriate parallelized dimension was selected automatically at execution time, and the load balance improved.
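
The idea of choosing the parallelized dimension at execution time can be illustrated with a small Python sketch. The selection rule below is an assumption made for illustration only. The rule the compiler actually applies is not described here. The sketch picks the loop of a nest that keeps the most threads busy and prefers the outer loop on a tie.

```python
import math

def utilization(iterations, threads):
    """Fraction of thread time spent working for equal-cost iterations."""
    if iterations == 0:
        return 0.0
    return iterations / (math.ceil(iterations / threads) * threads)

def pick_parallel_dim(extents, threads=16):
    """Index of the loop to parallelize. `extents` lists iteration counts
    from the outermost loop inward. Hypothetical heuristic: best
    utilization first, outermost loop on a tie."""
    return max(range(len(extents)),
               key=lambda d: (utilization(extents[d], threads), -d))

print(pick_parallel_dim((2, 1600)))   # 1: the outer loop would leave 14 threads idle
print(pick_parallel_dim((64, 1600)))  # 0: both fill 16 threads, so the outer loop wins
```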






# Usage Taking SSL2 Library Performance into Account (DGEMM)

- DGEMM Parameters
- DGEMM Parameters Appropriate to the FX100

## DGEMM Parameters



The following is a list of parameters for calling DGEMM.

- C := ALPHA x op(A) x op(B) + BETA x C
  - DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

| Argument       | Meaning                                                                                                           |
|----------------|-------------------------------------------------------------------------------------------------------------------|
| TRANSA, TRANSB | They specify 'N' (do not transpose), 'T' (transpose), or 'C' (conjugate transpose) for op(A) and op(B), respectively. |
| M, N, K        | Integers indicating the matrix sizes                                                                              |
| ALPHA, BETA    | Scalar values used in the operation                                                                               |
| A, B, C        | op(A): M x K matrix<br>op(B): K x N matrix<br>C: M x N matrix                                                     |
| LDA, LDB, LDC  | They specify the size of the 1st dimension of arrays A, B, and C, respectively.                                   |
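
The operation that these parameters describe can be written out as a plain reference computation. The following Python sketch is not the SSL2 interface: it stores matrices as nested row lists, so it has no LDA, LDB, or LDC arguments, and it treats 'C' the same as 'T' because the matrices are real. The function names are illustrative.

```python
def op(x, trans):
    """Return x for 'N', or its transpose for 'T' or 'C' (real matrices)."""
    if trans.upper() == "N":
        return x
    return [list(col) for col in zip(*x)]

def dgemm_ref(transa, transb, m, n, k, alpha, a, b, beta, c):
    """Reference for C := ALPHA x op(A) x op(B) + BETA x C on nested lists."""
    opa, opb = op(a, transa), op(b, transb)
    return [[alpha * sum(opa[i][p] * opb[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(n)]
            for i in range(m)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = dgemm_ref("N", "N", 2, 2, 2, 2.0, a, b, 0.5, c)
print(result)  # [[38.5, 44.5], [86.5, 100.5]]
```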

## DGEMM Parameters Appropriate to the FX100

- The recommended number of processes in a node is 2.
  - Performance is good with 16 or 32 threads, which enable utilization of a sector cache. (This is because the sector cache can effectively use L2\$.)
- We recommend that M, N, and K be as large as possible. This reduces the overhead of the matrix copying done internally and the impact of the remainder left over after division into processing units. If they cannot be made larger, try to improve efficiency as described below.
  - M should be a multiple of 32.
    - This is because the DGEMM kernel is optimized for processing in units of 32 elements (SIMD width of 4 x 8 registers), combining 8 SIMD registers in the M direction.
  - N should be a multiple of 64 (when there are 16 threads).
    - This is because the DGEMM kernel is optimized for processing in units of 4 columns in the N direction.
    - If dividing N by the number of threads gives a per-thread size that is a multiple of 4, the kernel is always used efficiently. If there is a remainder, efficiency decreases slightly.
  - K should be an even number.
    - This is because the DGEMM kernel is optimized for processing in units of 2 elements in the K direction.
- We recommend avoiding multiples of 2048 for LDA, LDB, and LDC.
  - This is because a multiple of 2048 may cause L1D\$ thrashing.
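
The recommendations above can be collected into a simple pre-call check. The following Python sketch is a hypothetical helper, not part of SSL2: it only tests the divisibility rules listed above and returns a message for each rule that a given set of sizes does not meet.

```python
def check_dgemm_shape(m, n, k, lda, ldb, ldc, threads=16):
    """Return warnings for sizes that do not follow the FX100 recommendations."""
    warnings = []
    if m % 32 != 0:
        warnings.append("M is not a multiple of 32")
    if n % (4 * threads) != 0:
        warnings.append(f"N is not a multiple of {4 * threads}")
    if k % 2 != 0:
        warnings.append("K is not an even number")
    for name, ld in (("LDA", lda), ("LDB", ldb), ("LDC", ldc)):
        if ld % 2048 == 0:
            warnings.append(f"{name} is a multiple of 2048")
    return warnings

# 4096 x 4096 matrices stored with a leading dimension of exactly 4096.
print(check_dgemm_shape(4096, 4096, 4096, 4096, 4096, 4096))
# Same matrices with the leading dimension padded to 4104 (an illustrative value).
print(check_dgemm_shape(4096, 4096, 4096, 4104, 4104, 4104))  # []
```

The first call reports all three leading dimensions. Padding the first dimension of the arrays, as in the second call, is one way to avoid a multiple of 2048 without changing M, N, or K.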

## **Revision History**



| Version | Date           | Revised section | Details           |
|---------|----------------|-----------------|-------------------|
| 2.0     | April 25, 2016 | -               | - First published |
