# **Data Hazards & Dynamic Instruction Scheduling (I)**

Hung-Wei Tseng



# **Recap: Pipelining**

| Instruction        | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   |
|--------------------|----|----|----|-----|-----|-----|-----|-----|
| `add x1, x2, x3`   | IF | ID | EX | MEM | WB  |     |     |     |
| `ld x4, 0(x5)`     |    | IF | ID | EX  | MEM | WB  |     |     |
| `sub x6, x7, x8`   |    |    | IF | ID  | EX  | MEM | WB  |     |
| `sub x9, x10, x11` |    |    |    | IF  | ID  | EX  | MEM | WB  |
| `sd x1, 0(x12)`    |    |    |    |     | IF  | ID  | EX  | MEM |
| `xor x13, x14, x15`|    |    |    |     |     | IF  | ID  | EX  |
| `and x16, x17, x18`|    |    |    |     |     |     | IF  | ID  |
| `add x19, x20, x21`|    |    |    |     |     |     |     | IF  |

From cycle 5 onward, we are completing one instruction each cycle. The remaining instructions (`sub x22, x23, x24`, `ld x25, 4(x26)`, `sd x27, 0(x28)`) follow the same pattern.



## **Recap: Three pipeline hazards**

- Structural hazards: resource conflicts mean the hardware cannot support simultaneous execution of the instructions in the pipeline
- Control hazards: the PC can be changed by an instruction in the pipeline
- Data hazards: an instruction depends on a result that has not yet been generated or propagated when the instruction needs it



# **Recap: Tips for drawing a pipeline diagram**

- Each instruction has to go through all 5 pipeline stages (IF, ID, EXE, MEM, WB) in order. This is only valid for a single-issue, 5-stage RISC-V pipeline.
- An instruction can enter the next pipeline stage in the next cycle if
  - No other instruction is occupying the next stage
  - This instruction has completed its own work in the current stage
  - The next stage has all its inputs ready and can retrieve those inputs
- Fetch a new instruction only if
  - We know the next PC to fetch, or
  - We can predict the next PC
- Flush an instruction if the branch resolution says it was mis-predicted.
- Review your undergraduate architecture materials

## **Recap: addressing hazards**

- Structural hazards
  - Stall
  - Modify hardware design
- Control hazards
  - Stall
  - Static prediction
  - Dynamic prediction



## **Recap: 2-bit/Bimodal local predictor**

- Local predictor: every branch instruction has its own state
- 2-bit: each state is described using 2 bits
- Change the state based on the actual outcome
- If we guess right: no penalty
- If we guess wrong: flush (clear the pipeline registers of) the mis-predicted instructions that are currently in the IF and ID stages, and reset the PC
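The state update described above can be sketched as a 2-bit saturating counter. This is a minimal illustration, not the slide's exact design: the state encoding (0 and 1 predict not taken, 2 and 3 predict taken) and the names `counter2_t`, `predict_taken`, and `update_counter` are assumptions introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding (not spelled out in the list above):
 * 0 = strongly not taken, 1 = weakly not taken,
 * 2 = weakly taken,       3 = strongly taken. */
typedef uint8_t counter2_t;

/* Predict taken when the counter is in one of the two "taken" states. */
bool predict_taken(counter2_t state) {
    return state >= 2;
}

/* Move one step toward the actual outcome, saturating at 0 and 3. */
counter2_t update_counter(counter2_t state, bool taken) {
    if (taken) {
        return (counter2_t)(state < 3 ? state + 1 : 3);
    }
    return (counter2_t)(state > 0 ? state - 1 : 0);
}
```

With this encoding, a branch in state 3 has to be wrong twice in a row before its prediction flips, which is why a loop branch costs only one mis-prediction per loop exit.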



| branch PC | target PC | State | Note          |
|-----------|-----------|-------|---------------|
| 0x400048  | 0x400032  | 10    |               |
| 0x400080  | 0x400068  | 11    | Predict Taken |
| 0x401080  | 0x401100  | 00    |               |
| 0x4000F8  | 0x400100  | 01    |               |















Local history predictor (figure): each branch PC has its own local history.

| branch PC | local history |
|-----------|---------------|
| 0x400048  | 1000          |
| 0x400080  | 0110          |
| 0x401080  | 1010          |
| 0x4000F8  | 0110          |

Prediction shown in the figure: Predict Taken



## **Recap: Mapping Branch Prediction to NN (cont.)**

- Inputs (x's) are from branch history and are -1 or +1
- n + 1 small integer weights (w's) are learned by on-line training
- Output (y) is the dot product of x's and w's; predict taken if y ≥ 0
- Training finds correlations between history and outcome

$$y = w_0 + \sum_{i=1}^{n} x_i w_i$$
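The formula can be sketched in C as follows. The output and prediction follow the bullets above; the training rule (update on a mis-prediction or when |y| is at most a threshold) is the commonly used perceptron-predictor rule and is an assumption here, since the slide only says the weights are learned on-line. `HISTORY_LEN`, `THRESHOLD`, and all function names are hypothetical, and weight saturation is omitted.

```c
#include <stdbool.h>
#include <stdlib.h>

#define HISTORY_LEN 8   /* n: hypothetical history length */
#define THRESHOLD   30  /* hypothetical training threshold */

/* y = w[0] + sum over i = 1..n of x[i] * w[i], with each x[i] equal to -1 or +1.
 * x[0] is unused so that the indices match the formula. */
int perceptron_output(const int w[HISTORY_LEN + 1], const int x[HISTORY_LEN + 1]) {
    int y = w[0];
    for (int i = 1; i <= HISTORY_LEN; i++) {
        y += x[i] * w[i];
    }
    return y;
}

/* Predict taken when the output is non-negative. */
bool perceptron_predict(int y) {
    return y >= 0;
}

/* Assumed on-line training rule: adjust the weights when the prediction was
 * wrong or the output was not confident enough. t is +1 if taken, else -1. */
void perceptron_train(int w[HISTORY_LEN + 1], const int x[HISTORY_LEN + 1],
                      int y, bool taken) {
    int t = taken ? 1 : -1;
    if (perceptron_predict(y) != taken || abs(y) <= THRESHOLD) {
        w[0] += t;
        for (int i = 1; i <= HISTORY_LEN; i++) {
            w[i] += t * x[i];
        }
    }
}
```

A weight grows when its history bit agrees with the outcome and shrinks when it disagrees, which is how training picks up correlations between history and outcome.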

## **Four implementations**

• Which of the following implementations will perform the best on modern pipeline processors?

```
// Version A
inline int popcount(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1;
        x = x >> 1;
    }
    return c;
}

// Version B
inline int popcount(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1;
        x = x >> 1;
        c += x & 1;
        x = x >> 1;
        c += x & 1;
        x = x >> 1;
        c += x & 1;
        x = x >> 1;
    }
    return c;
}

// Version C
inline int popcount(uint64_t x) {
    int c = 0;
    int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    while (x) {
        c += table[(x & 0xF)];
        x = x >> 4;
    }
    return c;
}

// Version D
inline int popcount(uint64_t x) {
    int c = 0;
    int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for (uint64_t i = 0; i < 16; i++) {
        c += table[(x & 0xF)];
        x = x >> 4;
    }
    return c;
}
```

## Why is B better than A?

- How many of the following statements explain why B outperforms A with compiler optimizations?
  - ① B has lower dynamic instruction count than A
  - ② B has significantly lower branch mis-prediction rate than A
  - ③ B has significantly fewer branch instructions than A
  - ④ B can incur fewer data memory accesses

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4


## Why is B better than A?

The loop in B, as compiled:

- `and x2, x1, 1`
- `shr x4, x1, 1`
- `shr x5, x1, 2`
- `shr x6, x1, 3`
- `shr x1, x1, 4`
- `and x7, x4, 1`
- `and x8, x5, 1`
- `and x9, x6, 1`
- `add x3, x3, x2`
- `add x3, x3, x7`
- `add x3, x3, x8`
- `add x3, x3, x9`
- `bne x1, x0, LOOP`

Only one branch for every four iterations of A.


# Why is C better than B?

- How many of the following statements explain why C outperforms B with compiler optimizations?
  - ① C has lower dynamic instruction count than B
  - ② C has significantly lower branch mis-prediction rate than B
  - ③ C has significantly fewer branch instructions than B
  - ④ C can incur fewer data memory accesses

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4


# Why is C better than B?

- How many of the following statements explain why C outperforms B with compiler optimizations?
  - ① C has lower dynamic instruction count than B: C only needs one load, one add, and one shift, with the same number of iterations
  - ② C has significantly lower branch mis-prediction rate than B
  - ③ C has significantly fewer branch instructions than B: the same number of branches
  - ④ C can incur fewer data memory accesses: probably not. In fact, the load may have a negative effect without architectural support


## Why is D better than C?

- How many of the following statements explain the main reason why D outperforms C with compiler optimizations?
  - ① D has lower dynamic instruction count than C
  - ② D has significantly lower branch mis-prediction rate than C
  - ③ D has significantly fewer branch instructions than C
  - ④ D can incur fewer memory accesses than C

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4


# Why is D better than C?

- How many of the following statements explain the main reason why D outperforms C with compiler optimizations?
  - ① D has lower dynamic instruction count than C
  - ② D has significantly lower branch mis-prediction rate than C
  - ③ D has significantly fewer branch instructions than C
  - ④ D can incur fewer memory accesses than C
- The compiler can do loop unrolling on the fixed-count `for` loop: no branches
- The branches could perhaps be eliminated through loop unrolling

## All branches are gone with loop unrolling

```
inline int popcount(uint64_t x) {
    int c = 0;
    int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    c += table[(x & 0xF)];
    x = x >> 4;
    return c;
}
```
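Before comparing performance, it helps to confirm that the four versions compute the same result. The sketch below restates them side by side; the names `popcount_a` through `popcount_d`, `nibble_bits`, and `popcount_versions_agree` are hypothetical labels introduced here, and the shared `static const` table is a simplification of the per-function `table` array in the slides.

```c
#include <stdint.h>

/* Number of set bits in each 4-bit value (same contents as `table`). */
static const int nibble_bits[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                    1, 2, 2, 3, 2, 3, 3, 4};

/* Version A: one bit and one branch per iteration. */
int popcount_a(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1;
        x >>= 1;
    }
    return c;
}

/* Version B: four bits per iteration, so one branch per four bits. */
int popcount_b(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
    }
    return c;
}

/* Version C: table lookup, data-dependent trip count. */
int popcount_c(uint64_t x) {
    int c = 0;
    while (x) {
        c += nibble_bits[x & 0xF];
        x >>= 4;
    }
    return c;
}

/* Version D: table lookup, fixed trip count of 16 (easy to unroll). */
int popcount_d(uint64_t x) {
    int c = 0;
    for (int i = 0; i < 16; i++) {
        c += nibble_bits[x & 0xF];
        x >>= 4;
    }
    return c;
}

/* Returns 1 when all four versions agree on x, 0 otherwise. */
int popcount_versions_agree(uint64_t x) {
    int a = popcount_a(x);
    return a == popcount_b(x) && a == popcount_c(x) && a == popcount_d(x);
}
```

Only version D has a trip count that does not depend on `x`, which is what lets a compiler remove the loop branch entirely.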





- Data hazards
- Tomasulo's algorithm

## **Recap: Which swap is faster?**



Both versions A and B swap the content pointed to by a and b correctly. Which version of code would have better performance?

- A. Version A
- B. Version B
- C. They are about the same (sometimes A is faster, sometimes B is)



# Data hazards

## **Data hazards**

- An instruction currently in the pipeline cannot receive the "logically" correct value for execution
- Data dependencies
  - The output of an instruction is the input of a later instruction
  - May result in data hazard if the later instruction that consumes the result is still in the pipeline
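One way to make "the output of an instruction is the input of a later instruction" concrete is to count producer/consumer pairs over register numbers. The sketch below is an illustration under assumptions: `insn_t`, `last_writer`, and `count_raw_pairs` are hypothetical names, register 0 stands for both `x0` and "no register", and only register dependences are tracked (not memory).

```c
#include <stddef.h>

/* Hypothetical minimal instruction record: register numbers only.
 * Register 0 (x0) is hard-wired to zero in RISC-V, so it never carries
 * a dependence; here it also doubles as "no register". */
typedef struct {
    int rd;   /* destination register, 0 if none (e.g. sd, bne) */
    int rs1;  /* first source register */
    int rs2;  /* second source register */
} insn_t;

/* Index of the most recent instruction before `consumer` that writes `reg`,
 * or -1 if there is none. */
int last_writer(const insn_t *prog, size_t consumer, int reg) {
    if (reg == 0) {
        return -1;
    }
    for (size_t i = consumer; i-- > 0;) {
        if (prog[i].rd == reg) {
            return (int)i;
        }
    }
    return -1;
}

/* Count (producer, consumer) pairs where the consumer reads a register
 * whose most recent writer is the producer. */
int count_raw_pairs(const insn_t *prog, size_t n) {
    int pairs = 0;
    for (size_t j = 0; j < n; j++) {
        int p1 = last_writer(prog, j, prog[j].rs1);
        int p2 = last_writer(prog, j, prog[j].rs2);
        if (p1 >= 0) {
            pairs++;
        }
        if (p2 >= 0 && p2 != p1) {
            pairs++;  /* do not count the same producer twice */
        }
    }
    return pairs;
}
```

Whether a counted pair actually becomes a hazard depends on how far apart the two instructions are in the pipeline, which is the second bullet above.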

## How many dependencies do we have?

• How many pairs of data dependences are there in the following RISC-V instructions?

- `ld X6, 0(X10)`
- `ld X7, 0(X11)`
- `add X8, X6, X0`
- `add X6, X7, X0`
- `add X7, X8, X0`
- `sd X6, 0(X10)`
- `sd X7, 0(X11)`

C source: `int temp = *a; *a = *b; *b = temp;`

- A. 1
- B. 2
- C. 3
- D. 4
- E. 5






## Solution 1: Let's try "stall" again

Whenever the input is not ready when the consumer is decoding, just stall: the consumer stays at ID until the input is ready.
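The stall policy can be sketched as a small cycle counter. This is a rough model under stated assumptions, not the course's reference: single-issue, in-order, 5 stages, no forwarding, and a register file that is written in WB and can be read by an instruction in ID during that same cycle. `insn_t`, `MAX_INSNS`, and `stall_cycles_no_forwarding` are hypothetical names; register 0 means `x0` or "no register".

```c
#include <stddef.h>

#define MAX_INSNS 64  /* hypothetical capacity of this sketch */

typedef struct {
    int rd;   /* destination register, 0 if none */
    int rs1;  /* first source register */
    int rs2;  /* second source register */
} insn_t;

/* Returns the total number of stall cycles, or -1 if n exceeds MAX_INSNS.
 * id_cycle[j] is the cycle in which instruction j finishes decoding.
 * With IF of the first instruction in cycle 1, an unstalled instruction j
 * (0-based) decodes in cycle j + 2, and WB happens 3 cycles after ID. */
int stall_cycles_no_forwarding(const insn_t *prog, size_t n) {
    int id_cycle[MAX_INSNS];
    if (n == 0) {
        return 0;
    }
    if (n > MAX_INSNS) {
        return -1;
    }
    for (size_t j = 0; j < n; j++) {
        /* In-order: at best one cycle after the previous instruction. */
        int earliest = (j == 0) ? 2 : id_cycle[j - 1] + 1;
        const int srcs[2] = {prog[j].rs1, prog[j].rs2};
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == 0) {
                continue;
            }
            /* Find the most recent earlier writer of this source. */
            for (size_t i = j; i-- > 0;) {
                if (prog[i].rd == srcs[s]) {
                    int producer_wb = id_cycle[i] + 3;
                    if (producer_wb > earliest) {
                        earliest = producer_wb;  /* stay in ID until then */
                    }
                    break;
                }
            }
        }
        id_cycle[j] = earliest;
    }
    /* Without stalls the last instruction would decode in cycle n + 1. */
    return id_cycle[n - 1] - (int)(n + 1);
}
```

Under this model a consumer immediately behind its producer waits two extra cycles, and one that is two instructions behind waits one.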




## How many data hazards?

• How many pairs of instructions in the following RISC-V instruction sequence will result in data hazards/stalls in a basic 5-stage RISC-V pipeline?

- `ld X6, 0(X10)`
- `ld X7, 0(X11)`
- `add X8, X6, X0`
- `add X6, X7, X0`
- `add X7, X8, X0`
- `sd X6, 0(X10)`
- `sd X7, 0(X11)`

- A. 1
- B. 2
- C. 3
- D. 4
- E. 5










## **Solution 2: Data forwarding**

- Add logic and wires to forward the desired values to the instructions that need them
- In our five-stage pipeline, forward if the instruction entering the EXE stage consumes a result from a previous instruction that is entering the MEM or WB stage
  - A source of the instruction entering the EXE stage is the destination of an instruction entering the MEM/WB stage
  - The previous instruction must be an instruction that updates the register file
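The two conditions above map onto a small selection function. This is a sketch under assumptions: `stage_info_t`, `forward_t`, and `select_forward` are hypothetical names, and giving priority to the instruction entering MEM (the more recent producer) is the usual convention rather than something the list states.

```c
#include <stdbool.h>

typedef enum {
    FWD_NONE,           /* read the value from the register file */
    FWD_FROM_MEM_STAGE, /* take the result of the instruction entering MEM */
    FWD_FROM_WB_STAGE   /* take the result of the instruction entering WB */
} forward_t;

/* Hypothetical summary of an instruction further down the pipeline. */
typedef struct {
    bool writes_reg;  /* true if it updates the register file */
    int  rd;          /* its destination register */
} stage_info_t;

/* Choose where one source register of the instruction entering EXE
 * should come from. */
forward_t select_forward(int rs, stage_info_t in_mem, stage_info_t in_wb) {
    if (rs == 0) {
        return FWD_NONE;  /* x0 is always zero, nothing to forward */
    }
    if (in_mem.writes_reg && in_mem.rd == rs) {
        return FWD_FROM_MEM_STAGE;  /* most recent producer wins */
    }
    if (in_wb.writes_reg && in_wb.rd == rs) {
        return FWD_FROM_WB_STAGE;
    }
    return FWD_NONE;
}
```

Note that a load entering MEM has not read its data yet, so a consumer right behind a load still has to stall one cycle even when this function reports a match.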







## How many data hazards w/ data forwarding?

• How many pairs of instructions in the following RISC-V instruction sequence will result in data hazards/stalls in a basic 5-stage RISC-V pipeline with "full" data forwarding?

- `ld X6, 0(X10)`
- `ld X7, 0(X11)`
- `add X8, X6, X0`
- `add X6, X7, X0`
- `add X7, X8, X0`
- `sd X6, 0(X10)`
- `sd X7, 0(X11)`

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4








## How many data hazards w/ data forwarding?

• How many pairs of instructions in the following RISC-V instruction sequence will result in data hazards/stalls in a basic 5-stage RISC-V pipeline with "full" data forwarding?

- `ld X6, 0(X10)`
- `ld X7, 0(X11)`
- `xor X6, X6, X7`
- `xor X7, X7, X6`
- `xor X6, X6, X7`
- `sd X6, 0(X10)`
- `sd X7, 0(X11)`

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4








## The effect of code optimization

- By reordering which pair of instructions in the following instruction stream can we eliminate all stalls without affecting the correctness of the code?
  - ① ld X6,0(X10)
  - ② add X7,X6,X12
  - ③ sd X7,0(X10)
  - ④ addi X10,X10,8
  - ⑤ bne X10,X5,LOOP

- A. (1) & (2)
- B. (2) & (3)
- C. (3) & (4)
- D. (4) & (5)
- E. None of the pairs can be reordered





# If we can predict the future ...

- Consider the following dynamic instructions:
  - ① ld X6,0(X10)
  - ② add X7,X6,X12
  - ③ sd X7,0(X10)
  - ④ addi X10,X10,8
  - ⑤ bne X10,X5,LOOP
  - ⑥ ld X6,0(X10)
  - ⑦ add X7,X6,X12
  - ⑧ sd X7,0(X10)
  - ⑨ addi X10,X10,8
  - ⑩ bne X10,X5,LOOP

Which of the following pairs can we reorder without affecting correctness if the branch prediction is perfect?

- A. (2) and (4)
- B. (3) and (5)
- C. (5) and (6)
- D. (6) and (9)
- E. (9) and (10)




Can we use "branch prediction" to predict the future and reorder instructions across the branch?




# Dynamic instruction scheduling / Out-of-order (OoO) execution

# **Tips for drawing a pipeline diagram**

- Each instruction has to go through all 5 pipeline stages (IF, ID, EXE, MEM, WB) in order. This is only valid for a single-issue, 5-stage RISC-V pipeline.
- An instruction can enter the next pipeline stage in the next cycle if
  - No other instruction is occupying the next stage
  - This instruction has completed its own work in the current stage
  - The next stage has all its inputs ready
- Fetch a new instruction only if
  - We know the next PC to fetch, or
  - We can predict the next PC
- Flush an instruction if the branch resolution says it was mis-predicted.

# What do you need to execute an instruction?

- Whenever the instruction is decoded: put the decoded instruction somewhere
- Whenever the inputs are ready: all data dependencies are resolved
- Whenever the target functional unit is available
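The three conditions above can be sketched as a readiness check over a pool of decoded instructions. Everything here is illustrative: `waiting_insn_t`, `can_execute`, and `pick_ready` are hypothetical names, and picking the oldest ready entry is one possible policy, not something the slide prescribes.

```c
#include <stdbool.h>

/* Hypothetical entry for a decoded instruction waiting to execute. */
typedef struct {
    bool valid;       /* decoded and placed in the waiting pool */
    bool src1_ready;  /* first input available */
    bool src2_ready;  /* second input available */
    int  unit;        /* index of the functional unit it needs */
} waiting_insn_t;

/* An instruction can execute once it is decoded, all inputs are ready,
 * and its functional unit is free. */
bool can_execute(const waiting_insn_t *insn, const bool unit_busy[]) {
    return insn->valid && insn->src1_ready && insn->src2_ready &&
           !unit_busy[insn->unit];
}

/* Return the index of the oldest entry that can execute, or -1 if none.
 * Entries are assumed to be stored oldest first. */
int pick_ready(const waiting_insn_t pool[], int n, const bool unit_busy[]) {
    for (int i = 0; i < n; i++) {
        if (can_execute(&pool[i], unit_busy)) {
            return i;
        }
    }
    return -1;
}
```

If the oldest entries are still waiting on inputs or on a busy unit, a younger entry is picked first, which is exactly out-of-order execution.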

## Scheduling instructions: based on data dependencies

- Draw the data dependency graph: put an arrow if an instruction depends on another.
  - ① ld X6,0(X10)
  - ② add X7,X6,X12
  - ③ sd X7,0(X10)
  - ④ addi X10,X10,8
  - ⑤ bne X10,X5,LOOP
  - ⑥ ld X6,0(X10)
  - ⑦ add X7,X6,X12
  - ⑧ sd X7,0(X10)
  - ⑨ addi X10,X10,8
  - ⑩ bne X10,X5,LOOP
- In theory, instructions without dependencies can be executed in parallel or out-of-order
- Instructions with dependencies can never be reordered
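The dependency graph can be built mechanically from register numbers. The sketch below records an arrow from a consumer to the most recent writer of each register it reads. It is a simplified illustration: `insn_t`, `MAX_INSNS`, and `build_raw_graph` are hypothetical names, register 0 means `x0` or "no register", and dependences through memory or through reuse of a register name are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_INSNS 16  /* hypothetical capacity of this sketch */

typedef struct {
    int rd;   /* destination register, 0 if none */
    int rs1;  /* first source register */
    int rs2;  /* second source register */
} insn_t;

/* Fill edge[j][i] with true when instruction j reads a register whose most
 * recent writer is instruction i. Returns the number of edges.
 * At most MAX_INSNS instructions are considered. */
size_t build_raw_graph(const insn_t *prog, size_t n,
                       bool edge[MAX_INSNS][MAX_INSNS]) {
    size_t edges = 0;
    if (n > MAX_INSNS) {
        n = MAX_INSNS;
    }
    for (size_t j = 0; j < MAX_INSNS; j++) {
        for (size_t i = 0; i < MAX_INSNS; i++) {
            edge[j][i] = false;
        }
    }
    for (size_t j = 0; j < n; j++) {
        const int srcs[2] = {prog[j].rs1, prog[j].rs2};
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == 0) {
                continue;
            }
            /* Walk backwards to the most recent writer of this source. */
            for (size_t i = j; i-- > 0;) {
                if (prog[i].rd == srcs[s]) {
                    if (!edge[j][i]) {
                        edge[j][i] = true;
                        edges++;
                    }
                    break;
                }
            }
        }
    }
    return edges;
}
```

Two instructions with no path between them in this graph are candidates for parallel or out-of-order execution, subject to the dependences this sketch leaves out.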










# **Announcements**

- Assignment #3 is due next Wednesday
- Reading Quiz is due 11/22
- Project is released
  - Please check the website for the link to the GitHub repo
  - You may discuss, but each of you needs an individual/distinguishable version of the code
  - You need to write a brief report
  - Grading rubrics
    - 20% report
    - 20% if your code can compile and run
    - 60% performance based. The sample prefetcher is the baseline. We calculate your score for this part using min(Speedup-1, 1). If you can speed up by 2x, you get full credit for this part
  - Due 11/29, no extension

# Computer Science & Engineering





