## **Dynamic Branch Prediction**

Hung-Wei Tseng



## **Recap: Pipelining**

| Instruction         | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   |
|---------------------|----|----|----|-----|-----|-----|-----|-----|
| `add x1, x2, x3`    | IF | ID | EX | MEM | WB  |     |     |     |
| `ld x4, 0(x5)`      |    | IF | ID | EX  | MEM | WB  |     |     |
| `sub x6, x7, x8`    |    |    | IF | ID  | EX  | MEM | WB  |     |
| `sub x9, x10, x11`  |    |    |    | IF  | ID  | EX  | MEM | WB  |
| `sd x1, 0(x12)`     |    |    |    |     | IF  | ID  | EX  | MEM |
| `xor x13, x14, x15` |    |    |    |     |     | IF  | ID  | EX  |
| `and x16, x17, x18` |    |    |    |     |     |     | IF  | ID  |
| `add x19, x20, x21` |    |    |    |     |     |     |     | IF  |

`sub x22, x23, x24`, `ld x25, 4(x26)`, and `sd x27, 0(x28)` enter the pipeline in the following cycles. Once the pipeline is full (cycle 5 onward), we are completing an instruction each cycle.



## **Recap: Three pipeline hazards**

- Structural hazards: resource conflicts prevent the hardware from executing the instructions in the pipeline simultaneously
- Control hazards: the PC can be changed by an instruction that is still in the pipeline
- Data hazards: an instruction depends on a result that has not yet been generated or propagated when the instruction needs it



## **Recap: Solving Structural Hazards**

- Stalling resolves the conflict, but it is slow
- Improve the design of the pipeline units to allow parallel execution



## **Recap: The impact of control hazards**

Assume an application in which 20% of the instructions are branches and the instruction stream incurs no data hazards. When there is a branch, we disable instruction fetch and insert no-ops until we can determine the PC. What is the average CPI if we execute this program on the 5-stage RISC-V pipeline?

- A. 1
- B. 1.2
- C. 1.4
- D. 1.6
- E. 1.8

| Instruction               | 1  | 2  | 3  | 4  | 5   | 6   | 7   | 8   | 9   | 10  | 11 |
|---------------------------|----|----|----|----|-----|-----|-----|-----|-----|-----|----|
| `add x1, x2, x3`          | IF | ID | EX | MEM | WB |     |     |     |     |     |    |
| `ld x4, 0(x5)`            |    | IF | ID | EX | MEM | WB  |     |     |     |     |    |
| `bne x0, x7, L1`          |    |    | IF | ID | EX  | MEM | WB  |     |     |     |    |
| `add x0, x0, x0` (no-op)  |    |    |    | IF | ID  | EX  | MEM | WB  |     |     |    |
| `add x0, x0, x0` (no-op)  |    |    |    |    | IF  | ID  | EX  | MEM | WB  |     |    |
| `sub x9, x10, x11`        |    |    |    |    |     | IF  | ID  | EX  | MEM | WB  |    |
| `sd x1, 0(x12)`           |    |    |    |    |     |     | IF  | ID  | EX  | MEM | WB |
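As a worked check of the question above, assume each branch costs the two no-op cycles shown in the diagram (this penalty is read off the diagram, not stated in the question). Every instruction takes one base cycle, and 20% of them add two more:

$$
\text{CPI} = 1 + 0.2 \times 2 = 1.4
$$

Under that assumption the result corresponds to option C.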



## **Outline**

- 2-bit local predictor
- 2-level global predictor
- Hybrid predictors
- Branch and coding

# **Dynamic Branch Prediction**

## Why can't we proceed without stalls/no-ops?

- How many of the following statements are true regarding why we have to stall for each branch in the current pipeline processor?
  - ① The target address when the branch is taken is not available for the instruction fetch stage of the next cycle. (You need a cheatsheet for that: the branch target buffer.)
  - ② The target address when the branch is not taken is not available for the instruction fetch stage of the next cycle.
  - ③ The branch outcome cannot be decided until the comparison result of the ALU is out. (You need to predict that from history/states.)
  - ④ The next instruction needs the branch instruction to write back its result.
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

## A basic dynamic branch predictor





## **2-bit/Bimodal local predictor**

- Local predictor: every branch instruction has its own state
- 2-bit: each state is described using 2 bits
- Change the state based on the actual outcome
- If we guess right: no penalty
- If we guess wrong: flush (clear the pipeline registers of) the mis-predicted instructions that are currently in the IF and ID stages, and reset the PC



| branch PC | target PC | State |
|-----------|-----------|-------|
| 0x400048  | 0x400032  | 10    |
| 0x400080  | 0x400068  | 11    |
| 0x401080  | 0x401100  | 00    |
| 0x4000F8  | 0x400100  | 01    |

**Predict Taken** when the state is 10 or 11; predict not taken when the state is 00 or 01.
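The state update can be sketched as a 2-bit saturating counter. This is a minimal sketch, not the slides' own code: the transition rule is inferred from the traces later in these slides (for example 00 → 01 → 10 → 11 on repeated taken outcomes), and the names `predict_taken` and `update_state` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2-bit state: 00/01 predict not taken, 10/11 predict taken. */
typedef uint8_t counter2_t;

/* The prediction is the high bit of the state. */
bool predict_taken(counter2_t state) {
    return state >= 2;
}

/* Assumed rule: move one step toward 11 when the branch is taken,
   one step toward 00 when it is not, saturating at both ends. */
counter2_t update_state(counter2_t state, bool taken) {
    if (taken)
        return (counter2_t)(state < 3 ? state + 1 : 3);
    return (counter2_t)(state > 0 ? state - 1 : 0);
}
```

With this rule, a single surprise outcome in state 11 only moves the counter to 10, so the next prediction is still "taken".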





## **2-bit local predictor**



| state | predict | actual |
|-------|---------|--------|
| 10    | T       | T      |
| 11    | T       | NT     |

**90% accuracy!**

## **2-bit local predictor**

- What's the overall branch prediction accuracy (including both branches) for this loop?

```
i = 0;
do {
    if (i % 2 != 0)       // Branch X, taken if i % 2 == 0
        a[i] *= 2;
    a[i] += i;
} while (++i < 100);      // Branch Y
```

(assume all states start at 00)

- A. ~25%
- B. ~33%
- C. ~50%
- D. ~67%
- E. ~75%

For branch Y, almost 100%. For branch X, only 50%. Can we do a better job?

| branch? | state | prediction | actual |
|---------|-------|------------|--------|
| X       | 00    | NT         | T      |
| Y       | 00    | NT         | T      |
| X       | 01    | NT         | NT     |
| Y       | 01    | NT         | T      |
| X       | 00    | NT         | T      |
| Y       | 10    | T          | T      |
| X       | 01    | NT         | NT     |
| Y       | 11    | T          | T      |
| X       | 00    | NT         | T      |
| Y       | 11    | T          | T      |
| X       | 01    | NT         | NT     |
| Y       | 11    | T          | T      |
| X       | 00    | NT         | T      |
| Y       | 11    | T          | T      |
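The trace above can be reproduced with a short simulation. This is a sketch under stated assumptions: each branch has its own 2-bit saturating counter starting at 00, branch X is taken when `i % 2 == 0`, and branch Y is taken on every iteration except the last. The names `simulate_local` and `loop_accuracy` are hypothetical.

```c
#include <stdbool.h>

/* Correct-prediction counts for the two branches of the loop. */
struct loop_accuracy {
    int x_correct;   /* correct predictions for branch X */
    int y_correct;   /* correct predictions for branch Y */
    int per_branch;  /* dynamic executions of each branch */
};

/* 2-bit saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken. */
static int step(int state, bool taken) {
    if (taken) return state < 3 ? state + 1 : 3;
    return state > 0 ? state - 1 : 0;
}

struct loop_accuracy simulate_local(int iterations) {
    struct loop_accuracy r = {0, 0, iterations};
    int state_x = 0, state_y = 0;             /* all states start at 00 */
    for (int i = 0; i < iterations; i++) {
        bool x_taken = (i % 2 == 0);          /* branch X skips a[i] *= 2 */
        bool y_taken = (i + 1 < iterations);  /* branch Y jumps back to the loop top */
        r.x_correct += ((state_x >= 2) == x_taken);
        state_x = step(state_x, x_taken);
        r.y_correct += ((state_y >= 2) == y_taken);
        state_y = step(state_y, y_taken);
    }
    return r;
}
```

For 100 iterations this model gives 50 correct predictions for X and 97 for Y, that is 147 of 200 (73.5%), which matches the "only 50%" and "almost 100%" annotations.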

# **Two-level global predictor**

Reading: Scott McFarling. Combining Branch Predictors. Technical report WRL-TN-36, 1993.

## **2-bit local predictor**

- What's the overall branch prediction accuracy (including both branches) for this loop?

```
i = 0;
do {
    if (i % 2 != 0)       // Branch X, taken if i % 2 == 0
        a[i] *= 2;
    a[i] += i;
} while (++i < 100);      // Branch Y
```

(assume all states start at 00)

- A. ~25%
- B. ~33%
- C. ~50%
- D. ~67%
- E. ~75%

For branch Y, almost 100%. For branch X, only 50%. This pattern repeats all the time.

| branch? | state | prediction | actual |
|---------|-------|------------|--------|
| X       | 00    | NT         | T      |
| Y       | 00    | NT         | T      |
| X       | 01    | NT         | NT     |
| Y       | 01    | NT         | T      |
| X       | 00    | NT         | T      |
| Y       | 10    | T          | T      |
| X       | 01    | NT         | NT     |
| Y       | 11    | T          | T      |
| X       | 00    | NT         | T      |
| Y       | 11    | T          | T      |
| X       | 01    | NT         | NT     |
| Y       | 11    | T          | T      |
| X       | 00    | NT         | T      |
| Y       | 11    | T          | T      |







## **Performance of GH predictor**





| i  | branch? | GHR | state | prediction | actual |
|----|---------|-----|-------|------------|--------|
| 0  | X       | 000 | 00    | NT         | T      |
| 0  | Y       | 001 | 00    | NT         | T      |
| 1  | X       | 011 | 00    | NT         | NT     |
| 1  | Y       | 110 | 00    | NT         | T      |
| 2  | X       | 101 | 00    | NT         | T      |
| 2  | Y       | 011 | 00    | NT         | T      |
| 3  | X       | 111 | 00    | NT         | NT     |
| 3  | Y       | 110 | 01    | NT         | T      |
| 4  | X       | 101 | 01    | NT         | T      |
| 4  | Y       | 011 | 01    | NT         | T      |
| 5  | X       | 111 | 00    | NT         | NT     |
| 5  | Y       | 110 | 10    | T          | T      |
| 6  | X       | 101 | 10    | T          | T      |
| 6  | Y       | 011 | 10    | T          | T      |
| 7  | X       | 111 | 00    | NT         | NT     |
| 7  | Y       | 110 | 11    | T          | T      |
| 8  | X       | 101 | 11    | T          | T      |
| 8  | Y       | 011 | 11    | T          | T      |
| 9  | X       | 111 | 00    | NT         | NT     |
| 9  | Y       | 110 | 11    | T          | T      |
| 10 | X       | 101 | 11    | T          | T      |
| 10 | Y       | 011 | 11    | T          | T      |
## **Better predictor?**

Consider two predictors: (L) a 2-bit local predictor with unlimited BTB entries and (G) a 4-bit global history with 2-bit predictors. How many of the following code snippets would allow (G) to outperform (L)?

- Snippet (only a fragment survives): `i = 0; do { if (i % 10 != 0) a[i] *= 2; a[i] += i; } while (++i < …`
- Annotation on two of the snippets: about the same







# Hybrid predictors

## gshare predictor








- Lets the predictor identify the branch by its address while also using the global history for a more accurate prediction
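A common way to combine the two sources of information, following the gshare scheme described in the McFarling report, is to XOR the branch address with the global history and use the result as the counter-table index. The sketch below assumes a hypothetical table of 2^10 counters and 4-byte aligned instructions; `gshare_index` and `GSHARE_BITS` are illustrative names.

```c
#include <stdint.h>

#define GSHARE_BITS 10   /* hypothetical table size: 2^10 two-bit counters */

/* Index into the counter table: XOR the branch address with the global history. */
uint32_t gshare_index(uint64_t branch_pc, uint32_t ghr) {
    uint32_t mask = (1u << GSHARE_BITS) - 1;
    /* Assumption: instructions are 4-byte aligned, so the two low PC bits
       carry no information and are dropped before hashing. */
    return ((uint32_t)(branch_pc >> 2) ^ ghr) & mask;
}
```

The same branch under different histories, or different branches under the same history, usually land on different counters, which is what lets the table track both at once.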




## Local History Predictor

| branch PC | local history |
|-----------|---------------|
| 0x400048  | 1000          |
| 0x400080  | 0110          |
| 0x401080  | 1010          |
| 0x4000F8  | 0110          |

**Predict Taken**

## **Tournament Predictor**

- The state predicts which predictor is better for this branch
  - Local history
  - Global history
- The selected predictor then makes the prediction
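The selection step can be sketched with a 2-bit chooser counter. The encoding here (0 and 1 prefer the local predictor, 2 and 3 prefer the global one) and the rule of training only when the two predictors disagree are assumptions for illustration, and `tournament_predict` and `tournament_update` are hypothetical names.

```c
#include <stdbool.h>

/* Pick the prediction of whichever predictor the chooser currently trusts.
   Assumed encoding: 0 and 1 prefer local, 2 and 3 prefer global. */
bool tournament_predict(int chooser, bool local_pred, bool global_pred) {
    return chooser >= 2 ? global_pred : local_pred;
}

/* Train the chooser only when the two predictors disagree:
   move toward global if global was right, toward local otherwise. */
int tournament_update(int chooser, bool local_pred, bool global_pred, bool taken) {
    if (local_pred == global_pred)
        return chooser;                           /* no evidence either way */
    if (global_pred == taken)
        return chooser < 3 ? chooser + 1 : 3;     /* global was right */
    return chooser > 0 ? chooser - 1 : 0;         /* local was right */
}
```

When both predictors agree, the outcome says nothing about which one is better, so the chooser is left unchanged.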



## **Branch predictor in processors**

- The Intel Pentium MMX, Pentium II, and Pentium III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump.
- Global branch prediction is used in Intel Pentium M, Core, Core 2, and Silvermont-based Atom processors.
- Tournament predictors are used in DEC Alpha and AMD Athlon processors.
- The AMD Ryzen multi-core processor's Infinity Fabric and the Samsung Exynos processor include a perceptron-based neural branch predictor.



## **Branch and programming**

## **Demo revisited**

- Why does sorting the array speed up the code despite the increased instruction count?

```
if (option)
    std::sort(data, data + arraySize);     // optional: sort before the timed loops
for (unsigned i = 0; i < 100000; ++i) {
    int threshold = std::rand();
    for (unsigned j = 0; j < arraySize; ++j) {
        if (data[j] >= threshold)          // data-dependent branch
            sum++;
    }
}
```
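One plausible reading of the demo, consistent with the predictors in these slides, is that sorting makes the `data[i] >= threshold` branch predictable. The sketch below counts mispredictions of that one branch with a single 2-bit saturating counter, before and after sorting. It is a simplified model in C, not the demo itself: the fixed pseudo-random generator and the names `fill_pseudo_random`, `sort_ascending`, and `count_mispredictions` are assumptions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of ints. */
int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Fill data with a fixed pseudo-random sequence in [0, 255]
   (hypothetical stand-in for the demo's input data). */
void fill_pseudo_random(int *data, int n) {
    unsigned seed = 12345u;
    for (int i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;   /* simple linear congruential step */
        data[i] = (int)((seed >> 16) & 0xFF);
    }
}

void sort_ascending(int *data, int n) {
    qsort(data, (size_t)n, sizeof data[0], compare_ints);
}

/* Count mispredictions of the branch `data[i] >= threshold` over one pass,
   using one 2-bit saturating counter that starts at 00. */
int count_mispredictions(const int *data, int n, int threshold) {
    int state = 0, wrong = 0;
    for (int i = 0; i < n; i++) {
        bool taken = data[i] >= threshold;
        if ((state >= 2) != taken) wrong++;
        if (taken) state = state < 3 ? state + 1 : 3;
        else       state = state > 0 ? state - 1 : 0;
    }
    return wrong;
}
```

On sorted data the outcome flips at most once per pass (a run of not-taken followed by a run of taken), so this model mispredicts at most twice per pass; on unsorted data the outcome changes unpredictably from element to element.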

## **Demo: Popcount**

- The population count (or popcount) of a value is the number of set bits (i.e., bits equal to 1) in that value.
- Applications
  - Parity bits in error correction/detection code
  - Cryptography
  - Sparse matrix
  - Molecular Fingerprinting
  - Implementation of some succinct data structures like bit vectors and wavelet trees.

## **Demo: pop count**

- Given a 64-bit integer number, find the number of 1s in its binary representation.
- Example 1: Input: 9487. Output: 7. Explanation: 9487's binary representation is 0b10010100001111.

```
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// popcount() and RandLFSR() are provided by the rest of the demo (not shown here).
int main(int argc, char *argv[]) {
     uint64_t key = 0xdeadbeef;
     int count = 1000000000;
     uint64_t sum = 0;
     for (int i = 0; i < count; i++)
         sum += popcount(RandLFSR(key));
     printf("Result: %" PRIu64 "\n", sum);
     return sum;
}
```

## **Four implementations**

• Which of the following implementations will perform the best on modern pipeline processors?





## Why is B better than A?





Loop body of B (four bits per iteration):

- `and x2, x1, 1`
- `shr x4, x1, 1`
- `shr x5, x1, 2`
- `shr x6, x1, 3`
- `shr x1, x1, 4`
- `and x7, x4, 1`
- `and x8, x5, 1`
- `and x9, x6, 1`
- `add x3, x3, x2`
- `add x3, x3, x7`
- `add x3, x3, x8`
- `add x3, x3, x9`
- `bne x1, x0, LOOP`

B needs only one branch for the work of four iterations in A.

## Why is B better than A?

- How many of the following statements explain why B outperforms A with compiler optimizations?
  - ① B has a lower dynamic instruction count than A
  - ② B has a significantly lower branch mis-prediction rate than A
  - ③ B has significantly fewer branch instructions than A
  - ④ B can incur fewer data hazards
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

Implementation A: `inline int popcount(uint64_t x) { int c = 0; while (x) { c += x & 1; x = x >> 1; } return c; }`

Implementation B: `inline int popcount(uint64_t x) { int c = 0; while (x) { c += x & 1; x = x >> 1; c += x & 1; x = x >> 1; c += x & 1; x = x >> 1; c += x & 1; x = x >> 1; } return c; }`

## Why is C better than B?

- How many of the following statements explain why C outperforms B with compiler optimizations?
  - ① C has a lower dynamic instruction count than B. (C needs only one load, one add, and one shift per iteration, with the same number of iterations.)
  - ② C has a significantly lower branch mis-prediction rate than B.
  - ③ C has significantly fewer branch instructions than B. (No: the same number of branches.)
  - ④ C can incur fewer data hazards. (Probably not. In fact, the load may have a negative effect without architectural support.)

Implementation C: `inline int popcount(uint64_t x) { int c = 0; int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}; while (x) { c += table[(x & 0xF)]; x = x >> 4; } return c; }`

Implementation B, for comparison:

```
inline int popcount(uint64_t x) {
   int c = 0;
   while (x) {          // one loop branch per four bits
     c += x & 1;
     x = x >> 1;
     c += x & 1;
     x = x >> 1;
     c += x & 1;
     x = x >> 1;
     c += x & 1;
     x = x >> 1;
   }
   return c;
}
```
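As a sanity check that the three variants compute the same result, they can be placed side by side. This is a sketch: the functions are renamed `popcount_a`, `popcount_b`, and `popcount_c` (hypothetical names) so they can coexist in one file, and the lookup table in C is made `static const` here, which the slides do not do.

```c
#include <stdint.h>

/* A: one bit and one loop branch per iteration. */
int popcount_a(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1;
        x >>= 1;
    }
    return c;
}

/* B: four bits per iteration, so one loop branch per four bits. */
int popcount_b(uint64_t x) {
    int c = 0;
    while (x) {
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
        c += x & 1; x >>= 1;
    }
    return c;
}

/* C: four bits per iteration through a 16-entry lookup table. */
int popcount_c(uint64_t x) {
    static const int table[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                  1, 2, 2, 3, 2, 3, 3, 4};
    int c = 0;
    while (x) {
        c += table[x & 0xF];
        x >>= 4;
    }
    return c;
}
```

All three return 7 for the slide's example input 9487. B and C loop the same number of times for a given input, while A loops roughly four times as often.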

## Announcement

- Reading quiz due next Monday
- Homework #3 due next Wednesday
- Project due on 12/2, roughly three weeks from now
  - You can only turn in "helper.c"
    - `mcfutil.c:refresh_potential()` creates helper threads
    - `mcfutil.c:refresh_potential()` calls the `helper_thread_sync()` function periodically
    - It's your task to think about what to do in the `helper_thread_sync()` and `helper_thread()` functions
    - Please DO READ the papers before you ask what to do
  - Formula for grading: min(100, speedup\*100)
  - No extension