## Multithreaded Architectures and Programming on Multithreaded Architectures

Hung-Wei

Figure: Superscalar processor with ROB. Blocks shown: instruction fetch/decode, register renaming logic, register mapping table, physical registers, unresolved branch tracking, instruction queue, integer ALU, floating-point adder, floating-point mul/div, branch and address resolution, load queue, store queue, and memory (address/data).

Recap: What about "linked list"

#### **Static instructions**

- `LOOP: ld X10, 8(X10)`
- `addi X7, X7, 1`
- `bne X10, X0, LOOP`

#### **Dynamic instructions**

ILP is low because of data dependencies.

- ① ld X10, 8(X10)
- ② addi X7, X7, 1
- ③ bne X10, X0, LOOP
- ④ ld X10, 8(X10)
- ⑤ addi X7, X7, 1
- ⑥ bne X10, X0, LOOP
- ⑦ ld X10, 8(X10)
- ⑧ addi X7, X7, 1
- ⑨ bne X10, X0, LOOP




## Demo: ILP within a program

perf is a tool that reads your processor's performance counters and can report results such as branch misprediction rate, cache miss rates, and ILP.



- The processor can schedule instructions from different threads/processes/programs
- Fetch instructions from different threads/processes to fill the unused parts of the pipeline
  - Exploit "thread-level parallelism" (TLP) to make up for insufficient ILP in a single thread
  - The hardware must present the illusion of multiple processors to the OS

```
     Thread 1                     Thread 2
① ld   X10, 8(X10)            ld   X1, 0(X10)
② addi X7, X7, 1              addi X10, X10, 8
③ bne  X10, X0, LOOP          add  X20, X20, X1
④ ld   X10, 8(X10)            bne  X10, X2, LOOP
⑤ addi X7, X7, 1              ld   X1, 0(X10)
⑥ bne  X10, X0, LOOP          addi X10, X10, 8
⑦ ld   X10, 8(X10)            add  X20, X20, X1
⑧ addi X7, X7, 1              bne  X10, X2, LOOP
⑨ bne  X10, X0, LOOP          ld   X1, 0(X10)
                              addi X10, X10, 8
                              add  X20, X20, X1
                              bne  X10, X2, LOOP
```

Figure: Superscalar processor with ROB compared with an SMT superscalar processor with ROB. The SMT version shows two program counters (PC #1, PC #2) and two register mapping tables (#1, #2); the physical registers, instruction queue, functional units, load/store queues, and memory are shared.

- Improves execution throughput
  - May increase the latency of a single thread
- Lower branch penalty per thread
- Higher hardware utilization
- Simple hardware design: only the per-thread PC and register state (register mapping tables) need to be duplicated
- Real cases:
  - Intel HyperThreading (supports up to two threads per core)
    - Intel Pentium 4, Intel Atom, Intel Core i7
  - AMD Ryzen (Zen microarchitecture)


#### Wider-issue processors won't give you much more

Table 5. Performance of a single 2-issue superscalar processor.

| Program  | IPC | BP Rate<br>% | I cache<br>%MPCI | D cache<br>%MPCI | L2 cache<br>%MPCI |
|----------|-----|--------------|------------------|------------------|-------------------|
| compress | 0.9 | 85.9         | 0.0              | 3.5              | 1.0               |
| eqntott  | 1.3 | 79.8         | 0.0              | 0.8              | 0.7               |
| m88ksim  | 1.4 | 91.7         | 2.2              | 0.4              | 0.0               |
| MPsim    | 0.8 | 78.7         | 5.1              | 2.3              | 2.3               |
| applu    | 0.9 | 79.2         | 0.0              | 2.0              | 1.7               |
| apsi     | 0.6 | 95.1         | 1.0              | 4.1              | 2.1               |
| swim     | 0.9 | 99.7         | 0.0              | 1.2              | 1.2               |
| tomcatv  | 0.8 | 99.6         | 0.0              | 7.7              | 2.2               |
| pmake    | 1.0 | 86.2         | 2.3              | 2.1              | 0.4               |

Table 6. Performance of the 6-issue superscalar processor.

| Program  | IPC | BP Rate<br>% | I cache<br>%MPCI | D cache<br>%MPCI | L2 cache<br>%MPCI |
|----------|-----|--------------|------------------|------------------|-------------------|
| compress | 1.2 | 86.4         | 0.0              | 3.9              | 1.1               |
| eqntott  | 1.8 | 80.0         | 0.0              | 1.1              | 1.1               |
| m88ksim  | 2.3 | 92.6         | 0.1              | 0.0              | 0.0               |
| MPsim    | 1.2 | 81.6         | 3.4              | 1.7              | 2.3               |
| applu    | 1.7 | 79.7         | 0.0              | 2.8              | 2.8               |
| apsi     | 1.2 | 95.6         | 0.2              | 3.1              | 2.6               |
| swim     | 2.2 | 99.8         | 0.0              | 2.3              | 2.5               |
| tomcatv  | 1.3 | 99.7         | 0.0              | 4.2              | 4.3               |
| pmake    | 1.4 | 82.7         | 0.7              | 1.0              | 0.6               |

## Intel Skylake









## The case for a Single-Chip Multiprocessor

Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang Stanford University

#### Wide-issue SS processor vs. multiple narrower-issue SS processors








## **Concept of CMP**



# Architectural Support for Parallel Programming

## Parallel programming

- To exploit parallelism, you need to break your computation into multiple "processes" or multiple "threads"
- Processes (in OS/software systems)
  - Separate programs that are actually running (not sitting idle) on your computer at the same time
  - Each process has its own virtual memory space, so you must explicitly exchange data through inter-process communication APIs
- Threads (in OS/software systems)
  - Independent portions of your program that can run in parallel
  - All threads share the same virtual memory space
- We will refer to these collectively as "threads"
  - A typical user system might have 1-8 actively running threads
  - Servers can have more if needed (the sysadmins will hopefully configure it that way)

#### What software thinks about "multiprogramming" hardware






## **Coherency & Consistency**

- Coherency: guarantees that all processors see the same value for a variable/memory address when they need that value at the same time
  - What value should be seen
- Consistency: all threads see changes to data in the same order
  - When the memory operation should be done

## Simple cache coherency protocol

- Snooping protocol
  - Each processor broadcasts / listens to cache misses
- A state is associated with each block (cache line)
  - Invalid
    - The data in the current block is invalid
  - Shared
    - The processor can read the data
    - The data may also exist on other processors
  - Exclusive
    - The processor has full permission on the data
    - The processor is the only one that has up-to-date data

## Coherent way-associative cache

Figure: A way-associative cache in which every block stores coherence state bits and a dirty bit (D) next to its tag and data. The lookup for memory address 0b0000100000100100 selects a set and compares the stored tag in each way against the address tag to decide whether there is a hit.

**Snooping Protocol** 



#### What happens when we write in coherent caches?



### Cache coherency



#### 4Cs of cache misses

- 3Cs:
  - Compulsory, Conflict, Capacity
- Coherency miss:
  - A block is invalidated because it is shared among processors.

## False sharing

- True sharing
  - Processor A modifies X, and processor B also wants to access X.
- False sharing
  - Processor A modifies X, and processor B wants to access Y. However, Y is invalidated because X and Y are in the same block!

#### Possible scenarios

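```
Scenario 1: (x, y) = (1, 1)
Thread 1        Thread 2
a=1;
                b=1;
                y=a;
x=b;

Scenario 2: (x, y) = (1, 0)
Thread 1        Thread 2
                b=1;
                y=a;
a=1;
x=b;
```

```
Scenario 3: (x, y) = (0, 1)
Thread 1        Thread 2
a=1;
x=b;
                b=1;
                y=a;

Scenario 4: (x, y) = (0, 0), possible with OoO scheduling!
Thread 1        Thread 2
                y=a;
x=b;
a=1;
                b=1;
```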

## Why (0,0)?

- The processor/compiler may reorder your memory operations/instructions
  - The coherence protocol only guarantees the order of updates to the same memory address
  - The processor can serve memory requests that hit in the cache before earlier ones that miss
  - The compiler may keep values in registers and perform the memory operations later
- Processor cores may not all run at the same speed (cache misses, branch mispredictions, I/O, voltage scaling, etc.)
- Threads may not be executed/scheduled right after they are spawned
## Again: how many values are possible?

Consider the given program. You can safely assume the caches are coherent. How many of the following outputs will you see?

- ① (0, 0)
- ② (0, 1)
- ③ (1, 0)
- ④ (1, 1)
- A. 0
- B. 1
- C. 2
- D. 3
- E. 4
```
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

/* Shared by both threads; volatile stops the compiler from keeping them in registers. */
volatile int a, b;
volatile int x, y;
volatile int f;

/* Thread 1: write a, then read b. */
void* modifya(void *z) {
  a = 1;
  x = b;
  return NULL;
}

/* Thread 2: write b, then read a. */
void* modifyb(void *z) {
  b = 1;
  y = a;
  return NULL;
}

int main() {
  pthread_t thread[2];
  pthread_create(&thread[0], NULL, modifya, NULL);
  pthread_create(&thread[1], NULL, modifyb, NULL);
  pthread_join(thread[0], NULL);
  pthread_join(thread[1], NULL);
  fprintf(stderr, "(%d, %d)\n", x, y);
  return 0;
}
```

#### fence instructions

- x86 provides an `mfence` instruction that prevents memory operations from being reordered across the fence
- x86 supports only this kind of "relaxed consistency" model, so you still have to be careful to make sure that your code behaves as you expect

| thread 1 | thread 2 |
|----------|----------|
| a=1;<br>mfence <b>(a=1 must occur/update before mfence)</b><br>x=b; | b=1;<br>mfence <b>(b=1 must occur/update before mfence)</b><br>y=a; |

## Take-aways of parallel programming

- Processor behaviors are non-deterministic
  - You cannot predict which processor will run faster
  - You cannot predict when the OS will schedule your thread
- Cache coherency only guarantees that everyone will eventually have a coherent view of the data, but not when
- Memory consistency is hard to support

## Simultaneous multithreading: maximizing on-chip parallelism

Dean M. Tullsen, Susan J. Eggers, Henry M. Levy
Department of Computer Science and Engineering, University of Washington




#### Architectural support for simultaneous multithreading

- To create an illusion of a multi-core processor and allow the core to run instructions from multiple threads concurrently, how many of the following units in the processor must be duplicated/extended?
  - ① Program counter
  - ② Register mapping tables
  - ③ Physical registers
  - ④ ALUs
  - ⑤ Data cache
  - ⑥ Reorder buffer/Instruction Queue
  - A. 2
  - B. 3
  - C. 4
  - D. 5
  - E. 6


#### Architectural support for simultaneous multithreading

- To create an illusion of a multi-core processor and allow the core to run instructions from multiple threads concurrently, how many of the following units in the processor must be duplicated/extended?
  - ① Program counter: you need one for each context
  - ② Register mapping tables: you need one for each context
  - ③ Physical registers: you can share
  - ④ ALUs: you can share
  - ⑤ Data cache: you can share
  - ⑥ Reorder buffer/Instruction Queue: you need to indicate which context the instruction is from
  - A. 2
  - B. 3
  - C. 4
  - D. 5
  - E. 6


- How many of the following statements about SMT are correct?
  - ① SMT makes processors with deep pipelines more tolerant of mispredicted branches
  - ② SMT can improve the throughput of a single-threaded application
  - ③ SMT processors can better utilize hardware during cache misses compared with superscalar processors with the same issue width
  - ④ SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4




- How many of the following statements about SMT are correct?
  - ① SMT makes processors with deep pipelines more tolerant of mispredicted branches: we can execute from other threads/contexts instead of the current one
  - ② SMT can improve the throughput of a single-threaded application: hurt, b/c you are sharing resources with other threads
  - ③ SMT processors can better utilize hardware during cache misses compared with superscalar processors with the same issue width: we can execute from other threads/contexts instead of the current one
  - ④ SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications: b/c we're sharing the cache
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4