## Performance

Hung-Wei Tseng

## Recap: von Neumann Architecture





## By loading different programs into memory, your computer can perform different functions





## Recap: How my "C code" becomes a "program"

[Figure: **Source Code** → Compiler (e.g., gcc) → Linker → **Program**; the program's instructions and data (shown as hexadecimal words) are loaded into Memory.]

## Recap: How my "Java code" becomes a "program"



### Recap: How my "Python code" becomes a "program"



## Definition of "Performance"

## **CPU Performance Equation**

$$Performance = \frac{1}{Execution \ Time}$$

$$Execution \ Time = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle}$$

$$ET = IC \times CPI \times CT$$

Cycle time is the reciprocal of the frequency (i.e., clock rate):

$$1 \ GHz = 10^9 \ Hz = \frac{1}{10^9} \ sec \ per \ cycle = 1 \ ns \ per \ cycle$$

#### **Execution Time**

- The simplest kind of performance
- Shorter execution time means better performance
- Usually measured in seconds




## Performance Equation (X)

Assume that we have an application composed of a total of 5000000000 instructions, in which 20% are "Type-A" instructions with an average CPI of 8 cycles, 20% are "Type-B" instructions with an average CPI of 4 cycles, and the remaining instructions are "Type-C" instructions with an average CPI of 1 cycle. If the processor runs at 3 GHz, how long is the execution time?

B. 5 sec

C. 6.67 sec

D. 15 sec

E. 45 sec

$$ET = IC \times CPI \times CT$$

$$ET = (5 \times 10^{9}) \times (20\% \times 8 + 20\% \times 4 + 60\% \times 1) \times \frac{1}{3 \times 10^{9}} \ sec = 5 \ sec$$

The middle factor, $20\% \times 8 + 20\% \times 4 + 60\% \times 1 = 3$, is the average CPI.
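The calculation above can be checked with a few lines of Python. This is only a sketch: the function name `execution_time` and the `(fraction, cpi)` list layout are illustrative choices, and the instruction count is the $5 \times 10^9$ used in the worked equation.

```python
def execution_time(instruction_count, instruction_mix, clock_hz):
    """Return ET = IC x CPI x CT in seconds.

    instruction_mix is a list of (fraction, cpi) pairs whose fractions sum to 1.
    """
    # Average CPI is the weighted sum of the per-type CPIs.
    average_cpi = sum(fraction * cpi for fraction, cpi in instruction_mix)
    cycle_time = 1.0 / clock_hz  # seconds per cycle
    return instruction_count * average_cpi * cycle_time


# Type-A, Type-B and Type-C instructions from the question
mix = [(0.20, 8), (0.20, 4), (0.60, 1)]
et = execution_time(5e9, mix, 3e9)
print(f"ET = {et:.2f} sec")
```

Changing any one of the three factors (instruction count, mix, or clock rate) shows how each term of the equation moves the execution time.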

## Speedup of Y over X

Consider the same program on the following two machines, X and Y. By how much is Y faster than X?

|           | Clock<br>Rate | Instructions | Percentage of Type-A | CPI of<br>Type-A | Percentage of Type-B | CPI of<br>Type-B | Percentage of Type-C | CPI of<br>Type-C |
|-----------|---------------|--------------|----------------------|------------------|----------------------|------------------|----------------------|------------------|
| Machine X | 3 GHz         | 500000000    | 20%                  | 8                | 20%                  | 4                | 60%                  | 1                |
| Machine Y | 5 GHz         | 500000000    | 20%                  | 13               | 20%                  | 4                | 60%                  | 1                |



- B. 0.25
- C. 0.8
- D. 1.25
- E. No changes





## Speedup

Speedup is the relative performance between two machines, X and Y. Y is n times faster than X when

$$n = \frac{Execution \ Time_X}{Execution \ Time_Y}$$

The speedup of Y over X

$$Speedup = \frac{Execution \ Time_X}{Execution \ Time_Y}$$
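As a sketch, the speedup definition can be applied to machines X and Y from the table earlier. The helper names `execution_time` and `speedup` are illustrative, not part of the slides. The instruction count is the same on both machines, so it cancels in the ratio.

```python
def execution_time(instruction_count, instruction_mix, clock_hz):
    """ET = IC x CPI x CT, with instruction_mix as (fraction, cpi) pairs."""
    average_cpi = sum(fraction * cpi for fraction, cpi in instruction_mix)
    return instruction_count * average_cpi / clock_hz


def speedup(et_baseline, et_enhanced):
    """Speedup of the enhanced machine over the baseline machine."""
    return et_baseline / et_enhanced


ic = 500_000_000
et_x = execution_time(ic, [(0.20, 8), (0.20, 4), (0.60, 1)], 3e9)
et_y = execution_time(ic, [(0.20, 13), (0.20, 4), (0.60, 1)], 5e9)
speedup_y_over_x = speedup(et_x, et_y)
print(f"Speedup of Y over X = {speedup_y_over_x:.2f}")
```

Machine Y has a higher average CPI (4 versus 3), but its shorter cycle time more than compensates.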

## What Affects Each Factor in Performance Equation

## Use "performance counters" to figure out!

- Modern processors provide performance counters
  - instruction counts
  - cache accesses/misses
  - branch instructions/mis-predictions
- How to get their values?
  - You may use "perf stat" on Linux
  - You may use Instruments —> Time Profiler on a Mac
  - Intel's VTune works on Windows and Linux with Intel processors
  - You can also create your own functions to obtain counter values

## Recap: How my "C code" becomes a "program"

[Figure: source code passes through the Compiler to become a program whose instructions and data are loaded into Memory. Compilation is a **One Time Cost!**]

### Recap: How my "Java code" becomes a "program"



### Recap: How my "Python code" becomes a "program"



## How about "computational complexity"

- Algorithm complexity provides a good estimate of performance if
  - Every instruction takes exactly the same amount of time
  - Every operation takes exactly the same number of instructions

#### These are unlikely to be true

## **Summary of CPU Performance Equation**

$$Performance = \frac{1}{Execution \ Time}$$

$$Execution \ Time = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle}$$

$$ET = IC \times CPI \times CT$$

- IC (Instruction Count)
  - ISA, Compiler, algorithm, programming language, programmer
- CPI (Cycles Per Instruction)
  - Machine Implementation, microarchitecture, compiler, application, algorithm, programming language, programmer
- Cycle Time (Seconds Per Cycle)
  - Process Technology, microarchitecture, programmer

## Instruction Set Architecture (ISA) & Performance

#### Recap: ISA, the interface between processor and software

- Operations
  - Arithmetic/Logical, memory access, control-flow (e.g., branch, function calls)
- Operands
  - Types of operands: register, constant, memory address
  - Sizes of operands: byte, 16-bit, 32-bit, 64-bit
- Memory space
  - The size of memory that programs can use
  - The addressing of each memory location
  - The modes to represent those addresses

**Popular ISAs** 



















#### The abstracted "RISC-V" machine



### **Subset of RISC-V instructions**

| Category      | Instruction | Usage                     | Meaning                                |
|---------------|-------------|---------------------------|----------------------------------------|
| Arithmetic    | add         | add x1, x2, x3            | x1 = x2 + x3                           |
|               | addi        | addi x1,x2, 20            | x1 = x2 + 20                           |
|               | sub         | sub x1, x2, x3            | x1 = x2 - x3                           |
| Logical       | and         | and x1, x2, x3            | x1 = x2 & x3                           |
|               | or          | or x1, x2, x3             | $x1 = x2 \mid x3$                      |
|               | andi        | andi x1, x2, 20           | x1 = x2 & 20                           |
|               | sll         | sll x1, x2, 10            | $x1 = x2 * 2^{10}$                     |
|               | srl         | srl x1, x2, 10            | $x1 = x2 / 2^{10}$                     |
| Data Transfer | ld          | ld x1, 8(x2)              | x1 = mem[x2+8]                         |
|               | sd          | sd x1, 8(x2)              | mem[x2+8] = x1                         |
| Branch        | beq         | beq x1, x2, <b>25</b>     | if(x1 == x2), PC = PC + 100            |
|               | bne         | bne x1, x2, <b>25</b>     | if(x1 != x2), PC = PC + 100            |
| Jump          | jal         | jal <b>25</b>             | \$ra = PC + 4, PC = 100                |
|               | jr          | jr \$ra                   | PC = \$ra                              |

Data transfer instructions (ld, sd) are the only type of instructions that can access memory.












## How many operations: CISC v.s. RISC

- CISC (Complex Instruction Set Computing)
  - Examples: x86, Motorola 68K
  - Provide many powerful/complex instructions
    - Many: more than 1503 instructions since 2016
    - Powerful/complex: an instruction can perform both ALU and memory operations
    - Each instruction takes more cycles to execute
- RISC (Reduced Instruction Set Computer)
  - Examples: ARMv8, RISC-V, MIPS (the first RISC instruction set, invented by the authors of our textbook)
  - Each instruction only performs simple tasks
  - Easy to decode
  - Each instruction takes fewer cycles to execute

#### The abstracted x86 machine



## RISC-V v.s. x86

|                   | RISC-V                                   | x86                                                              |
|-------------------|------------------------------------------|------------------------------------------------------------------|
| ISA type          | Reduced Instruction Set Computers (RISC) | Complex Instruction Set Computers (CISC)                         |
| instruction width | 32 bits                                  | 1 ~ 17 bytes                                                     |
| code size         | larger                                   | smaller                                                          |
| registers         | 32                                       | 16                                                               |
| addressing modes  | reg+offset                               | base+offset<br>base+index<br>scaled+index<br>scaled+index+offset |
| hardware          | simple                                   | complex                                                          |

## Amdahl's Law and Its Implication in the Multicore Era

#### Amdahl's Law



$$Speedup_{enhanced}(f, s) = \frac{1}{(1-f) + \frac{f}{s}}$$

$f$: the fraction of execution time in the original program that the enhancement applies to

$s$: the speedup we can achieve on $f$

$$Speedup_{enhanced} = \frac{Execution \ Time_{baseline}}{Execution \ Time_{enhanced}}$$

[Figure: the baseline execution time is split into $f$ and $1-f$; in the enhanced version the $f$ part shrinks to $f/s$ while the $1-f$ part stays the same.]

$$Speedup_{enhanced} = \frac{Execution \ Time_{baseline}}{Execution \ Time_{enhanced}} = \frac{1}{(1-f) + \frac{f}{s}}$$
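A minimal sketch of the formula as a function. The name `amdahl_speedup` and the sample values are illustrative, not from the slides.

```python
def amdahl_speedup(f, s):
    """Speedup_enhanced(f, s) = 1 / ((1 - f) + f / s).

    f: fraction of the original execution time that is enhanced (0..1)
    s: speedup achieved on that fraction (> 0)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Hypothetical example: half of the time is made 2x faster.
print(f"{amdahl_speedup(0.5, 2):.3f}")
```

Note the two extremes: with `f = 0` nothing is enhanced and the speedup is 1, and with `f = 1` the whole program is enhanced and the speedup equals `s`.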

## Recap: Speedup

- Assume that we have an application composed of a total of 500000 instructions, in which 20% are load/store instructions with an average CPI of 6 cycles, and the remaining instructions are integer instructions with an average CPI of 1 cycle when using a 2 GHz processor.
  - If we double the CPU clock rate to 4 GHz, all instructions are accelerated by 2x except the load/store instructions, which cannot be improved, so their CPI becomes 12 cycles. What is the performance improvement after this change?

A. No change

B. 1.25

C. 1.5

D. 2

E. None of the above

$$ET = IC \times CPI \times CT$$

$$ET_{baseline} = (5 \times 10^5) \times (20\% \times 6 + 80\% \times 1) \times \frac{1}{2 \times 10^{9}} \ sec = 5 \times 10^{-4} \ sec$$

$$ET_{enhanced} = (5 \times 10^5) \times (20\% \times 12 + 80\% \times 1) \times \frac{1}{4 \times 10^{9}} \ sec = 4 \times 10^{-4} \ sec$$

$$Speedup = \frac{Execution \ Time_{baseline}}{Execution \ Time_{enhanced}} = \frac{5}{4} = 1.25$$

## Replay using Amdahl's Law

- Assume that we have an application composed of a total of 500000 instructions, in which 20% are load/store instructions with an average CPI of 6 cycles, and the remaining instructions are integer instructions with an average CPI of 1 cycle when using a 2 GHz processor.
  - If we double the CPU clock rate to 4 GHz, all instructions are accelerated by 2x except the load/store instructions, which cannot be improved, so their CPI becomes 12 cycles. What is the performance improvement after this change?

How much time in load/store? $500000 \times (0.2 \times 6) \times 0.5 \ ns = 300000 \ ns \rightarrow 60\%$

How much time in the rest? $500000 \times (0.8 \times 1) \times 0.5 \ ns = 200000 \ ns \rightarrow 40\%$

$$Speedup_{enhanced}(f, s) = \frac{1}{(1 - f) + \frac{f}{s}}$$

$$Speedup_{enhanced}(40\%, 2) = \frac{1}{(1 - 40\%) + \frac{40\%}{2}} = 1.25$$
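The replay can be sketched in Python to confirm that Amdahl's Law and the direct $ET = IC \times CPI \times CT$ calculation agree. Variable names are illustrative. The key step is that $f$ is a fraction of *time* (40%), not of instructions (80%).

```python
ic = 500_000
cycle_ns = 0.5  # 2 GHz baseline clock: 0.5 ns per cycle

# Time spent in each class of instructions on the baseline machine
load_store_ns = ic * 0.2 * 6 * cycle_ns
other_ns = ic * 0.8 * 1 * cycle_ns
total_ns = load_store_ns + other_ns

f = other_ns / total_ns  # only the non-load/store time gets faster
s = 2.0                  # doubling the clock rate
amdahl = 1.0 / ((1.0 - f) + f / s)

# Cross-check with the performance equation
et_baseline_ns = ic * (0.2 * 6 + 0.8 * 1) * 0.5
et_enhanced_ns = ic * (0.2 * 12 + 0.8 * 1) * 0.25
direct = et_baseline_ns / et_enhanced_ns

print(f"f = {f:.2f}, Amdahl = {amdahl:.2f}, direct = {direct:.2f}")
```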

## Amdahl's Law on Multiple Optimizations

- We can apply Amdahl's law to multiple optimizations
- These optimizations must be disjoint!
  - If optimization #1 and optimization #2 are disjoint:

$$Speedup_{enhanced}(f_{Opt1}, f_{Opt2}, s_{Opt1}, s_{Opt2}) = \frac{1}{(1 - f_{Opt1} - f_{Opt2}) + \frac{f_{Opt1}}{s_{Opt1}} + \frac{f_{Opt2}}{s_{Opt2}}}$$

If optimization #1 and optimization #2 are not disjoint, split the execution time into the disjoint parts $f_{OnlyOpt1}$, $f_{OnlyOpt2}$, $f_{BothOpt1Opt2}$, and $1 - f_{OnlyOpt1} - f_{OnlyOpt2} - f_{BothOpt1Opt2}$:

$$Speedup_{enhanced}(f_{OnlyOpt1}, f_{OnlyOpt2}, f_{BothOpt1Opt2}, s_{OnlyOpt1}, s_{OnlyOpt2}, s_{BothOpt1Opt2}) = \frac{1}{(1 - f_{OnlyOpt1} - f_{OnlyOpt2} - f_{BothOpt1Opt2}) + \frac{f_{BothOpt1Opt2}}{s_{BothOpt1Opt2}} + \frac{f_{OnlyOpt1}}{s_{OnlyOpt1}} + \frac{f_{OnlyOpt2}}{s_{OnlyOpt2}}}$$
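Both cases reduce to the same computation once the execution time is split into disjoint parts. The sketch below assumes a hypothetical helper `amdahl_multi` that takes one `(fraction, speedup)` pair per disjoint part. All numbers are made up for illustration.

```python
def amdahl_multi(parts):
    """Amdahl's Law for several disjoint optimized parts.

    parts: list of (fraction, speedup) pairs, one per disjoint part of the
    original execution time. The fractions must not sum to more than 1.
    """
    total_fraction = sum(fraction for fraction, _ in parts)
    if total_fraction > 1.0 + 1e-12:
        raise ValueError("fractions of disjoint parts cannot exceed 1")
    untouched = 1.0 - total_fraction
    return 1.0 / (untouched + sum(fraction / s for fraction, s in parts))


# Disjoint case (hypothetical): Opt1 covers 30% at 2x, Opt2 covers 20% at 4x.
disjoint = amdahl_multi([(0.30, 2), (0.20, 4)])

# Overlapping case (hypothetical): 20% only Opt1 (2x), 10% only Opt2 (4x),
# and 10% where both apply (assumed 8x combined).
overlapping = amdahl_multi([(0.20, 2), (0.10, 4), (0.10, 8)])

print(f"disjoint: {disjoint:.3f}, overlapping: {overlapping:.3f}")
```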

## Amdahl's Law Corollary #1

The maximum speedup is bounded by

$$Speedup_{max}(f, \infty) = \frac{1}{(1-f) + \frac{f}{\infty}}$$

$$Speedup_{max}(f, \infty) = \frac{1}{(1-f)}$$

## Corollary #1 on Multiple Optimizations

If we can pick just one thing to work on/optimize

[Figure: the execution time is split into $f_1$, $f_2$, $f_3$, $f_4$ and the remainder $1-f_1-f_2-f_3-f_4$.]

$$Speedup_{max}(f_1, \infty) = \frac{1}{(1-f_1)}$$

$$Speedup_{max}(f_2, \infty) = \frac{1}{(1-f_2)}$$

$$Speedup_{max}(f_3, \infty) = \frac{1}{(1-f_3)}$$

$$Speedup_{max}(f_4, \infty) = \frac{1}{(1-f_4)}$$

The biggest $f_x$ would lead to the largest $Speedup_{max}$!
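A small sketch of this corollary: given the time fractions of several candidate parts, compute the upper bound for each and pick the best target. The fractions below are hypothetical, and `max_speedup` is an illustrative name.

```python
def max_speedup(f):
    """Upper bound on speedup when a fraction f is made infinitely fast."""
    if f >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - f)


# Hypothetical time breakdown of a program
fractions = {"f1": 0.10, "f2": 0.40, "f3": 0.20, "f4": 0.05}

bounds = {name: max_speedup(f) for name, f in fractions.items()}
best = max(bounds, key=bounds.get)

for name, bound in bounds.items():
    print(f"{name}: at most {bound:.2f}x")
print(f"best target: {best}")
```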

## Corollary #2 — make the common case fast!

- When f is small, optimizations will have little effect.
- Common == most time consuming, not necessarily the most frequent
- The uncommon case doesn't make much difference
- The common case can change based on inputs, compiler options, optimizations you've applied, etc.

## Identify the most time consuming part

- Compile your program with -pg flag
- Run the program
  - It will generate a gmon.out
  - gprof your\_program gmon.out > your\_program.prof
- It will give you the profiled result in your\_program.prof

#### If we repeatedly optimize our design based on Amdahl's law...

[Figure: two execution-time breakdowns, each split between **Storage Media** and **CPU**]

- With optimization, the common becomes uncommon.
- An uncommon case will (hopefully) become the new common case.
- Now you have a new target for optimization.
- You have to revisit "Amdahl's Law" every time you apply some optimization.

Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, and Steven Swanson Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.



### Don't hurt the non-common part too much

- If the program spends 90% of its time in A and 10% in B, assume that an optimization can accelerate A by 9x but hurts B by 10x...
- Assume the original execution time is $ET_{old}$. The new execution time is

$$ET_{new} = \frac{ET_{old} \times 90\%}{9} + ET_{old} \times 10\% \times 10$$

$$ET_{new} = 1.1 \times ET_{old}$$

$$Speedup = \frac{ET_{old}}{ET_{new}} = \frac{ET_{old}}{1.1 \times ET_{old}} = 0.91 \times \dots \text{slowdown!}$$

You may not use Amdahl's Law for this case, as Amdahl's Law does NOT

- (1) consider overhead
- (2) bound the slowdown
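Because Amdahl's Law does not model a part getting slower, the sketch below works directly with execution time instead. Each part is given a time scale: below 1 means faster, above 1 means slower. The helper name `scaled_execution_time` is an assumption made for illustration.

```python
def scaled_execution_time(et_old, parts):
    """New execution time when each part of the program is rescaled.

    parts: list of (fraction_of_time, time_scale) pairs covering the whole
    program; time_scale < 1 speeds the part up, time_scale > 1 slows it down.
    """
    return et_old * sum(fraction * scale for fraction, scale in parts)


et_old = 1.0
# A: 90% of the time, 9x faster.  B: 10% of the time, 10x slower.
et_new = scaled_execution_time(et_old, [(0.90, 1 / 9), (0.10, 10)])
speedup = et_old / et_new
print(f"ET_new = {et_new:.2f} x ET_old, speedup = {speedup:.2f}")
```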



#### If we repeatedly optimize our design based on Amdahl's law...

Something else (e.g., data movement) matters more now.

#### Amdahl's Law on Multicore Architectures

 Symmetric multicore processor with n cores (if we assume the processor performance scales perfectly)

$$Speedup_{parallel}(f_{parallelizable}, n) = \frac{1}{(1 - f_{parallelizable}) + \frac{f_{parallelizable}}{n}}$$

## Corollary #3

$$Speedup_{parallel}(f_{parallelizable}, \infty) = \frac{1}{(1 - f_{parallelizable}) + \frac{f_{parallelizable}}{\infty}}$$

$$Speedup_{parallel}(f_{parallelizable}, \infty) = \frac{1}{(1 - f_{parallelizable})}$$

- Single-core performance still matters
  - It will eventually dominate the performance
  - If we cannot improve single-core performance further, finding more "parallelizable" parts is more important
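To see how quickly the serial part starts to dominate, the sketch below evaluates the multicore formula for a growing number of cores. The 90% parallelizable fraction is a hypothetical value chosen only for illustration.

```python
def parallel_speedup(f_parallelizable, n):
    """Speedup on n cores, assuming the parallel part scales perfectly."""
    return 1.0 / ((1.0 - f_parallelizable) + f_parallelizable / n)


f = 0.90  # hypothetical: 90% of the execution time is parallelizable
limit = 1.0 / (1.0 - f)  # Corollary #3: the bound with unlimited cores

for n in (1, 2, 4, 16, 64, 1024):
    print(f"{n:5d} cores: {parallel_speedup(f, n):.2f}x")
print(f"limit: {limit:.2f}x")
```

Even with 1024 cores the speedup stays below the limit set by the 10% that runs on a single core.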

## Demo — merge sort v.s. bitonic sort on GPUs

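## Merge Sort $O(n \log_2 n)$

## Bitonic Sort $O(n \log_2^2 n)$

```
#define N 8            /* assumed size for illustration; must be a power of two */
int a[N];              /* array sorted in place */

void exchange(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

void BitonicSort() {
   int i,j,k;
   for (k=2; k<=N; k=2*k) {           /* size of the bitonic sequences being merged */
       for (j=k>>1; j>0; j=j>>1) {    /* distance between the compared elements */
          for (i=0; i<N; i++) {       /* iterations are independent of each other */
              int ij=i^j;             /* partner of element i */
              if ((ij)>i) {
                 if ((i&k)==0 && a[i] > a[ij])   /* ascending block */
                     exchange(i,ij);
                 if ((i&k)!=0 && a[i] < a[ij])   /* descending block */
                     exchange(i,ij);
              }
          }
       }
   }
}
```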

## Merge sort



## Parallel merge sort



#### **Bitonic sort**



## Bitonic sort (cont.)



Benefits: in-place merge (no additional space is necessary), very stable comparison patterns

$O(n \log^2 n)$: hard to beat $O(n \log n)$ if you can't parallelize this a lot!
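The trade-off can be made concrete with a sequential Python transcription of the bitonic sorting network that also counts comparisons. This is an illustrative sketch, not the GPU code from the demo. The comparisons inside one `(k, j)` pass do not depend on each other, and that independence is what a GPU exploits.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence; return (sorted_list, comparisons)."""
    a = list(values)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    comparisons = 0
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k >> 1
        while j > 0:           # distance between the compared elements
            for i in range(n): # every iteration of this loop is independent
                partner = i ^ j
                if partner > i:
                    comparisons += 1
                    ascending = (i & k) == 0
                    if (ascending and a[i] > a[partner]) or \
                       (not ascending and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j >>= 1
        k <<= 1
    return a, comparisons


data = [7, 3, 15, 0, 9, 12, 1, 4, 10, 6, 2, 14, 8, 5, 13, 11]
sorted_values, comparisons = bitonic_sort(data)
print(sorted_values)
print("comparisons:", comparisons)
```

The comparison count is always the same for a given length, regardless of the input values, which is why the comparison pattern is described as stable.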

## Corollary #4

$$Speedup_{parallel}(f_{parallelizable}, \infty) = \frac{1}{(1 - f_{parallelizable}) + \frac{f_{parallelizable}}{\infty}}$$

$$Speedup_{parallel}(f_{parallelizable}, \infty) = \frac{1}{(1 - f_{parallelizable})}$$

- If we can build a processor with unlimited parallelism
  - The complexity doesn't matter as long as the algorithm can utilize all parallelism
  - That's why bitonic sort or MapReduce works!
- The future trend of software/application design is to seek more parallelism rather than lower computational complexity

## "Fair" Comparisons

Andrew Davison. Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. In Humour the Computer, MITP, 1995

V. Sze, Y.-H. Chen, T.-J. Yang and J. S. Emer. How to Evaluate Deep Neural Network Processors: TOPS/W (Alone) Considered Harmful. In IEEE Solid-State Circuits Magazine, vol. 12, no. 3, pp. 28-41, Summer 2020.

#### TFLOPS (Tera FLoating-point Operations Per Second)

[Figure: Console Teraflops]



#### Is TFLOPS (Tera FLoating-point Operations Per Second) a good metric?

$$TFLOPS = \frac{\# \ of \ floating \ point \ instructions \times 10^{-12}}{Execution \ Time}$$

$$= \frac{IC \times \% \ of \ floating \ point \ instructions \times 10^{-12}}{IC \times CPI \times CT}$$

$$= \frac{\% \ of \ floating \ point \ instructions \times 10^{-12}}{CPI \times CT}$$

IC is gone!

- Cannot compare different ISA/compiler
  - What if the compiler can generate code with fewer instructions?
  - What if new architecture has more IC but also lower CPI?
- Does not make sense if the application is not floating point intensive
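A sketch of why this matters, using two made-up machines running the same task. Machine B needs three times as many instructions but has half the CPI. All numbers and function names here are hypothetical.

```python
def execution_time(ic, cpi, cycle_time):
    """ET = IC x CPI x CT, in seconds."""
    return ic * cpi * cycle_time


def tflops(ic, fp_fraction, cpi, cycle_time):
    """Floating-point instructions per second, in units of 10^12."""
    return ic * fp_fraction * 1e-12 / execution_time(ic, cpi, cycle_time)


# Hypothetical machines running the same task with a 1 ns cycle time
et_a = execution_time(ic=1e9, cpi=2.0, cycle_time=1e-9)
et_b = execution_time(ic=3e9, cpi=1.0, cycle_time=1e-9)
tflops_a = tflops(ic=1e9, fp_fraction=0.5, cpi=2.0, cycle_time=1e-9)
tflops_b = tflops(ic=3e9, fp_fraction=0.5, cpi=1.0, cycle_time=1e-9)

print(f"A: ET = {et_a:.1f} s, TFLOPS = {tflops_a:.5f}")
print(f"B: ET = {et_b:.1f} s, TFLOPS = {tflops_b:.5f}")
```

Machine B reports the higher TFLOPS yet takes longer to finish, because the metric no longer depends on the instruction count.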

|                  | TFLOPS | clock rate |
|------------------|--------|------------|
| Switch           | 1      | 921 MHz    |
| XBOX One X       | 6      | 1.75 GHz   |
| PS4 Pro          | 4      | 1.6 GHz    |
| GeForce GTX 2080 | 14.2   | 1.95 GHz   |







Server Config: Dual Xeon E5-2699 v4 2.6 GHz | 8X NVIDIA® Tesla® P100 or V100 | ResNet-50 Training on MXNet for 90 Epochs with 1.28M ImageNet Dataset.

#### AI TRAINING

From recognizing speech to training virtual personal assistants and teaching autonomous cars to drive, data scientists are taking on increasingly complex challenges with AI. Solving these kinds of problems requires training deep learning models that are exponentially growing in complexity, in a practical amount of time.

With 640 Tensor Cores, Tesla V100 is the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance. The next generation of NVIDIA NVLink™ connects multiple V100 GPUs at up to 300 GB/s to create the world's most powerful computing servers. All models that would consume weeks of computing resources on previous systems can now be trained in a few days. With this dramatic reduction in training time, a whole new world of problems will now be solvable with AI.

#### The Most Advanced Data Center GPU Ever Built.

NVIDIA® Tesla® V100 is the world's most advanced data center GPU ever built to accelerate AI, HPC, and graphics. Powered by NVIDIA Volta, the latest GPU architecture, Tesla V100 offers the performance of up to 100 CPUs in a single GPU—enabling data scientists, researchers, and engineers to tackle challenges that were once thought impossible.





#### **SPECIFICATIONS**





|                                 | Tesla V100 PCIe         | Tesla V100 SXM2 |
|---------------------------------|-------------------------|-----------------|
| GPU Architecture                | NVIDIA Volta            |                 |
| NVIDIA Tensor<br>Cores          | 640                     |                 |
| NVIDIA CUDA®<br>Cores           | 5,120                   |                 |
| Double-Precision<br>Performance | 7 TFLOPS                | 7.8 TFLOPS      |
| Single-Precision<br>Performance | 14 TFLOPS               | 15.7 TFLOPS     |
| Tensor<br>Performance           | 112 TFLOPS              | 125 TFLOPS      |
| GPU Memory                      | 32GB /16GB HBM2         |                 |
| Memory<br>Bandwidth             | 900GB/sec               |                 |
| ECC                             | Yes                     |                 |
| Interconnect<br>Bandwidth       | 32GB/sec                | 300GB/sec       |
| System Interface                | PCIe Gen3               | NVIDIA NVLink   |
| Form Factor                     | PCIe Full Height/Length | SXM2            |
| Max Power                       |                         |                 |

[Figure: 1 GPU Node Replaces Up To 54 CPU Nodes. Node Replacement: HPC Mixed Workload]

## They try to tell you it's the better AI hardware

https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/

|                                 | K80<br>2012                    | TPU<br>2015 | P40<br>2016 |
|---------------------------------|--------------------------------|-------------|-------------|
| Inferences/Sec<br><10ms latency | 1/13 X                         | 1X          | 2X          |
| Training TOPS                   | 6 FP32                         | NA          | 12 FP32     |
| Inference TOPS                  | 6 FP32                         | 90 INT8     | 48 INT8     |
| On-chip Memory                  | 16 MB                          | 24 MB       | 11 MB       |
| Power                           | 300W                           | 75W         | 250W        |
| Bandwidth                       | 320 GB/S                       | 34 GB/S     | 350 GB/S    |

## Inference per second

$$\frac{Inferences}{Second} = \frac{Inferences}{Operation} \times \frac{Operations}{Second}$$

$$= \frac{Inferences}{Operation} \times \left[\frac{operations}{cycle} \times \frac{cycles}{second} \times \#\_of\_PEs \times Utilization\_of\_PEs\right]$$

|                                                                        | Hardware | Model | Input Data |
|------------------------------------------------------------------------|----------|-------|------------|
| Operations per inference                                               |          | V     |            |
| Operations per cycle                                                   | V        |       |            |
| Cycles per second                                                      | V        |       |            |
| Number of PEs                                                          | V        |       |            |
| Utilization of PEs                                                     | V        | V     |            |
| Effectual operations out of (total) operations                         |          | V     | V          |
| Effectual operations plus unexploited ineffectual operations per cycle | V        |       |            |

## What's wrong with inferences per second?

- There is no standard for how inference is measured, but these choices affect the result!
  - What model?
  - What dataset?
- That's why Facebook is trying to promote an AI benchmark: **MLPerf**

Pitfall: For NN hardware, Inferences Per Second (IPS) is an inaccurate summary performance metric.

Our results show that IPS is a poor overall performance summary for NN hardware, as it's simply the inverse of the complexity of the typical inference in the application (e.g., the number, size, and type of NN layers). For example, the TPU runs the 4-layer MLP1 at 360,000 IPS but the 89-layer CNN1 at only 4,700 IPS, so TPU IPS vary by 75X! Thus, using IPS as the single-speed summary is even more misleading for NN accelerators than MIPS or FLOPS are for regular processors [23], so IPS should be even more disparaged. To compare NN machines better, we need a benchmark suite written at a high-level to port it to the wide variety of NN architectures. Fathom is a promising new attempt at such a benchmark suite [3].

# Choose the right metric — Latency v.s. Throughput/Bandwidth

## Latency v.s. Bandwidth/Throughput

- Latency: the amount of time to finish an operation
  - Access time
  - Response time
- Throughput: the amount of work that can be done within a given period of time
  - Bandwidth (MB/Sec, GB/Sec, Mbps, Gbps)
  - IOPs (I/O operations per second)
  - FLOPs (Floating-point operations per second)
  - IPS (Inferences per second)

### With MLPerf, are we good with inferences/second?

- The following table shows the inferences/second using the ImageNet dataset and the ResNet-50 v1.5 model, as well as the maximum number of concurrent "inferences" each machine can support. If we are targeting making decisions for autonomous cars, which requires a decision …

|                                                  | Intel® Xeon® Platinum 9200 processors (CPU) | Google Cloud TPU v3<br>(TPU)   | NVIDIA/Supermicro 4029GP-TRT-<br>OTO-28 8xT4 (GPU) |
|--------------------------------------------------|---------------------------------------------|--------------------------------|----------------------------------------------------|
| Inferences per second (Bandwidth)                | 5,965.62                                    | 32,716.00                      | 44,977.80                                          |
| MXU                                              |                                             | 128*128*2                      | 4*4*320*8                                          |
| Number of Maximum Parallel Inferencing Instances | 224                                         | 128*2 = 256                    | 4*320*8 = 10240                                    |
| Batches/Sec                                      | $\frac{5965.62}{224} = 26.63$               | $\frac{32716}{256} = 128$      | $\frac{44977.80}{10240} = 4.39$                    |
| Seconds/Batch                                    | $\frac{1}{26.63} = 37.55ms$                 | $\frac{1}{128} = 7.8ms$        | $\frac{1}{4.39} = 227.79ms$                        |

Source: https://mlperf.org/inference-results

A. CPU and TPU

B. TPU and GPU

C. Only GPU

E. All would work well
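The table's throughput and concurrency figures can be converted into the time one batch takes with a short sketch. The dictionary layout and variable names are illustrative, and the conversion assumes every parallel instance is busy, as the slide's arithmetic does.

```python
# (inferences per second, maximum parallel inferencing instances)
systems = {
    "CPU": (5965.62, 224),
    "TPU": (32716.00, 256),
    "GPU": (44977.80, 10240),
}

latency_ms = {}
for name, (inferences_per_sec, instances) in systems.items():
    batches_per_sec = inferences_per_sec / instances
    latency_ms[name] = 1000.0 / batches_per_sec  # time for one batch
    print(f"{name}: {batches_per_sec:8.2f} batches/sec, "
          f"{latency_ms[name]:7.2f} ms per batch")
```

The machine with the highest inferences/second (the GPU) has by far the longest time per batch, so a throughput ranking and a latency ranking can disagree.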

## RAID — Improving throughput




(Burst mode)

## Latency/Delay v.s. Throughput





## Extreme Multitasking Performance

- Dual 4K external monitors
- 1080p device display
- 7 applications

## What's missing in this video clip?

- The ISA of the "competitor"
- Clock rate, CPU architecture, cache size, how many cores
- How big is the RAM?
- How fast is the disk?

## 12 ways to Fool the Masses When Giving Performance Results on Parallel Computers

- Quote only 32-bit performance results, not 64-bit results.
- Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
- Quietly employ assembly code and other low-level language constructs.
- Scale up the problem size with the number of processors, but omit any mention of this fact.
- Quote performance results projected to a full system.
- Compare your results against scalar, unoptimized code on Crays.
- When direct run time comparisons are required, compare with an old code on an obsolete system.
- If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
- Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
- Mutilate the algorithm used in the parallel implementation to match the architecture.
- Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
- If all else fails, show pretty pictures and animated videos, and don't talk about performance.