# **Memory Hierarchy**

Hung-Wei Tseng



### von Neumann Architecture



### By loading different programs into memory, your computer can perform different functions





[Figure: a processor, its memory (holding instructions and data), and storage]

### **Performance gap between Processor/Memory**





### Performance of modern DRAM

| Production year | Chip size | DRAM type | RAS time (ns), best case | CAS time (ns), best case | Total (ns), best case (no precharge) | Total (ns), precharge needed |
|-----------------|-----------|-----------|--------------------------|--------------------------|--------------------------------------|------------------------------|
| 2000            | 256M bit  | DDR1      | 21            | 21                | 42         | 63               |
| 2002            | 512M bit  | DDR1      | 15            | 15                | 30         | 45               |
| 2004            | 1G bit    | DDR2      | 15            | 15                | 30         | 45               |
| 2006            | 2G bit    | DDR2      | 10            | 10                | 20         | 30               |
| 2010            | 4G bit    | DDR3      | 13            | 13                | 26         | 39               |
| 2016            | 8G bit    | DDR4      | 13            | 13                | 26         | 39               |

Figure 2.4 Capacity and access times for DDR SDRAMs by year of production. Access time is for a random memory word and assumes a new row must be opened. If the row is in a different bank, we assume the bank is precharged; if the row is not open, then a precharge is required, and the access time is longer. As the number of banks has increased, the ability to hide the precharge time has also increased. DDR4 SDRAMs were initially expected in 2014, but did not begin production until early 2016.





### **Alternatives?**

| Memory technology          | Typical access time    | \$ per GiB in 2012 |
|----------------------------|------------------------|--------------------|
| SRAM semiconductor memory  | 0.5–2.5ns              | \$500-\$1000       |
| DRAM semiconductor memory  | 50–70ns                | \$10-\$20          |
| Flash semiconductor memory | 5,000-50,000ns         | \$0.75-\$1.00      |
| Magnetic disk              | 5,000,000-20,000,000ns | \$0.05-\$0.10      |

SRAM: **fast, but expensive** \$\$\$



# L1? L2? L3?



[Screenshot: CPU-Z "CPU" tab for an Intel Core i7 (code name Coffee Lake, 14 nm, Socket 1151), core speed 4798.85 MHz, multiplier x 48.0, bus speed 99.98 MHz, with a cache panel listing the L1 data, L1 instruction, and higher-level caches]





# Why would adding small SRAMs work?



- Spatial locality: an application tends to visit nearby locations in memory
  - Code: the current instruction, and then PC + 4
- Temporal locality: an application tends to revisit the same locations
  - Code: loops, frequently executed code
  - Data: the same data can be read/written many times

Most of the time, your program is just visiting a very small amount of data/instructions within a given window.

# Architecting the Cache

### Processor loads/stores access only a "word" at a time

[Figure: the core issues `load 0x000A` into its registers. Memory is drawn as rows starting at 0x0000, 0x1000, ..., 0x8000, each filled with words (AAAA, BBBB, ..., HHHH). A second view shows 16-byte blocks with byte offsets 0 to F holding the text "This is CS 203: ", "Advanced Compute", "r Architecture!".]



# Tell if the block here can be used



[Figure: direct-mapped cache. Each entry holds one 16-byte data block (byte offsets 0 to F) such as "This is CS 203: ", "Advanced Compute", or "r Architecture!".]

### **Way-associative cache**





### C = ABS

- **C**: **C**apacity in data arrays
- **A**: Way-**A**ssociativity, i.e. how many blocks are in a set
  - N-way: N blocks in a set, A = N
  - 1 for a direct-mapped cache
- **B**: **B**lock Size (Cacheline)
  - How many bytes are in a block
- **S**: Number of **S**ets
  - A set contains the blocks sharing the same index
  - 1 for a fully associative cache



# **Corollary of C = ABS**

[Figure: a memory address (0b000100000100100) split, from most to least significant bits, into tag, set index, and block offset]

- number of bits in block offset: lg(B)
- number of bits in set index: lg(S)
- tag bits: address\_length - lg(S) - lg(B)
  - address\_length is 32 bits for a 32-bit machine
- (address / block\_size) % S = set index



# Putting it all together: how the cache interacts with the CPU





# Simulate the cache!



### Simulate a direct-mapped cache

- Consider a direct-mapped (1-way) cache with 256 bytes total capacity and a block size of 16 bytes, with the application repeatedly reading the following memory addresses:
  - 0b1000000000, 0b1000001000, 0b1000010000, 0b1000010100, 0b1100010000
    - C = A B S
    - S = 256/(16\*1) = 16
    - lg(16) = 4: 4 bits are used for the index
    - lg(16) = 4: 4 bits are used for the byte offset
    - The tag is 48 - (4 + 4) = 40 bits
    - For example: 0b1000 0000 0000 0000 0000 0000 1000 0000 (the high-order bits form the tag)





### Simulate a direct-mapped cache

|    | V | D | Tag  | Data            |
|----|---|---|------|-----------------|
| 0  | 1 | 0 | 0b10 | r Architecture! |
| 1  | 1 | 0 | 0b10 | This is CS 203: |
| 2  | 0 | 0 |      |                 |
| 3  | 0 | 0 |      |                 |
| 4  | 0 | 0 |      |                 |
| 5  | 0 | 0 |      |                 |
| 6  | 0 | 0 |      |                 |
| 7  | 0 | 0 |      |                 |
| 8  | 0 | 0 |      |                 |
| 9  | 0 | 0 |      |                 |
| 10 | 0 | 0 |      |                 |
| 11 | 0 | 0 |      |                 |
| 12 | 0 | 0 |      |                 |
| 13 | 0 | 0 |      |                 |
| 14 | 0 | 0 |      |                 |
| 15 | 0 | 0 |      |                 |

| Access | Tag  | Index | Offset | Result |
|--------|------|-------|--------|--------|
| 1      | 0b10 | 0000  | 0000   | miss   |
| 2      | 0b10 | 0000  | 1000   | hit!   |
| 3      | 0b10 | 0001  | 0000   | miss   |
| 4      | 0b10 | 0001  | 0100   | hit!   |
| 5      | 0b11 | 0001  | 0000   | miss   |
| 6      | 0b10 | 0000  | 0000   | hit!   |
| 7      | 0b10 | 0000  | 1000   | hit!   |
| 8      | 0b10 | 0001  | 0000   | miss   |
| 9      | 0b10 | 0001  | 0100   | hit!   |

# Simulate a 2-way cache

- Consider a 2-way cache with 256 bytes total capacity and a block size of 16 bytes, with the application repeatedly reading the following memory addresses:
  - 0b1000000000, 0b1000001000, 0b1000010000, 0b1000010100, 0b1100010000
    - C = A B S
    - S = 256/(16\*2) = 8
    - $8 = 2^3$: 3 bits are used for the index
    - $16 = 2^4$: 4 bits are used for the byte offset
    - The tag is 32 - (3 + 4) = 25 bits
    - For example: 0b1000 0000 0000 0000 0000 0000 0001 0000 (the high-order bits form the tag)





### Simulate a 2-way cache

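| Access | Address      | Tag   | Index | Offset |
|--------|--------------|-------|-------|--------|
| 1      | 0b1000000000 | 0b100 | 000   | 0000   |
| 2      | 0b1000001000 | 0b100 | 000   | 1000   |
| 3      | 0b1000010000 | 0b100 | 001   | 0000   |
| 4      | 0b1000010100 | 0b100 | 001   | 0100   |
| 5      | 0b1100010000 | 0b110 | 001   | 0000   |
| 6      | 0b1000000000 | 0b100 | 000   | 0000   |
| 7      | 0b1000001000 | 0b100 | 000   | 1000   |
| 8      | 0b1000010000 | 0b100 | 001   | 0000   |
| 9      | 0b1000010100 | 0b100 | 001   | 0100   |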

| Set | V | D | Tag   | Data            | V | D | Tag   | Data             |
|-----|---|---|-------|-----------------|---|---|-------|------------------|
| 0   | 1 | 0 | 0b100 | r Architecture! | 0 | 0 |       |                  |
| 1   | 1 | 0 | 0b100 | This is CS 203: | 1 | 0 | 0b110 | Advanced Compute |
| 2   | 0 | 0 |       |                 | 0 | 0 |       |                  |
| 3   | 0 | 0 |       |                 | 0 | 0 |       |                  |
| 4   | 0 | 0 |       |                 | 0 | 0 |       |                  |
| 5   | 0 | 0 |       |                 | 0 | 0 |       |                  |
| 6   | 0 | 0 |       |                 | 0 | 0 |       |                  |
| 7   | 0 | 0 |       |                 | 0 | 0 |       |                  |



Results, in access order: miss, hit!, miss, hit!, miss, hit!, hit!, hit!, hit!

# **Cause of cache misses**



### **3Cs of misses**

- Compulsory miss
  - Cold start miss. First-time access to a block
- Capacity miss
  - The working set size of an application is bigger than cache size
- Conflict miss
  - Required data was replaced by block(s) mapping to the same set
  - Similar to a collision in a hash table








### Simulate a direct-mapped cache

|        | V | D | Tag  | Data |
|--------|---|---|------|------|
| 0      | 1 | 0 | 0b10 |      |
| 1      | 1 | 0 | 0b10 |      |
| 2      | 0 | 0 |      |      |
| 3      | 0 | 0 |      |      |
| 4      | 0 | 0 |      |      |
| 5      | 0 | 0 |      |      |
| 6      | 0 | 0 |      |      |
| 7      | 0 | 0 |      |      |
| 8      | 0 | 0 |      |      |
| 9      | 0 | 0 |      |      |
| 10     | 0 | 0 |      |      |
| 11     | 0 | 0 |      |      |
| 12     | 0 | 0 |      |      |
| 13     | 0 | 0 |      |      |
| 14     | 0 | 0 |      |      |
| 15     | 0 | 0 |      |      |

| Access | Tag  | Index | Offset | Result          |
|--------|------|-------|--------|-----------------|
| 1      | 0b10 | 0000  | 0000   | compulsory miss |
| 2      | 0b10 | 0000  | 1000   | hit!            |
| 3      | 0b10 | 0001  | 0000   | compulsory miss |
| 4      | 0b10 | 0001  | 0100   | hit!            |
| 5      | 0b11 | 0001  | 0000   | compulsory miss |
| 6      | 0b10 | 0000  | 0000   | hit!            |
| 7      | 0b10 | 0000  | 1000   | hit!            |
| 8      | 0b10 | 0001  | 0000   | conflict miss   |
| 9      | 0b10 | 0001  | 0100   | hit!            |






### Simulate a 2-way cache

| Set | V | D | Tag   | Data | V | D | Tag   | Data |
|-----|---|---|-------|------|---|---|-------|------|
| 0   | 1 | 0 | 0b100 |      | 0 | 0 |       |      |
| 1   | 1 | 0 | 0b100 |      | 1 | 0 | 0b110 |      |
| 2   | 0 | 0 |       |      | 0 | 0 |       |      |
| 3   | 0 | 0 |       |      | 0 | 0 |       |      |
| 4   | 0 | 0 |       |      | 0 | 0 |       |      |
| 5   | 0 | 0 |       |      | 0 | 0 |       |      |
| 6   | 0 | 0 |       |      | 0 | 0 |       |      |
| 7   | 0 | 0 |       |      | 0 | 0 |       |      |

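| Access | Address      | Result          |
|--------|--------------|-----------------|
| 1      | 0b1000000000 | compulsory miss |
| 2      | 0b1000001000 | hit!            |
| 3      | 0b1000010000 | compulsory miss |
| 4      | 0b1000010100 | hit!            |
| 5      | 0b1100010000 | compulsory miss |
| 6      | 0b1000000000 | hit!            |
| 7      | 0b1000001000 | hit!            |
| 8      | 0b1000010000 | hit!            |
| 9      | 0b1000010100 | hit!            |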

# Improving 3Cs

### Improvement of 3Cs

- 3Cs and A, B, C of caches
  - Compulsory miss
    - Increase B: increases the miss penalty (more data must be fetched from the lower hierarchy)
  - Capacity miss
    - Increase C: increase cost, access time, power
  - Conflict miss
    - Increase A: increase access time and power
- Or modify the memory access pattern of your program!



# Programming and memory performance

# Data layout

# Memory addressing/alignment

- Almost every popular ISA uses "byte-addressing" to access memory locations
- Instructions generally work faster when the given memory address is aligned
  - Aligned: if an instruction accesses an object of size n at address X, the access is aligned if  $X \mod n = 0$.
  - Some architectures/processors do not support unaligned access at all
  - Therefore, compilers only allocate objects at "aligned" addresses



### **Array of structures or structure of arrays**

|                          | Array of objects | Object of arrays |
|--------------------------|------------------|------------------|
| Definition               | <pre>struct grades {<br>    int id;<br>    double *homework;<br>    double average;<br>};</pre> | <pre>struct grades {<br>    int *id;<br>    double **homework;<br>    double *average;<br>};</pre> |
| Memory layout            | ID, \*homework, average, ID, \*homework, average, ... | ID, ID, ...; homework, homework, ...; average, average, ... |
| Average of each homework | <pre>for(i=0;i&lt;homework_items; i++)<br>{<br>  gradesheet[total_number_students].homework[i] = 0.0;<br>  for(j=0;j&lt;total_number_students;j++)<br>    gradesheet[total_number_students].homework[i] += gradesheet[j].homework[i];<br>  gradesheet[total_number_students].homework[i] /= (double)total_number_students;<br>}</pre> | <pre>for(i = 0;i &lt; homework_items; i++)<br>{<br>  gradesheet.homework[i][total_number_students] = 0.0;<br>  for(j = 0; j &lt; total_number_students; j++)<br>  {<br>    gradesheet.homework[i][total_number_students] += gradesheet.homework[i][j];<br>  }<br>  gradesheet.homework[i][total_number_students] /= total_number_students;<br>}</pre> |





# Which data structure performs better?


- Suppose your workload calculates the average score of one homework across all students. Which data structure would deliver better performance? What if we want to calculate the average score for each student?
  - A. Array of objects
  - B. Object of arrays

### **Column-store or row-store**

- If you're designing an in-memory database system, will you be using column-store or row-store if the most frequently used query looks like `select Lastname, Firstname from table`?

| RowId | EmpId | Lastname | Firstname | Salary |
|-------|-------|----------|-----------|--------|
| 1     | 10    | Smith    | Joe       | 40000  |
| 2     | 12    | Jones    | Mary      | 50000  |
| 3     | 11    | Johnson  | Cathy     | 44000  |
| 4     | 22    | Jones    | Bob       | 55000  |

- column-store: stores data tables column by column
  - 10:001,12:002,11:003,22:004;
  - Smith:001,Jones:002,Johnson:003,Jones:004;
  - Joe:001,Mary:002,Cathy:003,Bob:004;
  - 40000:001,50000:002,44000:003,55000:004;
- row-store: stores data tables row by row
  - 001:10,Smith,Joe,40000;
  - 002:12,Jones,Mary,50000;
  - 003:11,Johnson,Cathy,44000;
  - 004:22,Jones,Bob,55000;

# Loop interchange/fission/fusion

# Demo — programmer & performance

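|                    | i outer, j inner | j outer, i inner |
|--------------------|------------------|------------------|
| Code               | <pre>for(i = 0; i &lt; ARRAY_SIZE; i++)<br>{<br>  for(j = 0; j &lt; ARRAY_SIZE; j++)<br>  {<br>    c[i][j] = a[i][j]+b[i][j];<br>  }<br>}</pre> | <pre>for(j = 0; j &lt; ARRAY_SIZE; j++)<br>{<br>  for(i = 0; i &lt; ARRAY_SIZE; i++)<br>  {<br>    c[i][j] = a[i][j]+b[i][j];<br>  }<br>}</pre> |
| Complexity         | $O(n^2)$         | $O(n^2)$         |
| Instruction count? | Same             | Same             |
| Clock rate         | Same             | Same             |
| CPI                | Better           | Worse            |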







### **AMD Phenom II**

- D-L1 Cache configuration of AMD Phenom II
  - Size 64KB, 2-way set associativity, 64B block, LRU policy, write-allocate, write-back, and assuming 32-bit address.

```
int a[16384], b[16384], c[16384];
/* c = 0x10000, a = 0x20000, b = 0x30000 */
for(i = 0; i < 512; i++) {
    c[i] = a[i] + b[i];
   //load a, b, and then store to c
}
```

What's the data cache miss rate for this code?

- A. 6.25%
- B. 56.25%
- C. 66.67%
- D. 68.75%
- E. 100%

Working out the cache geometry:

- C = ABS
- 64KB = 2 \* 64 \* S
- S = 512
- offset = lg(64) = 6 bits
- index = lg(512) = 9 bits
- tag = 64 - lg(512) - lg(64) = 49 bits

### **Loop Fusion**

```
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
```

Before fusion: 2 misses per access to a & c. After fusion: one miss per access.

# Blocking

### **Case study: Matrix Multiplication**

```
for(i = 0; i < ARRAY_SIZE; i++) {
  for(j = 0; j < ARRAY_SIZE; j++) {
    for(k = 0; k < ARRAY_SIZE; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
  }
}
```

- Algorithm class tells you it's O(n<sup>3</sup>)
- If n=1024, it takes about 1 sec
- How long does it take when n=2048?



# **Matrix Multiplication**



- If each dimension of your matrix is 2048
  - Each row takes 2048\*8 bytes = 16KB
  - The L1 cache (L1 \$) of the Intel Core i7 is 32KB, 8-way, with 64-byte blocks
  - You can hold at most 2 rows/columns of each matrix!
  - You need the same row again when j increases!
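The arithmetic behind these bullets, written out as a rough sketch. It assumes 8-byte elements (implied by the 2048\*8 above) and takes the 1-second figure for n=1024 from the earlier slide:

$$
2048 \times 8\ \text{B} = 16\ \text{KB per row}, \qquad \frac{32\ \text{KB}}{16\ \text{KB per row}} = 2\ \text{rows}
$$

$$
\frac{T(2048)}{T(1024)} \approx \left(\frac{2048}{1024}\right)^{3} = 8
$$

The $O(n^3)$ operation count alone predicts about 8 seconds for n=2048, assuming the cost per operation stays constant. Because only about two rows fit in L1, the miss rate is likely to rise with n, so the measured time may well be longer than this estimate.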

### Very likely a miss if array is large




### **Block algorithm for matrix multiplication**

- Discover the cache miss rate
  - valgrind --tool=cachegrind cmd
    - cachegrind is a tool for profiling cache performance
  - Performance counter
    - Intel<sup>®</sup> Performance Counter Monitor http://www.intel.com/software/pcm/



### **Block algorithm for matrix multiplication**



You only need to hold these sub-matrices in your cache

```
for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {
    for(k = 0; k < ARRAY_SIZE; k+=(ARRAY_SIZE/n)) {
      for(ii = i; ii < i+(ARRAY_SIZE/n); ii++)
        for(jj = j; jj < j+(ARRAY_SIZE/n); jj++)
          for(kk = k; kk < k+(ARRAY_SIZE/n); kk++)
            c[ii][jj] += a[ii][kk]*b[kk][jj];
    }
  }
}
```

### **Matrix Transpose**

```
// Transpose matrix b into b_t
// (visit every element and assign, so b_t holds the full transpose)
for(i = 0; i < ARRAY_SIZE; i++) {
  for(j = 0; j < ARRAY_SIZE; j++) {
    b_t[i][j] = b[j][i];
  }
}

for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {
    for(k = 0; k < ARRAY_SIZE; k+=(ARRAY_SIZE/n)) {
      for(ii = i; ii < i+(ARRAY_SIZE/n); ii++)
        for(jj = j; jj < j+(ARRAY_SIZE/n); jj++)
          for(kk = k; kk < k+(ARRAY_SIZE/n); kk++)
            // Compute on b_t
            c[ii][jj] += a[ii][kk]*b_t[jj][kk];
    }
  }
}
```
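A self-contained sketch of the blocked loop nest, checked against the plain triple loop. The function names, the flat row-major layout, and the test data are assumptions for illustration; as in the slides, `n` is the number of tiles per dimension and must divide the matrix size.

```c
#include <assert.h>
#include <stdlib.h>

/* c += a * b for size x size row-major matrices, plain triple loop. */
void matmul_naive(int size, const double *a, const double *b, double *c)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
                c[i * size + j] += a[i * size + k] * b[k * size + j];
}

/* Same product computed tile by tile; each tile is (size/n) x (size/n). */
void matmul_blocked(int size, int n, const double *a, const double *b, double *c)
{
    assert(n > 0 && size % n == 0);
    int tile = size / n;
    for (int i = 0; i < size; i += tile)
        for (int j = 0; j < size; j += tile)
            for (int k = 0; k < size; k += tile)
                for (int ii = i; ii < i + tile; ii++)
                    for (int jj = j; jj < j + tile; jj++)
                        for (int kk = k; kk < k + tile; kk++)
                            c[ii * size + jj] += a[ii * size + kk] * b[kk * size + jj];
}

/* Returns 1 if both versions produce identical results on small integer data. */
int blocked_matches_naive(int size, int n)
{
    size_t count = (size_t)size * (size_t)size;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c1 = calloc(count, sizeof *c1);
    double *c2 = calloc(count, sizeof *c2);
    int same = 0;

    if (a && b && c1 && c2) {
        for (size_t x = 0; x < count; x++) {
            a[x] = (double)(x % 7);  /* small integers keep the sums exact */
            b[x] = (double)(x % 5);
        }
        matmul_naive(size, a, b, c1);
        matmul_blocked(size, n, a, b, c2);
        same = 1;
        for (size_t x = 0; x < count; x++)
            if (c1[x] != c2[x]) same = 0;
    }
    free(a); free(b); free(c1); free(c2);
    return same;
}
```

Blocking changes only the order in which the same multiply-adds happen, so the results match; the benefit is that each tile of `a`, `b`, and `c` is reused while it is still in the cache.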

# Prefetching

### **Characteristic of memory accesses**

```
for(i = 0; i < 1000000; i++) {
     D[i] = rand();
}
```





### Prefetching

```
for(i = 0; i < 1000000; i++) {
     D[i] = rand();
     // prefetch D[i+8] if i % 8 == 0
}
```



# Prefetching

- Identify the access pattern and proactively fetch data/instructions before the application asks for them
  - Trigger the cache miss earlier to eliminate the miss when the application needs the data/instructions
- Hardware prefetch
  - The processor can keep track of the distance between misses. If there is a pattern, fetch miss\_data\_address+distance on a miss
- Software prefetch
  - Load data into X0
  - Use prefetch instructions

### Demo

- x86 provides prefetch instructions
- As a programmer, you may insert \_mm\_prefetch in x86 programs to perform software prefetching in your code
- gcc also has a flag "-fprefetch-loop-arrays" to automatically insert software prefetch instructions
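A minimal sketch of the "prefetch D[i+8] if i % 8 == 0" idea from the earlier slide. To stay portable across GCC and Clang targets it uses the compiler builtin `__builtin_prefetch` rather than the x86-specific `_mm_prefetch`. The function name `fill_with_prefetch` and the distance of 8 elements (one 64-byte block of 8-byte doubles) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PREFETCH_DISTANCE 8  /* assumed: 8 doubles = one 64-byte block */

/* Fill d[0..count-1] with rand() values, hinting the next block ahead of use. */
void fill_with_prefetch(double *d, size_t count, unsigned seed)
{
    assert(d != NULL);
    srand(seed);
    for (size_t i = 0; i < count; i++) {
        if (i % PREFETCH_DISTANCE == 0 && i + PREFETCH_DISTANCE < count) {
            /* Second argument 1: prefetch for a write.
             * Third argument 3: high temporal locality (keep in all levels). */
            __builtin_prefetch(&d[i + PREFETCH_DISTANCE], 1, 3);
        }
        d[i] = rand();
    }
}
```

A prefetch is only a hint: it never changes the values computed, and whether it helps depends on the hardware prefetcher and on how much work separates the hint from the use.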

### Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers

Norman P. Jouppi



# Victim cache



- A small cache that captures the evicted blocks
  - Can be built as fully associative since it's small
  - Consulted when there is a miss
  - Swap the entry if hit in victim cache
  - Athlon has an 8-entry victim cache
- Reduces conflict misses
- Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
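The behavior described above can be sketched with a toy simulator: a direct-mapped L1 backed by a small fully associative victim cache with LRU replacement. The sizes (4 KB L1, 64-byte blocks, 4 victim entries) and all names are assumptions chosen for illustration, not a model of any particular processor.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define L1_SETS        64  /* assumed: 4 KB direct-mapped, 64-byte blocks */
#define VICTIM_ENTRIES 4
#define BLOCK_BITS     6

typedef struct { int valid; uint32_t block; } L1Line;  /* block = addr >> BLOCK_BITS */
typedef struct { int valid; uint32_t block; unsigned last_used; } VictimLine;

static L1Line l1[L1_SETS];
static VictimLine victim[VICTIM_ENTRIES];
static unsigned now;

void victim_sim_reset(void)
{
    memset(l1, 0, sizeof l1);
    memset(victim, 0, sizeof victim);
    now = 0;
}

/* Store an evicted L1 block, replacing an invalid entry or else the LRU one. */
static void victim_insert(uint32_t block)
{
    int slot = 0;
    for (int v = 0; v < VICTIM_ENTRIES; v++) {
        if (!victim[v].valid) { slot = v; break; }
        if (victim[v].last_used < victim[slot].last_used) slot = v;
    }
    victim[slot].valid = 1;
    victim[slot].block = block;
    victim[slot].last_used = ++now;
}

/* Returns 1 if the access must go to the next memory level, 0 otherwise. */
int victim_sim_access(uint32_t addr, int use_victim)
{
    uint32_t block = addr >> BLOCK_BITS;
    L1Line *line = &l1[block % L1_SETS];

    if (line->valid && line->block == block)
        return 0;  /* L1 hit */

    if (use_victim) {
        for (int v = 0; v < VICTIM_ENTRIES; v++) {
            if (victim[v].valid && victim[v].block == block) {
                /* Victim hit: swap the requested block with the L1 block. */
                if (line->valid) {
                    victim[v].block = line->block;
                    victim[v].last_used = ++now;
                } else {
                    victim[v].valid = 0;
                }
                line->valid = 1;
                line->block = block;
                return 0;
            }
        }
        if (line->valid)
            victim_insert(line->block);  /* capture the block being evicted */
    }
    line->valid = 1;
    line->block = block;
    return 1;
}

/* Alternate between two addresses that conflict in L1; count next-level accesses. */
int conflict_demo(int rounds, int use_victim)
{
    int misses = 0;
    assert(rounds > 0);
    victim_sim_reset();
    for (int r = 0; r < rounds; r++) {
        misses += victim_sim_access(0x0000, use_victim);
        misses += victim_sim_access(0x1000, use_victim);  /* 4 KB apart: same L1 set */
    }
    return misses;
}
```

In this toy model, two blocks that ping-pong in one L1 set miss on every access without the victim cache, but only on their first access each once the victim cache catches the evicted block.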

### Victim cache v.s. miss caching

- Both reduce conflict misses
- Victim cache uses cache blocks more efficiently: it swaps blocks on a miss
  - Miss caching keeps a copy of the missing data, so a block can be in both L1 and the miss cache
  - Victim cache only holds a block once it has been kicked out of L1
- Victim cache captures conflict misses better
  - Miss caching captures every missing block



Figure 3-3: Conflict misses removed by miss caching

Figure 3-5: Conflict misses removed by victim caching



# Advanced Hardware Techniques in Improving Memory Performance

### Without banks



### **Multibanks & non-blocking caches**








# **Early Restart and Critical Word First**

- Don't wait for the full block to be loaded before restarting the CPU
  - Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Most useful with large blocks
- Spatial locality limits the benefit of early restart: the CPU often wants the next sequential word soon, and that word may not have arrived yet.
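To make "wrapped fetch" concrete, the sketch below computes the order in which the words of a block arrive when the requested word is fetched first and the rest follow with wraparound. The function name and the simple modulo ordering are assumptions; real memory systems may use other burst orders.

```c
#include <assert.h>

/* Fill order[0..words_per_block-1] with word indices in arrival order:
 * the requested word first, then the following words, wrapping around. */
void wrapped_fetch_order(int words_per_block, int requested_word, int *order)
{
    assert(words_per_block > 0);
    assert(requested_word >= 0 && requested_word < words_per_block);
    for (int n = 0; n < words_per_block; n++)
        order[n] = (requested_word + n) % words_per_block;
}
```

For an 8-word block with a miss on word 5, the arrival order is 5, 6, 7, 0, 1, 2, 3, 4. The CPU can resume after the first transfer instead of after the eighth.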



- If the target "set" is not full, select an empty/invalidated block
- If the target "set" is full, select a victim block using some policy
- Fetch the requested block from the lower-level memory hierarchy



- Every write to lower-level memory first goes to a small SRAM buffer.
  - A store does not incur data hazards, but the pipeline has to stall if the write misses
  - The write buffer continues writing data to lower-level memory
  - The processor/higher-level memory can respond as soon as the data is written to the write buffer.
- Write merge
  - Since applications have locality, evicted data are highly likely to have neighboring addresses. The write buffer delays the writes and allows these neighboring data to be grouped together.
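Write merging can be sketched as a small buffer whose entries each cover several neighboring words: a store to an address that falls inside an existing entry is merged into it instead of taking a new slot. The entry count, entry width, and all names below are assumptions, and draining entries to lower-level memory is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES      4  /* assumed buffer depth */
#define WORDS_PER_ENTRY 4  /* assumed: each entry covers 4 words */
#define WORD_BYTES      8

typedef struct {
    int      valid;
    uint64_t base;                   /* address of word 0 in this entry */
    uint8_t  word_valid;             /* bit w set: data[w] holds a pending write */
    uint64_t data[WORDS_PER_ENTRY];
} WriteBufferEntry;

typedef struct {
    WriteBufferEntry entry[WB_ENTRIES];
} WriteBuffer;

void wb_init(WriteBuffer *wb)
{
    memset(wb, 0, sizeof *wb);
}

/* Buffer one word-sized store. Returns 1 if accepted (merged or newly
 * allocated), 0 if the buffer is full and the store would have to stall. */
int wb_write(WriteBuffer *wb, uint64_t addr, uint64_t value)
{
    assert(addr % WORD_BYTES == 0);
    uint64_t span = (uint64_t)WORDS_PER_ENTRY * WORD_BYTES;
    uint64_t base = addr - addr % span;
    int word = (int)((addr - base) / WORD_BYTES);
    int free_slot = -1;

    for (int e = 0; e < WB_ENTRIES; e++) {
        WriteBufferEntry *ent = &wb->entry[e];
        if (ent->valid && ent->base == base) {
            /* Neighboring address: merge into the existing entry. */
            ent->data[word] = value;
            ent->word_valid |= (uint8_t)(1u << word);
            return 1;
        }
        if (!ent->valid && free_slot < 0)
            free_slot = e;
    }
    if (free_slot < 0)
        return 0;  /* no matching entry and no free slot */

    WriteBufferEntry *ent = &wb->entry[free_slot];
    memset(ent, 0, sizeof *ent);
    ent->valid = 1;
    ent->base = base;
    ent->data[word] = value;
    ent->word_valid = (uint8_t)(1u << word);
    return 1;
}

/* Number of entries currently holding pending writes. */
int wb_entries_in_use(const WriteBuffer *wb)
{
    int used = 0;
    for (int e = 0; e < WB_ENTRIES; e++)
        used += wb->entry[e].valid;
    return used;
}
```

Four stores to consecutive words occupy one entry here rather than four, so the buffer fills more slowly and each transfer to lower-level memory carries more data.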