Motivation
- Most applications run at <10% of the “peak” performance
- Much of the performance is lost on a single processor
- The code running on one processor often runs at only 10-20% of the processor peak
- Most of that loss is in the memory system
- Moving data takes much longer than arithmetic and logic
- We need to look under the hood of modern processors
Uniprocessor Models
Idealized Uniprocessor Model | Slightly Less Idealized Uniprocessor Model |
Processor names variables: integers, floats, pointers, arrays, structures, etc. | These are really words, e.g., 64-bit doubles, 32-bit ints, bytes, etc. |
Processor performs operations on those variables: arithmetic, logical operations, etc. | Only performs these operations on values in registers |
Processor controls the order, as specified by the program: branches, loops, function calls, etc. | Same as the idealized model |
Idealized cost: each operation (add, multiply, etc.) has roughly the same cost | Roughly the same cost, assuming control, loads, and stores are free |
Memory Hierarchy
The memory hierarchy of a single processor can contain multiple levels of cache, memory, and storage
Latency and Bandwidth
Latency: The time it takes to get from one point to another. For example, a memory access latency of 100 ns means it takes 100 ns to retrieve one byte/word from memory.
Bandwidth: The amount of data transferred in a fixed period of time. For example, the memory bandwidth might be 100 MB/s.
Latency | Size |
1ns | KB |
10ns | MB |
100ns | GB |
10ms | TB |
10sec | PB |
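As a rough sanity check (using a common first-order transfer model that is not spelled out in these notes), the time to fetch a block is approximately the latency plus the block size divided by the bandwidth. With the 100 ns latency and 100 MB/s bandwidth from the example above, one 64 B cache line costs roughly

$$T \approx \text{latency} + \frac{\text{size}}{\text{bandwidth}} = 100\,\text{ns} + \frac{64\,\text{B}}{100\,\text{MB/s}} \approx 100\,\text{ns} + 640\,\text{ns} = 740\,\text{ns}$$

so both the latency and the bandwidth of each level of the hierarchy matter.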
Cache and Locality
Caches
Reads from and writes to main memory happen in contiguous blocks, i.e., cache lines
- Caches store previously accessed cache lines to hopefully reduce references to memory
- If the data required by the processor resides in cache we get a cache hit, fast!
- Data accesses that are not in the cache lead to a cache miss, slow!
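As an illustration (not from the notes; the array size and strides are my choices), the sketch below reads the same array with unit stride and with a stride of one cache line. The second loop gets no reuse out of each 64-byte line it brings in, so it pays roughly one miss per load instead of one miss per eight loads.

```c
#include <stddef.h>

/* Unit stride: 8 consecutive doubles share each 64 B cache line,
 * so roughly 1 miss per 8 loads. */
double sum_unit_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Stride of 8 doubles (64 B): every load lands in a new cache line,
 * so roughly 1 miss per load and the rest of each line goes unused. */
double sum_cache_line_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 8)
        s += a[i];
    return s;
}
```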
Approaches to Handling Memory Latency
- Temporal locality: Reuse data that is already in cache
- Eliminate memory operations by saving values in small, fast memory (cache or registers) and reusing them (bandwidth filtering)
- Spatial locality: Operate on data that are stored close to each other in memory and can be brought into cache together
- Take advantage of the better bandwidth by getting a chunk of memory into cache (or registers) and using the whole chunk
- Matrix Multiplication revisited
- Each element is a double and is 8 bytes
- The matrix is 4K by 4K: How many elements does a row have?
- 4K
- Cache line size is 64 B: how many elements fit in a cache line?
- 8
- The cache is 256 KB: how many rows can the cache store?
- Cache size in bytes: 256 * 1024
- Row size in bytes: 4 * 1024 * 8
- $\frac{256 \times 1024}{4 \times 1024 \times 8} = 8$
- Let's assume each memory reference gives us 8 consecutive elements (one cache line)
- The cache stores 32K elements (8 rows * 4K) at a time
- Data Access Pattern for i, j, k
```c
// Version 1: i, j, k loop order
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i][j] += A[i][k] * B[k][j];
```
- If each memory reference brings in 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations? (Computing one element C[i][j] is a length-4K dot product: 4K multiplies + 4K adds = 8K flops.)
- 1 for C (the single element C[i][j])
- 512 for A (the contiguous row A[i][0..4095], i.e., 4096 / 8 cache lines)
- 4096 for B (each B[k][j] in the column lies in a different cache line)
- Poor spatial locality for B
```c
// Version 2: i, k, j loop order
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            C[i][j] += A[i][k] * B[k][j];
```
- If each memory reference brings in 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations? (Here the 8K flops are one full pass of the inner j loop for a fixed (i, k): 4K multiplies + 4K adds.)
- 512 for C (the contiguous row C[i][0..4095])
- 1 for A (the single element A[i][k])
- 512 for B (the contiguous row B[k][0..4095])
- Improved spatial locality: the rows of C and B are traversed with unit stride
- Also temporal locality -> the element of A is loaded once and reused across the whole inner loop (see the sketch below)
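A minimal sketch of how that reuse is usually expressed, using the same A, B, C, and n as the snippets above (hoisting A[i][k] into a local scalar is my illustration; an optimizing compiler will typically do this on its own):

```c
// Version 2 with A[i][k] kept in a register-resident scalar
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) {
        double a_ik = A[i][k];            // loaded once, reused n times (temporal locality)
        for (int j = 0; j < n; ++j)
            C[i][j] += a_ik * B[k][j];    // unit-stride access to C and B (spatial locality)
    }
```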
Cold and Warm Starts
- Cold Cache: When the data in the cache is stale or missing: need to read from memory
- Warm Cache: When relevant data is already in the cache.
If you run a code once versus running it 10 times and averaging over those runs, the timings can differ: the first run starts with a cold cache, while later runs benefit from a warm one (see the timing sketch below).
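A minimal timing sketch (the array size and the clock_gettime-based timer are my choices, not from the notes): the first pass reads the array from memory, later passes find it in cache, so per-pass times typically drop after pass 0.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 32 * 1024;                 /* 32K doubles = 256 KB, roughly one cache's worth */
    double *a = calloc(n, sizeof *a);
    volatile double sum = 0.0;            /* volatile keeps the loop from being optimized away */

    for (int pass = 0; pass < 3; ++pass) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; ++i)
            sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
        printf("pass %d: %.1f us\n", pass, us);   /* pass 0: cold cache, later passes: warm */
    }
    free(a);
    return 0;
}
```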
False Sharing in multicores
The entire cache line is loaded into the caches of both processors.
The cache line is bounced between the processors because x and y lie in the same cache line and because the caches have to be kept coherent! This is called false sharing (see the sketch below).
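A minimal sketch of the situation (the struct layout, iteration count, and pthread usage are my choices): two threads update x and y, which are distinct variables but live in the same 64-byte cache line, so every write invalidates the line in the other core's cache. Padding the struct so that x and y land in different lines removes the effect.

```c
#include <pthread.h>
#include <stdio.h>

struct { long x; long y; } shared;   /* x and y share one 64 B cache line         */
/* struct { long x; char pad[56]; long y; } shared;   padding would separate them  */

void *bump_x(void *unused) { (void)unused; for (long i = 0; i < 100000000; ++i) shared.x++; return NULL; }
void *bump_y(void *unused) { (void)unused; for (long i = 0; i < 100000000; ++i) shared.y++; return NULL; }

int main(void) {
    pthread_t tx, ty;
    pthread_create(&tx, NULL, bump_x, NULL);   /* likely runs on one core     */
    pthread_create(&ty, NULL, bump_y, NULL);   /* likely runs on another core */
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    printf("x = %ld, y = %ld\n", shared.x, shared.y);
    return 0;
}
```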
Parallelism in single processors
Pipelining
Dave Patterson’s laundry example: 4 people doing laundry, each load taking wash (30 min) + dry (40 min) + fold (20 min) = 90 min
Latency: 90 min per load
Sequential execution for 4 loads of laundry: 90 * 4 = 360 min = 6 hours
Pipelined execution: 30 + 40 * 4 + 20 = 210 min = 3.5 hours
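One way to see the pipelined figure (a generic pipeline formula, not stated in the notes): after the first load finishes, each additional load completes one slowest-stage time later, so

$$T_{\text{pipelined}} \approx T_{\text{one load}} + (n-1)\cdot T_{\text{slowest stage}} = 90 + 3 \cdot 40 = 210\ \text{min} = 3.5\ \text{hours}$$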
Bandwidth = loads/hour
Sequential Bandwidth: 4 loads / 6 hours ≈ 0.67 loads/hour
Pipelined Bandwidth: 4 loads / 3.5 hours ≈ 1.14 loads/hour
Pipelining helps bandwidth but not latency
Bandwidth limited by slowest pipeline stage