Motivation
- Most applications run at <10% of the “peak” performance
- Much of the performance is lost on a single processor
- The code running on one processor often runs at only 10-20% of the processor peak
- Most of that loss is in the memory system
- Moving data takes much longer than arithmetic and logic
- We need to look under the hood of modern processors
Uniprocessor Models
Idealized Uniprocessor Model | Slightly Less Idealized Uniprocessor Model |
Processor names variables: integers, floats, pointers, arrays, structures, etc. | These are really words, e.g., 64-bit doubles, 32-bit ints, bytes, etc. |
Processor performs operations on those variables: arithmetic, logical operations, etc. | Only performs these operations on values in registers |
Processor controls the order, as specified by the program: branches, loops, function calls, etc. | Same as the idealized model |
Idealized cost: each operation (add, multiply, etc.) has roughly the same cost | Roughly the same cost, assuming control, loads, and stores are free |
Memory Hierarchy
The memory hierarchy of a single processor can contain multiple levels of cache, memory, and storage
Latency and Bandwidth
Latency: The time it takes to get from one point to another. For example, a memory access latency of 100 ns means it takes 100 ns to retrieve one byte/word from memory.
Bandwidth: The amount of data transferred in a fixed period of time. For example, the memory bandwidth might be 100 MB/s.
Latency | Size |
1ns | KB |
10ns | MB |
100ns | GB |
10ms | TB |
10sec | PB |
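As a rough sanity check (using a common first-order transfer model that is not spelled out in these notes), the time to fetch a block is approximately the latency plus the block size divided by the bandwidth. With the 100 ns latency and 100 MB/s bandwidth from the example above, one 64 B cache line costs roughly

$$T \approx \text{latency} + \frac{\text{size}}{\text{bandwidth}} = 100\,\text{ns} + \frac{64\,\text{B}}{100\,\text{MB/s}} \approx 100\,\text{ns} + 640\,\text{ns} = 740\,\text{ns}$$

so both the latency and the bandwidth of each level of the hierarchy matter.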
Cache and Locality
Caches
Reads from and writes to main memory happen in contiguous blocks, i.e., cache lines
- Caches store previously accessed cache lines to hopefully reduce references to memory
- If the data required by the processor resides in cache we get a cache hit, fast!
- Data accesses that are not in the cache lead to a cache miss, slow!
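As an illustration (not from the notes; the array size and strides are my choices), the sketch below reads the same array with unit stride and with a stride of one cache line. The second loop gets no reuse out of each 64-byte line it brings in, so it pays roughly one miss per load instead of one miss per eight loads.

```c
#include <stddef.h>

/* Unit stride: 8 consecutive doubles share each 64 B cache line,
 * so roughly 1 miss per 8 loads. */
double sum_unit_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Stride of 8 doubles (64 B): every load lands in a new cache line,
 * so roughly 1 miss per load and the rest of each line goes unused. */
double sum_cache_line_stride(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 8)
        s += a[i];
    return s;
}
```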
Approaches to Handling Memory Latency
- Temporal locality: Reuse data that is already in cache
- Eliminate memory operations by saving values in small, fast memory (cache or registers) and reusing them (bandwidth filtering)
- Spatial locality: Operate on data that are stored close to each other in memory and can be brought into cache together
- Take advantage of the better bandwidth by getting a chunk of memory into cache (or registers) and using the whole chunk
- Matrix Multiplication revisited
- Each element is a double and is 8 bytes
- The matrix is 4K by 4K: How many elements does a row have?
- 4K
- Cache line size is 64 B: how many elements fit in a cache line?
- 8
- The cache is 256 KB: how many rows can the cache store?
- Cache size in bytes: 256 * 1024
- Row size in bytes: 4 * 1024 * 8
- $\frac{256 \times 1024}{4 \times 1024 \times 8} = 8$
- Let's assume each memory reference gives us 8 consecutive elements (one cache line)
- The cache stores 32K elements (8 rows * 4K) at a time
- Data Access Pattern for i, j, k
```c
// Version 1: i, j, k loop order
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
            C[i][j] += A[i][k] * B[k][j];
```
- If each memory reference brings in 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations? (Computing one element C[i][j] is a length-4K dot product: 4K multiplies + 4K adds = 8K flops.)
- 1 for C (the single element C[i][j])
- 512 for A (the contiguous row A[i][0..4095], i.e., 4096 / 8 cache lines)
- 4096 for B (each B[k][j] in the column lies in a different cache line)
- Poor spatial locality for B
```c
// Version 2: i, k, j loop order
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            C[i][j] += A[i][k] * B[k][j];
```
- If each memory reference brings in 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations? (Here the 8K flops are one full pass of the inner j loop for a fixed (i, k): 4K multiplies + 4K adds.)
- 512 for C (the contiguous row C[i][0..4095])
- 1 for A (the single element A[i][k])
- 512 for B (the contiguous row B[k][0..4095])
- Improved spatial locality: the rows of C and B are traversed with unit stride
- Also temporal locality -> the element of A is loaded once and reused across the whole inner loop (see the sketch below)
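A minimal sketch of how that reuse is usually expressed, using the same A, B, C, and n as the snippets above (hoisting A[i][k] into a local scalar is my illustration; an optimizing compiler will typically do this on its own):

```c
// Version 2 with A[i][k] kept in a register-resident scalar
for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) {
        double a_ik = A[i][k];            // loaded once, reused n times (temporal locality)
        for (int j = 0; j < n; ++j)
            C[i][j] += a_ik * B[k][j];    // unit-stride access to C and B (spatial locality)
    }
```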
Cold and Warm Starts
- Cold Cache: When the data in the cache is stale or missing: need to read from memory
- Warm Cache: When relevant data is already in the cache.
If you run a code once versus running it 10 times and averaging over those runs, the timings can differ: the first run starts with a cold cache, while later runs benefit from a warm one (see the timing sketch below).
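A minimal timing sketch (the array size and the clock_gettime-based timer are my choices, not from the notes): the first pass reads the array from memory, later passes find it in cache, so per-pass times typically drop after pass 0.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 32 * 1024;                 /* 32K doubles = 256 KB, roughly one cache's worth */
    double *a = calloc(n, sizeof *a);
    volatile double sum = 0.0;            /* volatile keeps the loop from being optimized away */

    for (int pass = 0; pass < 3; ++pass) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; ++i)
            sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
        printf("pass %d: %.1f us\n", pass, us);   /* pass 0: cold cache, later passes: warm */
    }
    free(a);
    return 0;
}
```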
False Sharing in multicores
The entire cache line is loaded into the caches of both processors.
The cache line is bounced between the processors because x and y lie in the same cache line and because the caches have to be kept coherent! This is called false sharing (see the sketch below).
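A minimal sketch of the situation (the struct layout, iteration count, and pthread usage are my choices): two threads update x and y, which are distinct variables but live in the same 64-byte cache line, so every write invalidates the line in the other core's cache. Padding the struct so that x and y land in different lines removes the effect.

```c
#include <pthread.h>
#include <stdio.h>

struct { long x; long y; } shared;   /* x and y share one 64 B cache line         */
/* struct { long x; char pad[56]; long y; } shared;   padding would separate them  */

void *bump_x(void *unused) { (void)unused; for (long i = 0; i < 100000000; ++i) shared.x++; return NULL; }
void *bump_y(void *unused) { (void)unused; for (long i = 0; i < 100000000; ++i) shared.y++; return NULL; }

int main(void) {
    pthread_t tx, ty;
    pthread_create(&tx, NULL, bump_x, NULL);   /* likely runs on one core     */
    pthread_create(&ty, NULL, bump_y, NULL);   /* likely runs on another core */
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    printf("x = %ld, y = %ld\n", shared.x, shared.y);
    return 0;
}
```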
Parallelism in single processors
Pipelining
Dave Patterson’s laundry example: 4 people doing laundry, each load taking wash (30 min) + dry (40 min) + fold (20 min) = 90 min
Latency: 90 min per load
Sequential execution for 4 loads of laundry: 90 * 4 = 360 min = 6 hours
Pipelined execution: 30 + 40 * 4 + 20 = 210 min = 3.5 hours
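One way to see the pipelined figure (a generic pipeline formula, not stated in the notes): after the first load finishes, each additional load completes one slowest-stage time later, so

$$T_{\text{pipelined}} \approx T_{\text{one load}} + (n-1)\cdot T_{\text{slowest stage}} = 90 + 3 \cdot 40 = 210\ \text{min} = 3.5\ \text{hours}$$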
Bandwidth = loads/hour
Sequential Bandwidth: 4 loads / 6 hours ≈ 0.67 loads/hour
Pipelined Bandwidth: 4 loads / 3.5 hours ≈ 1.14 loads/hour
Pipelining helps bandwidth but not latency
Bandwidth limited by slowest pipeline stage