Single Processor Machines

CSC367 Week 1

Posted by Dino on September 17, 2021

Motivation

  • Most applications run at <10% of the “peak” performance
  • Much of the performance is lost on a single processor
    • The code running on one processor often runs at only 10-20% of the processor peak
  • Most of that loss is in the memory system
    • Moving data takes much longer than arithmetic and logic
  • We need to look under the hood of modern processors

Uniprocessor Models

| Idealized Uniprocessor Model | Slightly Less Idealized Uniprocessor Model |
| --- | --- |
| Processor names variables: integers, floats, pointers, arrays, structures, etc. | Same, except the variables are really words, e.g., 64-bit doubles, 32-bit ints, bytes, etc. |
| Processor performs operations on those variables: arithmetic, logical operations, etc. | Same, except it only performs these operations on values in registers |
| Processor controls the order, as specified by the program: branches, loops, function calls, etc. | Same |
| Idealized cost: each operation has roughly the same cost (add, multiply, etc.) | Each operation has roughly the same cost only if control, loads, and stores are free |

Memory Hierarchy

The memory hierarchy of a single processor can contain multiple levels of cache (e.g., L1, L2, L3), main memory, and storage

Latency and Bandwidth

Latency: The time it takes to get from one point to another. For example, a memory access latency of 100 ns means it takes 100 ns to retrieve one byte/word from memory.

Bandwidth: The amount of data transferred in a fixed period of time. For example, the memory bandwidth might be 100 MB/s.
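Latency and bandwidth combine into a simple model of transfer time. A sketch of the arithmetic using the example numbers above (the symbols $\alpha$ for latency, $\beta$ for bandwidth, and the 64 B transfer size are my own choices, not from the lecture):

$$T(n) \approx \alpha + \frac{n}{\beta} = 100\,\text{ns} + \frac{64\,\text{B}}{100\,\text{MB/s}} = 100\,\text{ns} + 640\,\text{ns} = 740\,\text{ns}$$

For small transfers the latency term dominates, which is one reason memory systems move whole cache lines at a time.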

| Latency | Size |
| --- | --- |
| 1 ns | KB |
| 10 ns | MB |
| 100 ns | GB |
| 10 ms | TB |
| 10 sec | PB |

Cache and Locality

Caches

Reads and writes to main memory happen in contiguous blocks, i.e. cache lines

  • Caches store previously accessed cache lines, hopefully reducing references to memory
  • If the data required by the processor resides in the cache, we get a cache hit: fast!
  • Data accesses that are not in the cache lead to a cache miss: slow!
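To see the effect of cache lines, a minimal sketch (my own illustration, not from the lecture) that sums the same array with stride 1 and with a stride of one full cache line (16 ints × 4 B = 64 B); both perform the same number of additions, but the strided version touches a new cache line on every access:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints, much larger than any cache */

    /* Sum every `stride`-th element; total work is the same for any stride. */
    static long sum_stride(const int *a, int stride) {
        long s = 0;
        for (int start = 0; start < stride; ++start)
            for (int i = start; i < N; i += stride)
                s += a[i];
        return s;
    }

    int main(void) {
        int *a = malloc(N * sizeof(int));
        for (int i = 0; i < N; ++i) a[i] = 1;

        for (int stride = 1; stride <= 16; stride *= 16) {
            clock_t t0 = clock();
            long s = sum_stride(a, stride);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("stride %2d: sum=%ld, %.3f s\n", stride, s, secs);
        }
        free(a);
        return 0;
    }

On a typical machine the stride-16 run is noticeably slower even though the arithmetic is identical, because only one element of each fetched cache line is used.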

Approaches to Handling Memory Latency

  • Temporal locality: Reuse data that is already in cache
    • Eliminate memory operations by saving values in small, fast memory (cache or registers) and reusing them (bandwidth filtering)
  • Spatial locality: Operate on data that are stored close to each other in memory and can be brought into cache together
    • Take advantage of better bandwidth by getting a chunk of memory into cache (or registers) and using the whole chunk
  • Matrix Multiplication revisited
    • Each element is a double, i.e. 8 bytes
    • The matrix is 4K by 4K: How many elements does a row have?
      • 4K
    • Cache line size is 64 B: How many elements fit in a cache line?
      • 8
    • The cache is 256 KB: How many rows can the cache store?
      • Cache size in bytes: 256 × 1024
      • Row size in bytes: 4 × 1024 × 8
      • $\frac{256 \times 1024}{4 \times 1024 \times 8} = 8$
  • Let's assume each memory reference gives us 8 consecutive elements (one cache line)
  • The cache stores 32K elements (8 rows × 4K elements) at a time
    • Data Access Pattern for i, j, k
    • // Version 1: i, j, k loop order
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < n; ++j)
              for (int k = 0; k < n; ++k)
                  C[i][j] += A[i][k] * B[k][j];

    • If each memory reference gives us 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations for one element of C (4K multiplies + 4K adds)?
      • 1 for C
      • 512 for A
      • 4096 for B
      • Poor spatial locality for B: consecutive values of k touch elements an entire row (4K doubles) apart, so every access to B fetches a new cache line
    • // Version 2: i, k, j loop order
      for (int i = 0; i < n; ++i)
          for (int k = 0; k < n; ++k)
              for (int j = 0; j < n; ++j)
                  C[i][j] += A[i][k] * B[k][j];

    • If each memory reference gives us 8 consecutive elements (one cache line), how many memory references are needed to complete the 8K floating-point operations of one inner-loop pass (fixed i and k)?
      • 512 for C
      • 1 for A
      • 512 for B
      • Improved spatial locality: B and C are now traversed along rows, stride 1
      • Also temporal locality: the single element A[i][k] is reused across the whole inner loop (a runnable comparison of the two orderings follows after this list)
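To make the difference concrete, a minimal driver that times both orderings (my own sketch, not from the lecture; the 1024 × 1024 size, flat-array layout, and clock()-based timing are assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1024   /* smaller than 4K so the demo finishes quickly */

    /* C += A * B with the i, j, k ordering: B is accessed down columns. */
    static void matmul_ijk(const double *A, const double *B, double *C) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    /* C += A * B with the i, k, j ordering: B and C are accessed along rows. */
    static void matmul_ikj(const double *A, const double *B, double *C) {
        for (int i = 0; i < N; ++i)
            for (int k = 0; k < N; ++k)
                for (int j = 0; j < N; ++j)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    int main(void) {
        double *A = malloc(N * N * sizeof(double));
        double *B = malloc(N * N * sizeof(double));
        double *C = calloc(N * N, sizeof(double));
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0; B[i] = 2.0; }

        clock_t t0 = clock();
        matmul_ijk(A, B, C);
        printf("i,j,k: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        matmul_ikj(A, B, C);
        printf("i,k,j: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        free(A); free(B); free(C);
        return 0;
    }

Compiled with optimizations (e.g., gcc -O2), the i, k, j version is typically several times faster, purely from the change in access pattern.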

Cold and Warm Starts

  • Cold Cache: The relevant data is not yet in the cache (or is stale), so it must be read from memory
  • Warm Cache: The relevant data is already in the cache

If you run a code once versus running it 10 times and averaging over those runs, the timings may differ: the first run pays the cold-cache cost of bringing the data in, while subsequent runs hit a warm cache.
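A minimal sketch of the effect (my own illustration; the array size is an assumption, chosen small enough to fit in a last-level cache):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)   /* 1M doubles = 8 MB, fits in many last-level caches */

    int main(void) {
        double *a = calloc(N, sizeof(double));  /* zeroed, but not yet in cache */
        double s = 0.0;

        /* Pass 0 runs cold: every access misses (and pages fault in).
           Later passes find the data resident, i.e. a warm cache. */
        for (int pass = 0; pass < 5; ++pass) {
            clock_t t0 = clock();
            for (int i = 0; i < N; ++i)
                s += a[i];
            printf("pass %d: %.4f s\n", pass,
                   (double)(clock() - t0) / CLOCKS_PER_SEC);
        }
        printf("checksum: %.0f\n", s);  /* keeps the loop from being optimized away */
        free(a);
        return 0;
    }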

False Sharing in multicores

Suppose two threads, running on different cores, repeatedly update two distinct variables x and y that happen to sit in the same cache line. The entire cache line is loaded into the caches of both processors.

The cache line is then bounced between the processors: because x and y are in the same cache line and caches have to be coherent, each write by one core invalidates the line in the other core's cache, even though the threads never touch the same variable. This is called false sharing
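A minimal sketch of false sharing, assuming pthreads (the struct layout, iteration count, and 64 B line size are my own choices; compile with -pthread). Uncommenting the padding pushes y into a different cache line and typically makes the program markedly faster:

    #include <stdio.h>
    #include <pthread.h>

    #define ITERS 100000000L

    /* x and y sit in the same cache line; the padding would split them. */
    struct {
        long x;
        /* char pad[64]; */   /* uncomment to eliminate the false sharing */
        long y;
    } counters;

    static void *bump_x(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERS; ++i) counters.x++;
        return NULL;
    }

    static void *bump_y(void *arg) {
        (void)arg;
        for (long i = 0; i < ITERS; ++i) counters.y++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_x, NULL);
        pthread_create(&t2, NULL, bump_y, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x=%ld y=%ld\n", counters.x, counters.y);
        return 0;
    }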

Parallelism in single processors

Pipelining

Dave Patterson’s laundry example: 4 people doing laundry, where each load takes wash (30 min) + dry (40 min) + fold (20 min) = 90 min

Latency: 90 min

Sequential execution for 4 people doing laundry: 90 min × 4 = 360 min = 6 hours

Pipelined execution: 30 + 40 × 4 + 20 = 210 min = 3.5 hours

Bandwidth = loads/hour

Sequential bandwidth: 4 / 6 ≈ 0.67 loads/hour

Pipelined bandwidth: 4 / 3.5 ≈ 1.14 loads/hour

Pipelining helps bandwidth but not latency

Bandwidth is limited by the slowest pipeline stage
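In general, for $n$ loads with drying as the slowest stage, the arithmetic above becomes (a sketch; the stage symbols are my own notation):

$$T_{\text{seq}} = n\,(t_{\text{wash}} + t_{\text{dry}} + t_{\text{fold}}), \qquad T_{\text{pipe}} = t_{\text{wash}} + n\,t_{\text{dry}} + t_{\text{fold}}$$

With $n = 4$: $T_{\text{seq}} = 4 \times 90 = 360$ min and $T_{\text{pipe}} = 30 + 4 \times 40 + 20 = 210$ min. As $n$ grows, throughput approaches one load per $t_{\text{dry}}$, i.e. the slowest stage sets the bandwidth.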

SIMD: Single Instruction, Multiple Data
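A single instruction operates on several data elements at once, e.g., a vector register holding 4 doubles. A minimal sketch using x86 AVX intrinsics (my own illustration, not from the lecture; assumes an AVX-capable CPU and compilation with -mavx):

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {10.0, 20.0, 30.0, 40.0};
        double c[4];

        /* A single vector add instruction processes four doubles at once. */
        __m256d va = _mm256_loadu_pd(a);
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_add_pd(va, vb);
        _mm256_storeu_pd(c, vc);

        for (int i = 0; i < 4; ++i)
            printf("%.1f ", c[i]);   /* 11.0 21.0 31.0 41.0 */
        printf("\n");
        return 0;
    }

Compilers can often generate such instructions automatically (auto-vectorization) for simple loops, which is one way a single processor exploits parallelism.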