Memory Optimization

CSCD70 Lec8

Posted by Dino on March 15, 2021

Ideal Memory

  • Zero access time (latency)
  • Infinite capacity
  • Zero cost
  • Infinite bandwidth (to support multiple accesses in parallel)

The Problem

  • Ideal memory’s requirements oppose each other
  • Bigger is slower
    • Bigger $\rightarrow$ Takes longer to determine the location
  • Faster is more expensive
    • Memory technology: SRAM, DRAM, Flash, Disk, Tape
  • Higher bandwidth is more expensive
    • Need more banks, more ports, higher frequency, or faster technology

Memory Technology: DRAM

  • Dynamic random access memory
  • Capacitor charge state indicates stored value
    • Whether the capacitor is charged or discharged indicates storage of 1 or 0
    • 1 capacitor
    • 1 access transistor
  • Capacitor leaks through the RC path
    • DRAM cell loses charge over time
    • DRAM cell needs to be refreshed

Memory Technology: SRAM

  • Static random access memory
  • Two cross-coupled inverters store a single bit
    • Feedback path enables the stored value to persist in the “cell”
    • 4 transistors for storage
    • 2 transistors for access

Why Memory Hierarchy?

  • We want both fast and large
  • But we cannot achieve both with a single level of memory
  • Idea: Have multiple levels of storage
    • Progressively bigger and slower as the levels are farther from the processor
    • Ensure most of the data the processor needs is kept in the fast level

  • Fundamental tradeoff:
    • Fast memory: small
    • Large memory: slow

Caching Basics: Exploit Temporal Locality

  • Idea: Store recently accessed data in automatically managed fast memory (called a cache)
  • Anticipation: The data will be accessed again soon
  • Temporal locality principle:
    • Recently accessed data will be accessed again in the near future

Caching Basics: Exploit Spatial Locality

  • Idea: Store data from addresses adjacent to the recently accessed one in automatically managed fast memory
    • Logically divide memory into equal-size blocks
    • Fetch the accessed block into the cache in its entirety
  • Anticipation: Nearby data will be accessed soon

  • Spatial locality principle:
    • Nearby data in memory will be accessed in the near future
      • E.g., sequential instruction access, array traversal
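
Both principles are visible in even a trivial loop. The following C sketch is illustrative only (the function name and types are not from the lecture):

#include <stddef.h>

/* `sum` is touched on every iteration, so it stays cached (temporal
 * locality); a[i] walks consecutive addresses, so each fetched cache
 * block serves several subsequent iterations (spatial locality). */
double sum_array(const double *a, size_t n) {
    double sum = 0.0;           /* reused every iteration: temporal */
    for (size_t i = 0; i < n; i++)
        sum += a[i];            /* stride-1 accesses: spatial */
    return sum;
}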

Optimizing Cache Performance

  • Things to enhance:
    • Temporal Locality
    • Spatial Locality
  • Things to minimize:
    • Conflicts (i.e., bad replacement decisions)
  • Things We Can Manipulate:
    • Time
      • When is an object accessed?
    • Space
      • Where does an object exist in the address space?

Time: Reordering Computation

  • What makes it difficult to know when an object is accessed?
  • How can we predict a better time to access it?
    • What information is needed?
  • How do we know that this would be safe?

Space: Changing Data Layout

  • What do we know about an object’s location?
    • Scalars, structures, pointer-based data structures, arrays, code, etc.
  • How can we tell what a better layout would be?
    • How many can we create?
  • To what extent can we safely alter the layout?

Iteration Space

  • Represent a loop nest of depth n as an n-dimensional space; each position (point) in the space represents one iteration of the nest

Optimizing the Cache Behavior of Array Accesses

  • We need to answer the following questions:
    • When do cache misses occur?
      • Use locality analysis
    • Can we change the order of the iterations (or possibly the data layout) to produce better behavior?
      • Evaluate the cost of various alternatives
    • Does the new ordering/layout still produce correct results?
      • Use dependence analysis

Loop Transformation: Loop Interchange
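
The original figures are not reproduced here; the following C sketch (array name and size are illustrative) shows the transformation. C arrays are row-major, so making the column index the innermost loop turns stride-N accesses into stride-1 accesses. Interchange is only legal when dependence analysis confirms the loops can be permuted, which is trivially true here since the iterations are independent.

#define N 1024

/* Before: the inner loop walks down a column; consecutive accesses are
 * N doubles apart, so nearly every access touches a new cache block. */
double sum_before(double A[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];       /* stride-N: poor spatial locality */
    return s;
}

/* After interchange: the inner loop walks along a row, giving stride-1
 * accesses and good spatial locality. */
double sum_after(double A[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += A[i][j];       /* stride-1: good spatial locality */
    return s;
}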

Loop Transformation: Cache Blocking
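
Again the original figures are omitted; a common illustration (a sketch, not the lecture's example) is blocked matrix multiply. The loop nest is restructured so that each tile of the operands fits in cache and is reused many times before eviction; the tile size BS is a tuning parameter chosen against the effective cache size.

#define N  1024
#define BS 64    /* tile size, tuned to the cache; assumes N % BS == 0 */

/* Blocked matrix multiply, C += A * B: each (ii, jj, kk) step works on
 * BS x BS tiles, so every cached element of A, B, and C is reused about
 * BS times before being evicted. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                /* mini matrix multiply over one tile */
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        double s = C[i][j];
                        for (int k = kk; k < kk + BS; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}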

Locality Analysis

  • Definition:
    • Reuse:
      • Accessing a location that has been accessed in the past
    • Locality:
      • Accessing a location that is now found in the cache
  • Key insights:
    • Locality only occurs when there is reuse
  • Steps:
    1. Find data reuse
      • If caches were infinitely large, we would be finished
    2. Determine “localized iteration space”
      • Set of inner loops where the data accessed by an iteration is expected to fit within the cache
    3. Find data locality:
      • reuse $\cap$ localized iteration space $\rightarrow$ locality
  • Types of data reuse/locality: temporal, spatial, and group (each analyzed below)

Reuse Analysis: Representation

for i = 0 to 2
  for j = 0 to 100
    A[i][j] = B[j][0] + B[j+1][0];

Map the n loop indices into d array indices via an affine array indexing function: $\bf H\vec{i} + \vec{c}$, where $\bf H$ is a $d \times n$ coefficient matrix and $\vec{c}$ is a constant vector.
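
For instance, for the loop nest above, with iteration vector $\vec{i} = (i, j)$ and the row index listed first, the three references map as follows (a standard reading of this example):

  • A[i][j]: $\bf H = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\vec{c} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$
  • B[j][0]: $\bf H = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$, $\vec{c} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$
  • B[j+1][0]: same $\bf H$ as B[j][0], but $\vec{c} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$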

Finding Temporal Reuse

  • Temporal reuse occurs between iterations $\bf\vec{i_1}$ and $\bf\vec{i_2}$ when:
    • $\bf H\vec{i_1} + \vec{c} = H\vec{i_2} + \vec{c}$
    • $\bf H(\vec{i_1} - \vec{i_2}) = \vec{0}$
  • Rather than worrying about individual values $\bf\vec{i_1}$ and $\bf\vec{i_2}$, we say that reuse occurs along direction vector $\bf\vec{r}$ when:
    • $\bf H\vec{r} = \vec{0}$
  • Solution: Compute the nullspace of $\bf H$

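  • Worked example (in place of the slides' figures, using the loop nest above): for B[j][0], $\bf H = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$, so $\bf H\vec{r} = \vec{0}$ forces the $j$ component of $\vec{r}$ to zero and the nullspace is span{(1, 0)}: the same element of B is reused by every iteration of the outer $i$ loop
  • For A[i][j], $\bf H$ is the identity matrix, whose nullspace contains only $\vec{0}$: A has no temporal self-reuse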

Computing Spatial Reuse

  • Replace the last row of $\bf H$ with zeros, creating $\bf H_s$
  • Find the nullspace of $\bf H_s$

  • Result: vector along which we access the same row

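  • Worked example (in place of the slides' figures): for A[i][j] in the loop nest above, $\bf H$ is the identity, so $\bf H_s = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ and its nullspace is span{(0, 1)}
  • Successive iterations of the inner $j$ loop therefore access the same row of A; in a row-major layout these are consecutive elements that typically share a cache line, i.e., spatial reuse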

Computing Group Reuse

  • Only consider “uniformly generated sets”
    • Index expressions differ only by constant terms
  • Check whether they actually do access the same cache line
  • Only the “leading reference” suffers the bulk of the cache misses (see the example below)
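
  • In the running example, B[j][0] and B[j+1][0] form a uniformly generated set: both have $\bf H = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ and differ only in the constant vector
  • The element that B[j+1][0] fetches in iteration $j$ is read again by B[j][0] in iteration $j + 1$, so B[j+1][0] is the leading reference: it incurs the misses, and B[j][0] mostly hits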

Localized Iteration Space

  • Given a finite cache, when does reuse result in locality?
  • A loop is localized if it accesses less data than the effective cache size
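
  • A rough calculation for the running example (assuming 8-byte elements): the whole nest touches about $3 \times 101 = 303$ elements of A plus about 102 elements of B, roughly 3.2 KB; one execution of the inner $j$ loop touches about 203 of those elements, roughly 1.6 KB
  • If the full footprint fits in the effective cache size, both loops are localized and the reuse along (1, 0) becomes locality; if only the inner loop's footprint fits, the localized space is just span{(0, 1)} (cf. the next section)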

Computing Locality

  • Reuse Vector Space $\cap$ Localized Vector Space $\rightarrow$ Locality Space

  • If both loops are localized:
    • span{(1, 0)} $\cap$ span{(1, 0), (0, 1)} $\rightarrow$ span{(1, 0)}
    • i.e., temporal reuse does result in temporal locality
  • If only the innermost loop is localized:
    • span{(1, 0)} $\cap$ span{(0, 1)} $\rightarrow$ span{}
    • i.e., no temporal locality