Performance Model
- Assume just 2 levels in the hierarchy, fast and slow
- All data initially in slow memory
- m = number of memory elements (words) moved between fast and slow memory
- $t_m$ = time per slow memory operation
- f = number of arithmetic operations
- $t_f$ = time per arithmetic operation
- q = $\frac{f}{m}$ = average number of flops per slow memory access
- Computational Intensity: Key to algorithm efficiency
- Minimum possible time: f * $t_f$ when all data in fast memory
- Actual time: $f * t_f + m * t_m = f * t_f * (1 + \frac{t_m}{t_f} * \frac{1}{q})$
- Larger q means time closer to minimum $f * t_f$
- $\frac{t_m}{t_f}$: Machine Balance: Key to machine efficiency
- q $\geq \frac{t_m}{t_f}$ needed to get at least half of peak speed
- Speed is flops per unit time, so peak speed is $\frac{f}{f \cdot t_f} = \frac{1}{t_f}$, achieved when all data is in fast memory
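To make the model concrete, here is a minimal C sketch that evaluates these formulas. The machine parameters ($t_f$, $t_m$) and operation counts below are made-up illustrative values, not measurements from any real machine:

```c
#include <stdio.h>

/* Two-level memory model: time = f*t_f + m*t_m.
 * All parameter values below are hypothetical, chosen only to
 * illustrate how q and machine balance interact. */
int main(void) {
    double t_f = 1e-9;   /* assumed time per flop (s) */
    double t_m = 1e-7;   /* assumed time per slow memory access (s) */
    double f   = 2e9;    /* arithmetic operations */
    double m   = 1e9;    /* words moved between fast and slow memory */

    double q       = f / m;      /* computational intensity */
    double balance = t_m / t_f;  /* machine balance */
    double t_min   = f * t_f;    /* minimum time: all data in fast memory */
    double t_act   = f * t_f * (1.0 + balance / q);

    printf("q = %g, machine balance = %g\n", q, balance);
    printf("minimum time = %g s, actual time = %g s\n", t_min, t_act);
    printf("fraction of peak = %g\n", t_min / t_act);
    return 0;
}
```

With these assumed numbers, q = 2 against a machine balance of 100, so the modeled speed is under 2% of peak: exactly the situation the $q \geq \frac{t_m}{t_f}$ rule of thumb warns about.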
Warm up: Matrix-Vector Multiplication
{implements y = y + A * x}
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)
{implements y = y + A * x}
# read x(1:n) into fast memory -> n memory references
# read y(1:n) into fast memory -> n memory references
for i = 1:n
    # read row i of A into fast memory -> n memory references, n^2 total over the loop
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)
# write y(1:n) back to slow memory -> n memory references
- m = number of slow memory references = n + n + (n * n) + n = $n^2 + 3n$
- f = number of arithmetic operations = $2n^2$ (one multiply and one add per element of A)
- q = $\frac{f}{m} = \frac{2n^2}{n^2 + 3n} \approx 2$ for large n
- A computational intensity of 2 means matrix-vector multiply cannot get close to half of peak speed: that would require $q \geq \frac{t_m}{t_f}$, and the machine balance of real machines is far larger than 2
- Matrix-vector multiplication is therefore a memory-bound operation
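For reference, a direct C translation of the pseudocode above, assuming row-major storage of A (a plain sketch, not a tuned kernel):

```c
#include <stddef.h>

/* y = y + A*x for an n-by-n matrix A stored in row-major order.
 * Each element of A is touched exactly once and contributes 2 flops,
 * which is why q stays stuck near 2. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int j = 0; j < n; j++)
            yi += A[(size_t)i * n + j] * x[j];
        y[i] = yi;
    }
}
```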
Naive Matrix Multiply
{implements C = C + A * B}
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
- Algorithm has $2n^3 = O(n^3)$ flops
- If all the data fit in fast memory (the ideal case!), we would make only $3n^2$ memory references, because each matrix would be read from slow memory once and then stay in fast memory
- q potentially as large as $\frac{2n^3}{3n^2} = O(n)$
{implements C = C + A * B}
for i = 1:n
    # read row i of A into fast memory -> n references, n^2 total over the i loop
    for j = 1:n
        # read C(i, j) into fast memory -> 1 reference, n^2 total over the two loops
        # read column j of B into fast memory -> n references, n^3 total over the two loops
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
        # write C(i, j) back to slow memory -> 1 reference, n^2 total over the two loops
m = n * (n + n * (1 + n + 1)) = $n^3 + 3n^2$
- $n^3$ to read each column of B n times
- $n^2$ to read each row of A once
- $2n^2$ to read and write each element of C once
q = $\frac{f}{m} = \frac{2n^3}{n^3 + 3n^2} \approx 2$ for large n
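The triple loop maps directly to C; a minimal sketch, again assuming row-major storage:

```c
#include <stddef.h>

/* C = C + A*B for n-by-n row-major matrices: the naive triple loop.
 * The inner loop walks down a column of B, so without blocking each
 * column of B must be re-fetched from slow memory for every row of A. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[(size_t)i * n + j];
            for (int k = 0; k < n; k++)
                cij += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = cij;
        }
}
```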
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be N-by-N matrices of b-by-b sub-blocks, where b = $\frac{n}{N}$ is the block size
for i = 1 to N
    for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i, j) = C(i, j) + A(i, k) * B(k, j) {do a matrix multiply on b-by-b blocks}
        {write block C(i, j) back to slow memory}
m = $N^2 \cdot \left( \left(\frac{n}{N}\right)^2 + N \cdot \left( \left(\frac{n}{N}\right)^2 + \left(\frac{n}{N}\right)^2 \right) + \left(\frac{n}{N}\right)^2 \right) = (2N + 2) \, n^2$
- $N \cdot n^2$ to read the blocks of B: $N^3$ block reads of $(\frac{n}{N})^2$ words each (each block of B is read N times)
- $N \cdot n^2$ to read the blocks of A: $N^3$ block reads of $(\frac{n}{N})^2$ words each (each block of A is read N times)
- $2n^2$ to read and write each block of C once
q = $\frac{f}{m} = \frac{2n^3}{(2N + 2) \, n^2} \approx \frac{n}{N} = b$ for large n
So we can improve performance by increasing the block size b
Blocked matrix multiply can achieve a much better computational intensity than matrix-vector multiply (q = 2)
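A minimal C sketch of the tiled loop above, assuming n is a multiple of the block size b and row-major storage (illustrative, not a tuned implementation):

```c
#include <stddef.h>

/* C = C + A*B with b-by-b tiles; assumes n % b == 0.
 * The three innermost loops perform the block multiply
 * C(i,j) = C(i,j) + A(i,k) * B(k,j) from the pseudocode above. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double cij = C[(size_t)i * n + j];
                        for (int k = kk; k < kk + b; k++)
                            cij += A[(size_t)i * n + k] * B[(size_t)k * n + j];
                        C[(size_t)i * n + j] = cij;
                    }
}
```

Choosing b so that three b-by-b tiles fit in cache keeps each tile resident while it is reused, which is the source of the improved q.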
Limits to Optimizing Matrix Multiply
- The tiled matrix multiply analysis assumes that three b-by-b tiles/blocks fit into fast memory at once, i.e. $3b^2 \leq M_{fast}$
- If $M_{fast}$ is the size of fast memory (in words), then the previous analysis shows that the blocked algorithm has computational intensity $q \approx b \leq \left(\frac{M_{fast}}{3}\right)^{\frac{1}{2}}$
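As a worked illustration with assumed numbers (not from these notes): a 256 KB fast memory holds $32768$ doubles, so the bound gives $b \leq \sqrt{32768 / 3} \approx 104$, i.e. $q \approx 104$ for blocked matrix multiply versus $q \approx 2$ for the naive version.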