Single Processor Machines: Performance Model

CSC367 Week 2

Posted by Dino on September 24, 2021

Performance Model

  • Assume just 2 levels in the hierarchy, fast and slow
  • All data initially in slow memory
    • m = number of memory elements (words) moved between fast and slow memory
    • $t_m$ = time per slow memory operation
    • f = number of arithmetic operations
    • $t_f$ = time per arithmetic operation
    • q = $\frac{f}{m}$ average number of flops per slow memory access
      • Computational Intensity: Key to algorithm efficiency
  • Minimum possible time: f * $t_f$, when all data is in fast memory
  • Actual time: $f * t_f + m * t_m = f * t_f * (1 + \frac{t_m}{t_f} * \frac{1}{q})$
  • Larger q means time closer to the minimum $f * t_f$
    • $\frac{t_m}{t_f}$: Machine Balance: Key to machine efficiency
    • q $\geq \frac{t_m}{t_f}$ needed to get at least half of peak speed, since then the memory term $\frac{t_m}{t_f} * \frac{1}{q} \leq 1$ and the actual time is at most twice the minimum
    • Speed is work over time, so peak speed is $\frac{f}{f * t_f} = \frac{1}{t_f}$ flops per unit time, attained when all data is in fast memory (the sketch below evaluates this model)
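To make the model concrete, here is a minimal C sketch that plugs numbers into the formula above. The values of $t_f$, $t_m$ (a machine balance of 50), and the sample q values are assumptions for illustration, not measurements from any real machine:

#include <stdio.h>

/* Two-level memory model: actual time = f*tf*(1 + (tm/tf) * (1/q)).
 * tf, tm, and the q values are illustrative assumptions only. */
int main(void) {
    double tf = 1.0;   /* time per arithmetic operation (normalized) */
    double tm = 50.0;  /* time per slow memory access: assumed balance tm/tf = 50 */
    double q[] = {2.0, 10.0, 50.0, 200.0};

    for (int i = 0; i < 4; i++) {
        /* fraction of peak = minimum time / actual time = 1 / (1 + (tm/tf)/q) */
        double frac = 1.0 / (1.0 + (tm / tf) / q[i]);
        printf("q = %6.1f -> %5.1f%% of peak\n", q[i], 100.0 * frac);
    }
    return 0;
}

With these numbers, q = 2 reaches only about 4% of peak, while q = 50 (equal to the machine balance) reaches exactly half of peak, matching the bullet above.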

Warm up: Matrix-Vector Multiplication

{implements y = y + A * x}
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)


{implements y = y + A * x}
# read x(1:n) into fast memory -> n memory references
# read y(1:n) into fast memory -> n memory references
for i = 1:n
    # read row i of A into fast memory -> n memory references per iteration, n * n total
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)
# write y(1:n) back to slow memory -> n memory references
  • m = number of slow memory references = n + n + (n * n) + n = $n^2 + 3n$
  • f = number of arithmetic operations = 2 * $n^2$ (one multiply and one add per inner iteration)
  • q = $\frac{f}{m} = \frac{2n^2}{n^2 + 3n} \approx$ 2 for large n

  • The computational intensity of 2 in matrix-vector multiply means that we cannot get close to half of peak speed on any machine whose balance $\frac{t_m}{t_f}$ is larger than 2, which is essentially every real machine

  • Matrix-vector multiplication is therefore a memory-bound operation (a C sketch follows below)
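For reference, the pseudocode translates directly into C. This is a minimal sketch, assuming a row-major n-by-n matrix and a function name of my own choosing:

/* y = y + A * x for a row-major n-by-n matrix A.
 * Performs f = 2 * n^2 flops while moving about n^2 + 3n words,
 * so q stays near 2 no matter how large n gets. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];  /* one multiply + one add */
}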

Naive Matrix Multiply

{implements C = C + A * B}
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
  • Algorithm has 2 * $n^3$ = $O(n^3)$ flops
  • If all the data would fit in fast memory (ideal case!), we would only make 3 * $n^2$ memory references, because each matrix is read from slow memory once and then stays in fast memory
  • q potentially as large as $\frac{2 * n^3}{3 * n^2}$ = O(n)


{implements C = C + A * B}
for i = 1:n
    # read row i of A into fast memory -> n references per iteration, n * n total
    for j = 1:n
        # read C(i, j) into fast memory -> 1 reference, inside two loops: n^2 total
        # read column j of B into fast memory -> n references, inside two loops: n^3 total
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
        # write C(i, j) back to slow memory -> 1 reference, inside two loops: n^2 total

m = $n * (n + n * (1 + n + 1)) = n^3 + 3n^2$

  • $n^3$ to read each column of B n times
  • $n^2$ to read each row of A once
  • $2n^2$ to read and write each element of C once

q = $\frac{f}{m}$ = $\frac{2n^3}{n^3 + 3n^2}$ $\approx$ 2 for large n
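The same loop nest in C, again as a sketch with row-major storage and names of my own choosing; hoisting C(i, j) into a scalar mirrors the read/write counting above:

/* C = C + A * B for row-major n-by-n matrices: 2 * n^3 flops.
 * When the matrices exceed fast memory, the strided accesses to
 * B[k*n + j] dominate with ~n^3 slow references, so q stays near 2. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[i * n + j];          /* read C(i, j) once */
            for (int k = 0; k < n; k++)
                cij += A[i * n + k] * B[k * n + j];
            C[i * n + j] = cij;                 /* write C(i, j) once */
        }
}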

Blocked (Tiled) Matrix Multiply

Consider A, B, C to be N-by-N matrices of b-by-b sub-blocks, where b = $\frac{n}{N}$ is the block size

for i = 1 to N
    for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i, j) = C(i, j) + A(i, k) * B(k, j) {do a matrix multiply on blocks}
        {write back C(i, j) to slow memory}

m = $N^2 * ((\frac{n}{N})^2 + N * ((\frac{n}{N})^2 + (\frac{n}{N})^2) + (\frac{n}{N})^2) = (2N + 2) * n^2$

  • $N * n^2$ to read each block of B N times ($N^3$ block reads in total)
  • $N * n^2$ to read each block of A N times ($N^3$ block reads in total)
  • $2n^2$ to read and write each block of C once

q = $\frac{f}{m} = \frac{2n^3}{(2N + 2) * n^2} \approx \frac{n}{N}$ = b for large n

So we can improve performance by increasing the block size b

This can give a much better computational intensity than matrix-vector multiply or naive matrix multiply (q $\approx$ 2); a C sketch of the blocked loop nest follows below
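A minimal C sketch of the blocked algorithm, assuming for simplicity that the block size b divides n evenly (function and variable names are my own):

/* Blocked C = C + A * B with row-major n-by-n matrices and n = N * b.
 * Each block multiply does 2 * b^3 flops on 3 * b^2 words, so the
 * computational intensity grows to q ~ b. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                /* multiply block A(ii, kk) by block B(kk, jj) into C(ii, jj) */
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double cij = C[i * n + j];
                        for (int k = kk; k < kk + b; k++)
                            cij += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = cij;
                    }
}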

Limits to Optimizing Matrix Multiply

  • The tiled matrix multiply analysis assumes that three tiles / blocks (one each from A, B, and C) fit into fast memory at once, i.e. $3b^2 \leq M_{fast}$

  • If $M_{fast}$ is the size of fast memory (in words), then the previous analysis shows that the blocked algorithm has computational intensity $q \approx b \leq \sqrt{\frac{M_{fast}}{3}}$
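As a rough worked example (the fast memory size here is an assumed value, not a property of any particular machine): a 256 KB fast memory holding 8-byte doubles gives $M_{fast}$ = 32768 words, so $b \leq \sqrt{\frac{32768}{3}} \approx 104$, and the blocked algorithm tops out at a computational intensity of about q $\approx$ 104 on such a machine.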