Performance Model
- Assume just 2 levels in the hierarchy, fast and slow
- All data initially in slow memory
- m = number of memory elements (words) moved between fast and slow memory
- $t_m$ = time per slow memory operation
- f = number of arithmetic operations
- $t_f$ = time per arithmetic operation
- q = $\frac{f}{m}$ = average number of flops per slow memory access
- Computational Intensity: Key to algorithm efficiency
- Minimum possible time: f * $t_f$ when all data in fast memory
- Actual time: $f * t_f + m * t_m = f * t_f * (1 + \frac{t_m}{t_f} * \frac{1}{q})$
- Larger q means time closer to minimum $f * t_f$
- $\frac{t_m}{t_f}$: Machine Balance: Key to machine efficiency
- q $\geq \frac{t_m}{t_f}$ needed to get at least half of peak speed
- Speed is flops per unit time, so peak speed is $\frac{f}{f \cdot t_f} = \frac{1}{t_f}$, achieved when all data is in fast memory
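To make the model concrete, here is a minimal C sketch that evaluates these formulas. The machine parameters ($t_f$, $t_m$) and operation counts below are made-up illustrative values, not measurements from any real machine:

```c
#include <stdio.h>

/* Two-level memory model: time = f*t_f + m*t_m.
 * All parameter values below are hypothetical, chosen only to
 * illustrate how q and machine balance interact. */
int main(void) {
    double t_f = 1e-9;   /* assumed time per flop (s) */
    double t_m = 1e-7;   /* assumed time per slow memory access (s) */
    double f   = 2e9;    /* arithmetic operations */
    double m   = 1e9;    /* words moved between fast and slow memory */

    double q       = f / m;      /* computational intensity */
    double balance = t_m / t_f;  /* machine balance */
    double t_min   = f * t_f;    /* minimum time: all data in fast memory */
    double t_act   = f * t_f * (1.0 + balance / q);

    printf("q = %g, machine balance = %g\n", q, balance);
    printf("minimum time = %g s, actual time = %g s\n", t_min, t_act);
    printf("fraction of peak = %g\n", t_min / t_act);
    return 0;
}
```

With these assumed numbers, q = 2 against a machine balance of 100, so the modeled speed is under 2% of peak: exactly the situation the $q \geq \frac{t_m}{t_f}$ rule of thumb warns about.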
Warm up: Matrix-Vector Multiplication
{implements y = y + A * x}
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)
{implements y = y + A * x}
# read x(1:n) into fast memory -> n memory references
# read y(1:n) into fast memory -> n memory references
for i = 1:n
    # read row i of A into fast memory -> n memory references, n^2 total over the loop
    for j = 1:n
        y(i) = y(i) + A(i, j) * x(j)
# write y(1:n) back to slow memory -> n memory references
- m = number of slow memory references = n + n + (n * n) + n = $n^2 + 3n$
- f = number of arithmetic operations = $2n^2$ (one multiply and one add per element of A)
- q = $\frac{f}{m} = \frac{2n^2}{n^2 + 3n} \approx 2$ for large n
- A computational intensity of 2 means matrix-vector multiply cannot get close to half of peak speed: that would require $q \geq \frac{t_m}{t_f}$, and the machine balance of real machines is far larger than 2
- Matrix-vector multiplication is therefore a memory-bound operation
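For reference, a direct C translation of the pseudocode above, assuming row-major storage of A (a plain sketch, not a tuned kernel):

```c
#include <stddef.h>

/* y = y + A*x for an n-by-n matrix A stored in row-major order.
 * Each element of A is touched exactly once and contributes 2 flops,
 * which is why q stays stuck near 2. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int j = 0; j < n; j++)
            yi += A[(size_t)i * n + j] * x[j];
        y[i] = yi;
    }
}
```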
Naive Matrix Multiply
{implements C = C + A * B}
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
- Algorithm has $2n^3 = O(n^3)$ flops
- If all the data fit in fast memory (the ideal case!), we would make only $3n^2$ memory references, because each matrix would be read from slow memory once and then stay in fast memory
- q potentially as large as $\frac{2n^3}{3n^2} = O(n)$
{implements C = C + A * B}
for i = 1:n
    # read row i of A into fast memory -> n references, n^2 total over the i loop
    for j = 1:n
        # read C(i, j) into fast memory -> 1 reference, n^2 total over the two loops
        # read column j of B into fast memory -> n references, n^3 total over the two loops
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
        # write C(i, j) back to slow memory -> 1 reference, n^2 total over the two loops
m = n * (n + n * (1 + n + 1)) = $n^3 + 3n^2$
- $n^3$ to read each column of B n times
- $n^2$ to read each row of A once
- $2n^2$ to read and write each element of C once
q = $\frac{f}{m} = \frac{2n^3}{n^3 + 3n^2} \approx 2$ for large n
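The triple loop maps directly to C; a minimal sketch, again assuming row-major storage:

```c
#include <stddef.h>

/* C = C + A*B for n-by-n row-major matrices: the naive triple loop.
 * The inner loop walks down a column of B, so without blocking each
 * column of B must be re-fetched from slow memory for every row of A. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double cij = C[(size_t)i * n + j];
            for (int k = 0; k < n; k++)
                cij += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = cij;
        }
}
```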
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be N-by-N matrices of b-by-b sub-blocks, where b = $\frac{n}{N}$ is the block size
for i = 1 to N
    for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
            {read block A(i,k) into fast memory}
            {read block B(k,j) into fast memory}
            C(i, j) = C(i, j) + A(i, k) * B(k, j) {do a matrix multiply on b-by-b blocks}
        {write block C(i, j) back to slow memory}
m = $N^2 \cdot \left( \left(\frac{n}{N}\right)^2 + N \cdot \left( \left(\frac{n}{N}\right)^2 + \left(\frac{n}{N}\right)^2 \right) + \left(\frac{n}{N}\right)^2 \right) = (2N + 2) \, n^2$
- $N \cdot n^2$ to read the blocks of B: $N^3$ block reads of $(\frac{n}{N})^2$ words each (each block of B is read N times)
- $N \cdot n^2$ to read the blocks of A: $N^3$ block reads of $(\frac{n}{N})^2$ words each (each block of A is read N times)
- $2n^2$ to read and write each block of C once
q = $\frac{f}{m} = \frac{2n^3}{(2N + 2) \, n^2} \approx \frac{n}{N} = b$ for large n
So we can improve performance by increasing the block size b
Blocked matrix multiply can achieve a much better computational intensity than matrix-vector multiply (q = 2)
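A minimal C sketch of the tiled loop above, assuming n is a multiple of the block size b and row-major storage (illustrative, not a tuned implementation):

```c
#include <stddef.h>

/* C = C + A*B with b-by-b tiles; assumes n % b == 0.
 * The three innermost loops perform the block multiply
 * C(i,j) = C(i,j) + A(i,k) * B(k,j) from the pseudocode above. */
void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                for (int i = ii; i < ii + b; i++)
                    for (int j = jj; j < jj + b; j++) {
                        double cij = C[(size_t)i * n + j];
                        for (int k = kk; k < kk + b; k++)
                            cij += A[(size_t)i * n + k] * B[(size_t)k * n + j];
                        C[(size_t)i * n + j] = cij;
                    }
}
```

Choosing b so that three b-by-b tiles fit in cache keeps each tile resident while it is reused, which is the source of the improved q.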
Limits to Optimizing Matrix Multiply
- The tiled matrix multiply analysis assumes that three b-by-b tiles/blocks fit into fast memory at once, i.e. $3b^2 \leq M_{fast}$
- If $M_{fast}$ is the size of fast memory (in words), then the previous analysis shows that the blocked algorithm has computational intensity $q \approx b \leq \left(\frac{M_{fast}}{3}\right)^{\frac{1}{2}}$
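As a worked illustration with assumed numbers (not from these notes): a 256 KB fast memory holds $32768$ doubles, so the bound gives $b \leq \sqrt{32768 / 3} \approx 104$, i.e. $q \approx 104$ for blocked matrix multiply versus $q \approx 2$ for the naive version.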