Parallel algorithm models
Model = 1 decomposition type + 1 mapping type + strategies to minimize interactions
Commonly used parallel algorithm models:
- Data parallel model
- Work pool model
- Master-slave model
Data parallel model
- Decomposition: Typically static and uniform data partitioning
- Mapping: Static
- Same operations on different data items, aka data parallelism
- Possible optimization strategies:
- Choose a locality-preserving decomposition
- Overlap computation with interaction
- This model scales well with problem size, since larger inputs provide more uniform work to distribute (see the sketch below)
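A minimal sketch of this model, assuming Python's multiprocessing and an element-wise squaring operation (both illustrative choices, not from the notes): the data is split into uniform contiguous chunks up front (static decomposition) and each chunk is mapped to one worker (static mapping).

```python
from multiprocessing import Pool

def square_chunk(chunk):
    # The same operation is applied to different data items: data parallelism.
    return [x * x for x in chunk]

def data_parallel_square(data, num_procs=4):
    # Static, uniform decomposition: contiguous (locality-preserving) chunks,
    # decided up front, one per process.
    size = (len(data) + num_procs - 1) // num_procs
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(num_procs) as pool:
        # Static mapping: chunk i goes to worker i.
        return [x for chunk in pool.map(square_chunk, chunks) for x in chunk]

if __name__ == "__main__":
    print(data_parallel_square(list(range(10))))  # [0, 1, 4, ..., 81]
```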
Work pool model
- Tasks are taken by processes from a common pool of work
- Decomposition: Highly depends on the problem
- Tasks may all be available statically at the start, or more tasks may be created dynamically during execution
- Mapping: Dynamic
- Any task can be performed by any process
- Possible strategies for reducing interactions:
- Adjust granularity: Tradeoff between load imbalance and the overhead of accessing the work pool (see the sketch below)
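A minimal work pool sketch, again using Python's multiprocessing (the queue-based design and the toy squaring tasks are illustrative assumptions): tasks sit in a common queue and idle workers pull from it, so the mapping is dynamic.

```python
from multiprocessing import Process, Queue

def worker(task_q, result_q):
    # Any process takes any task from the common pool until it sees a sentinel.
    while True:
        task = task_q.get()
        if task is None:
            break
        result_q.put(task * task)

if __name__ == "__main__":
    num_procs, num_tasks = 4, 20
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=worker, args=(task_q, result_q)) for _ in range(num_procs)]
    for w in workers:
        w.start()
    for t in range(num_tasks):      # here the pool is statically filled at the start
        task_q.put(t)
    for _ in workers:               # one sentinel per worker
        task_q.put(None)
    print(sorted(result_q.get() for _ in range(num_tasks)))
    for w in workers:
        w.join()
```

Each task here is a single number; in practice tasks would be batched into larger chunks, trading a little load imbalance for fewer accesses to the shared queue.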
Master-slave model
- Commonly used in distributed parallel architectures
- A master process generates work and allocates it to worker processes
- Could involve several masters, or a hierarchy of master processes
- Decomposition: Highly depends on the problem
- Might be static if tasks are easy to break down a priori, or dynamic
- Mapping: Often dynamic
- Any worker can be assigned any of the tasks
- Possible strategies for reducing interactions:
- Choose granularity carefully so that the master does not become a bottleneck
- Overlap computation on the workers with the master generating further tasks (see the sketch below)
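A master-slave sketch under the same illustrative assumptions (Python multiprocessing, toy squaring tasks): unlike the plain work pool, a distinguished master process generates the tasks, feeds them to the workers, and collects the results, so task generation overlaps with worker computation.

```python
from multiprocessing import Process, Queue

def slave(task_q, result_q):
    while True:
        task = task_q.get()
        if task is None:
            break
        result_q.put((task, task * task))

def master(num_workers=4, num_tasks=20):
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=slave, args=(task_q, result_q)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    # The master generates work incrementally; workers already compute on
    # earlier tasks while later ones are still being produced.
    for t in range(num_tasks):
        task_q.put(t)
    results = [result_q.get() for _ in range(num_tasks)]
    for _ in workers:
        task_q.put(None)  # shut the workers down
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    print(sorted(master()))
```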
Parallel program performance model
- Parallel execution on N processes rarely results in an N-fold performance gain
- Overheads include inter-process communication, idling, and/or excess computation
Performance metrics for parallel programs
- Execution time
- Speedup
- Efficiency
- Example performance plots
- Ways to fool the masses with performance results
Execution time
- Execution time
- Time elapsed from the start of the parallel computation to the moment the last process finishes
- Determined by the slowest process
Speedup
- Performance gain of the parallel algorithm over the sequential version
- S = $\frac{T_s}{T_p}$
- $T_s$: Time of the serial execution
- $T_p$: Time to execute on p processors
- Speedup is calculated relative to the best serial implementation of the algorithm.
- Maximum achievable speedup is called linear speedup
- If each process finishes in $\frac{T_s}{p}$ time, then $S = p$
- If $S > p$ => superlinear speedup
- Superlinear speedup can only happen if the sequential algorithm is at a disadvantage compared to the parallel version
- Data too large to fit into 1 processor’s cache => data accesses are slower for serial algorithm
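- Worked example (the numbers are made up): with $T_s = 12$ s and $T_p = 4$ s on $p = 4$ processes, $S = \frac{12}{4} = 3 < p$, i.e., sublinear; $S = 4$ would be linear speedup, and $S > 4$ superlinear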
- e.g.,
- Old program: $T = T_1 + T_2$
- New program: $T' = T_1 + T_2'$
- $T_1$: Time that cannot be enhanced
- $T_2$: Time that can be enhanced
- $T_2'$: Time of the enhanced part after the enhancement
- Overall speedup: $S_{overall} = \frac{T}{T'}$
- Amdahl’s law
- Suppose only part of an application is parallel
- $T_1$ = fraction of work done sequentially
- $T_2 = 1 - T_1$ is fraction parallelizable
- p = number of processors
- Speedup(p) = $\frac{T}{T'}$
- Speedup(p) $\leq \frac{1}{T_1 + \frac{1 - T_1}{p}} \leq \frac{1}{T_1}$
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the sketch below)
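A small sketch of this bound in Python (the function name and the 10%-sequential example are illustrative):

```python
def amdahl_speedup_bound(seq_fraction, p):
    # Upper bound on speedup when seq_fraction of the work is inherently
    # sequential and the rest parallelizes perfectly over p processors.
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)

# With 10% sequential work: 16 processors give at most ~6.4x,
# and no processor count can push past 1 / 0.1 = 10x.
for p in (4, 16, 64, 1_000_000):
    print(p, round(amdahl_speedup_bound(0.1, p), 2))
```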
Efficiency
- The fraction of time during which processes do useful work
- $E = \frac{S}{p}$
- What is ideal efficiency?
- 1
- What is the range of values for E?
- 0 to 1
- What is the efficiency of calculating the sum of n array elements on n processes?
- $T_s = \Theta(n)$, while a tree reduction gives $T_p = \Theta(\log n)$, so $S = \Theta(\frac{n}{\log n})$ (see the sketch below)
- E = $\frac{\Theta(\frac{n}{\log n})}{n}$
- E = $\Theta(\frac{1}{\log n})$
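A serial simulation of the tree reduction behind this analysis (the pairing scheme is one illustrative choice): counting rounds confirms that n values are summed in $\Theta(\log n)$ parallel steps when each round's additions run concurrently.

```python
import math

def tree_sum(values):
    # Each round combines pairs of partial sums; on n processes all pairs in
    # a round run concurrently, so n inputs need ceil(log2(n)) rounds.
    rounds = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:              # odd element carries over to the next round
            paired.append(values[-1])
        values, rounds = paired, rounds + 1
    return values[0], rounds

total, rounds = tree_sum(list(range(1024)))
print(total, rounds, math.ceil(math.log2(1024)))  # 523776 10 10
```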
Reporting
Examples of how to fool the masses when reporting results from your parallel program
- Use 64-bit precision for the baseline/serial numbers and 32-bit for the parallel numbers:
- Correct approach: Use the same precision for both the parallel implementation and the serial/baseline
- This type of “cheating” in speedup reports often happens in GPU parallel programming, where single-precision is faster than double-precision computation
- Use a bad algorithm for the baseline:
- Correct approach: Always optimize the serial algorithm first and use it as the baseline for speedups
- Use a bad implementation for the baseline:
- Correct approach: If, while optimizing the parallel code, you realize the serial version could have been optimized better, go back, optimize the serial code, and use that as the baseline
- Don’t report running times at all:
- Correct approach: Report running times as well as speedup