Parallel algorithm models
Model = 1 decomposition type + 1 mapping type + strategies to minimize interactions
Commonly used parallel algorithm models:
- Data parallel model
- Work pool model
- Master-slave model
Data parallel model
- Decomposition: Typically static and uniform data partitioning
- Mapping: Static
- Same operations on different data items, aka data parallelism
- Possible optimization strategies:
- Choose a locality-preserving decomposition
- Overlap computation with interaction
- This model scales well with problem size, since larger inputs provide more uniform work to distribute (see the sketch below)
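A minimal sketch of this model, assuming Python's multiprocessing and an element-wise squaring operation (both illustrative choices, not from the notes): the data is split into uniform contiguous chunks up front (static decomposition) and each chunk is mapped to one worker (static mapping).

```python
from multiprocessing import Pool

def square_chunk(chunk):
    # The same operation is applied to different data items: data parallelism.
    return [x * x for x in chunk]

def data_parallel_square(data, num_procs=4):
    # Static, uniform decomposition: contiguous (locality-preserving) chunks,
    # decided up front, one per process.
    size = (len(data) + num_procs - 1) // num_procs
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(num_procs) as pool:
        # Static mapping: chunk i goes to worker i.
        return [x for chunk in pool.map(square_chunk, chunks) for x in chunk]

if __name__ == "__main__":
    print(data_parallel_square(list(range(10))))  # [0, 1, 4, ..., 81]
```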
Work pool model
- Tasks are taken by processes from a common pool of work
- Decomposition: Highly depends on the problem
- Tasks may all be available statically at the start, or more tasks may be created dynamically during execution
- Mapping: Dynamic
- Any task can be performed by any process
- Possible strategies for reducing interactions:
- Adjust granularity: Tradeoff between load imbalance and the overhead of accessing the work pool (see the sketch below)
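A minimal work pool sketch, again using Python's multiprocessing (the queue-based design and the toy squaring tasks are illustrative assumptions): tasks sit in a common queue and idle workers pull from it, so the mapping is dynamic.

```python
from multiprocessing import Process, Queue

def worker(task_q, result_q):
    # Any process takes any task from the common pool until it sees a sentinel.
    while True:
        task = task_q.get()
        if task is None:
            break
        result_q.put(task * task)

if __name__ == "__main__":
    num_procs, num_tasks = 4, 20
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=worker, args=(task_q, result_q)) for _ in range(num_procs)]
    for w in workers:
        w.start()
    for t in range(num_tasks):      # here the pool is statically filled at the start
        task_q.put(t)
    for _ in workers:               # one sentinel per worker
        task_q.put(None)
    print(sorted(result_q.get() for _ in range(num_tasks)))
    for w in workers:
        w.join()
```

Each task here is a single number; in practice tasks would be batched into larger chunks, trading a little load imbalance for fewer accesses to the shared queue.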
Master-slave model
- Commonly used in distributed parallel architectures
- A master process generates work and allocates it to worker processes
- Could involve several masters, or a hierarchy of master processes
- Decomposition: Highly depends on the problem
- Might be static if tasks are easy to break down a priori, or dynamic
- Mapping: Often dynamic
- Any worker can be assigned any of the tasks
- Possible strategies for reducing interactions:
- Choose granularity carefully so that the master does not become a bottleneck
- Overlap computation on the workers with the master generating further tasks (see the sketch below)
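A master-slave sketch under the same illustrative assumptions (Python multiprocessing, toy squaring tasks): unlike the plain work pool, a distinguished master process generates the tasks, feeds them to the workers, and collects the results, so task generation overlaps with worker computation.

```python
from multiprocessing import Process, Queue

def slave(task_q, result_q):
    while True:
        task = task_q.get()
        if task is None:
            break
        result_q.put((task, task * task))

def master(num_workers=4, num_tasks=20):
    task_q, result_q = Queue(), Queue()
    workers = [Process(target=slave, args=(task_q, result_q)) for _ in range(num_workers)]
    for w in workers:
        w.start()
    # The master generates work incrementally; workers already compute on
    # earlier tasks while later ones are still being produced.
    for t in range(num_tasks):
        task_q.put(t)
    results = [result_q.get() for _ in range(num_tasks)]
    for _ in workers:
        task_q.put(None)  # shut the workers down
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    print(sorted(master()))
```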
Parallel program performance model
- Parallel execution on N processes rarely results in an N-fold performance gain
- Overheads include inter-process communication, idling, and/or excess computation
Performance metrics for parallel programs
- Execution time
- Speedup
- Efficiency
- Example performance plots
- Ways to fool the masses with performance results
Execution time
- Execution time
- Time elapsed from the start of the parallel computation to the moment the last process finishes
- Determined by the slowest process
Speedup
- Performance gain of the parallel algorithm over the sequential version
- S = $\frac{T_s}{T_p}$
- $T_s$: Time of the serial execution
- $T_p$: Time to execute on p processors
- Speedup is calculated relative to the best serial implementation of the algorithm.
- Maximum achievable speedup is called linear speedup
- If each process finishes in $\frac{T_s}{p}$ time, then $S = p$
- If $S > p$ => superlinear speedup
- Superlinear speedup can only happen if the sequential algorithm is at a disadvantage compared to the parallel version
- Data too large to fit into 1 processor’s cache => data accesses are slower for serial algorithm
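- Worked example (the numbers are made up): with $T_s = 12$ s and $T_p = 4$ s on $p = 4$ processes, $S = \frac{12}{4} = 3 < p$, i.e., sublinear; $S = 4$ would be linear speedup, and $S > 4$ superlinear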
- e.g.,
- Old program: $T = T_1 + T_2$
- New program: $T' = T_1 + T_2'$
- $T_1$: Time that cannot be enhanced
- $T_2$: Time that can be enhanced
- $T_2'$: Time of the enhanced part after the enhancement
- Overall speedup: $S_{overall} = \frac{T}{T'}$
- Amdahl’s law
- Suppose only part of an application is parallel
- $T_1$ = fraction of work done sequentially
- $T_2 = 1 - T_1$ is fraction parallelizable
- p = number of processors
- Speedup(p) = $\frac{T}{T'}$
- Speedup(p) $\leq \frac{1}{T_1 + \frac{1 - T_1}{p}} \leq \frac{1}{T_1}$
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the sketch below)
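A small sketch of this bound in Python (the function name and the 10%-sequential example are illustrative):

```python
def amdahl_speedup_bound(seq_fraction, p):
    # Upper bound on speedup when seq_fraction of the work is inherently
    # sequential and the rest parallelizes perfectly over p processors.
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / p)

# With 10% sequential work: 16 processors give at most ~6.4x,
# and no processor count can push past 1 / 0.1 = 10x.
for p in (4, 16, 64, 1_000_000):
    print(p, round(amdahl_speedup_bound(0.1, p), 2))
```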
Efficiency
- The fraction of time during which processes do useful work
- $E = \frac{S}{p}$
- What is ideal efficiency?
- 1
- What is the range of values for E?
- 0 to 1
- What is the efficiency of calculating the sum of n array elements on n processes?
- $T_s = \Theta(n)$, while a tree reduction gives $T_p = \Theta(\log n)$, so $S = \Theta(\frac{n}{\log n})$ (see the sketch below)
- E = $\frac{\Theta(\frac{n}{\log n})}{n}$
- E = $\Theta(\frac{1}{\log n})$
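A serial simulation of the tree reduction behind this analysis (the pairing scheme is one illustrative choice): counting rounds confirms that n values are summed in $\Theta(\log n)$ parallel steps when each round's additions run concurrently.

```python
import math

def tree_sum(values):
    # Each round combines pairs of partial sums; on n processes all pairs in
    # a round run concurrently, so n inputs need ceil(log2(n)) rounds.
    rounds = 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:              # odd element carries over to the next round
            paired.append(values[-1])
        values, rounds = paired, rounds + 1
    return values[0], rounds

total, rounds = tree_sum(list(range(1024)))
print(total, rounds, math.ceil(math.log2(1024)))  # 523776 10 10
```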
Reporting
Examples of how to fool the masses when reporting results from your parallel program
- Use 64-bit precision for the baseline/serial numbers and 32-bit for the parallel numbers:
- Correct approach: Use the same precision for both the parallel implementation and the serial/baseline
- This type of “cheating” in speedup reports often happens in GPU parallel programming, where single-precision is faster than double-precision computation
- Use a bad algorithm for the baseline:
- Correct approach: Always optimize the serial algorithm first and use it as the baseline for speedups
- Use a bad implementation for the baseline:
- Correct approach: If, while optimizing the parallel code, you realize the serial version could have been optimized better, go back, optimize the serial code, and use that as the baseline
- Don’t report running times at all:
- Correct approach: Report running times as well as speedup