Parallel Algorithm Models

CSC367 Week 4

Posted by Dino on October 8, 2021

Parallel algorithm models

Model = 1 decomposition type + 1 mapping type + strategies to minimize interactions

Commonly used parallel algorithm models:

  • Data parallel model
  • Work pool model
  • Master-slave model

Data parallel model

  • Decomposition: Typically static and uniform data partitioning
  • Mapping: Static
  • Same operations on different data items, aka data parallelism
  • Possible optimization strategies:
    • Choose a locality-preserving decomposition
    • Overlap computation with interaction
  • This model scales really well with problem size; a minimal OpenMP sketch follows below
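
To make the model concrete, here is a minimal sketch of data parallelism in C with OpenMP (illustrative, not the course's reference code). The same operation is applied to every element of an array, and `schedule(static)` gives each thread a contiguous, locality-preserving chunk; the array size and the operation itself are arbitrary placeholders.

```c
/* Data parallel model: same operation on different data items, with a
 * static, uniform partition of iterations across threads.
 * Compile with something like: gcc -fopenmp example.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* schedule(static): static mapping of contiguous iteration chunks. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[%d] = %f (max threads: %d)\n", N - 1, b[N - 1], omp_get_max_threads());
    return 0;
}
```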

Work pool model

  • Tasks are taken by processes from a common pool of work (a pthreads sketch follows after this list)
  • Decomposition: Highly depends on the problem
    • Can be statically available at start, or dynamically create more tasks during execution
  • Mapping: Dynamic
    • Any task can be performed by any process
  • Possible strategies for reducing interactions:
    • Adjust granularity: Tradeoff between load imbalance and overhead of accessing work pool
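
A minimal sketch of the work pool model in C with pthreads (illustrative; the task counts and the per-task work are placeholders). The "pool" is just a shared counter protected by a mutex, and any idle thread dynamically claims the next unclaimed task. With tasks this small, claiming them one at a time would be poor granularity in practice; real code would grab chunks of tasks to reduce the overhead of accessing the pool.

```c
/* Work pool model: dynamic mapping of tasks to threads via a shared counter.
 * Compile with something like: gcc -pthread pool.c */
#include <stdio.h>
#include <pthread.h>

#define NUM_TASKS   64
#define NUM_THREADS 4

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;                 /* the "pool": next unclaimed task */
static double results[NUM_TASKS];

static void do_task(int t) {
    /* Placeholder work associated with task t. */
    double sum = 0.0;
    for (int i = 0; i < 100000; i++)
        sum += t * 0.001 + i;
    results[t] = sum;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);
        int t = (next_task < NUM_TASKS) ? next_task++ : -1;   /* claim a task */
        pthread_mutex_unlock(&pool_lock);
        if (t < 0)
            break;                        /* pool is empty */
        do_task(t);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("completed %d tasks\n", NUM_TASKS);
    return 0;
}
```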

Master-slave model

  • Commonly used in distributed parallel architectures
  • A master process generates work and allocates it to worker processes
    • Could involve several masters, or a hierarchy of master processes
  • Decomposition: Highly depends on the problem
    • Might be static if tasks are easy to break down a priori, or dynamic
  • Mapping: Often dynamic
    • Any worker can be assigned any of the tasks
  • Possible strategies for reducing interactions:
    • Choose granularity carefully so that master does not become a bottleneck
    • Overlap computation on workers with the master generating further tasks (an MPI-style sketch of this model follows below)
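
Below is a hedged sketch of a master/worker loop in C with MPI (the task, squaring an integer, and the tags are made up for illustration; this is not the course's reference implementation). Rank 0 hands out task indices on demand and collects results, so any worker can end up with any task.

```c
/* Master-slave model: rank 0 generates task indices and hands them out on
 * demand; the per-task work (squaring) is a placeholder.
 * Build/run with something like: mpicc ms.c && mpirun -np 4 ./a.out */
#include <stdio.h>
#include <mpi.h>

#define NUM_TASKS 20
#define TAG_WORK  1
#define TAG_STOP  2

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int next = 0, result, dummy = 0;
        MPI_Status st;

        /* Seed every worker with one task (or a stop if none are left). */
        for (int w = 1; w < size; w++) {
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        /* Collect results; reply with a new task or a stop message. */
        int busy = next;                  /* workers currently holding a task */
        while (busy > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            printf("result %d from worker %d\n", result, st.MPI_SOURCE);
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                busy--;
            }
        }
    } else {
        /* Worker: receive tasks until told to stop. */
        int task, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            result = task * task;         /* placeholder computation */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```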

Parallel program performance model

  • Parallel execution over N processes rarely results in N-fold performance gain
  • Overheads of inter-process communication, idling, and/or excess computation

Performance metrics for parallel programs

  • Execution time
  • Speedup
  • Efficiency
  • Example performance plots
  • Ways to fool the masses with performance results

Execution time

  • Execution time
    • Time elapsed from the start of the parallel computation to the moment the last process finishes
    • Determined by the slowest process (see the timing sketch below)
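
A minimal sketch of how this is typically measured (illustrative, using OpenMP wall-clock timing): the elapsed wall time is taken around the whole parallel region, so it ends only once the slowest thread has finished.

```c
/* Measuring parallel execution time with wall-clock timestamps.
 * Compile with something like: gcc -fopenmp timing.c */
#include <stdio.h>
#include <omp.h>

/* Placeholder work, deliberately imbalanced so the measured time
 * is set by the slowest thread. */
static void do_work(int tid) {
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L * (tid + 1); i++)
        x += (double)i;
}

int main(void) {
    double start = omp_get_wtime();     /* start of the parallel computation */

    #pragma omp parallel
    {
        do_work(omp_get_thread_num());
    }                                   /* implicit barrier: waits for the slowest thread */

    double elapsed = omp_get_wtime() - start;
    printf("parallel execution time: %.3f s\n", elapsed);
    return 0;
}
```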

Speedup

  • Performance gain of the parallel algorithm over the sequential version
    • S = $\frac{T_s}{T_p}$
    • $T_s$: Time of the serial execution
    • $T_p$: Time to execute on $p$ processors
  • Speedup is calculated relative to the best serial implementation of the algorithm.
  • Maximum achievable speedup is called linear speedup
    • Each process takes $\frac{T_s}{p}$ time => $S = p$
  • If $S > p$ => superlinear speedup
  • Superlinear speedup can only happen if the sequential algorithm is at a disadvantage compared to the parallel version
    • Data too large to fit into one processor’s cache => data accesses are slower for the serial algorithm (in the parallel version, each partition fits in cache)
  • e.g.,
    • Old Program: $T = T_1 + T_2$
    • New Program: $T' = T_1 + T_2'$
    • $T_1$: Time that cannot be enhanced
    • $T_2$: Time that can be enhanced
    • $T_2'$: Time of the enhanceable part after the enhancement
    • Speedup: $S_{overall} = \frac{T}{T'}$
  • Amdahl’s law
    • Suppose only part of an application is parallel
    • $T_1$ = fraction of work done sequentially (with total time normalized to $T = 1$)
    • $T_2 = 1 - T_1$ = fraction that is parallelizable
    • $p$ = number of processors
    • Speedup(p) = $\frac{T}{T'}$
    • Speedup(p) $\leq$ $\frac{1}{T_1 + \frac{1 - T_1}{p}}$ $\leq$ $\frac{1}{T_1}$
    • Even if the parallel part speeds up perfectly, performance is limited by the sequential part (a worked example follows below)
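
For example, with an illustrative serial fraction of $T_1 = 0.2$ (20% sequential) and $p = 8$ processors:

$Speedup(8) \leq \frac{1}{0.2 + \frac{1 - 0.2}{8}} = \frac{1}{0.3} \approx 3.3$

and even with arbitrarily many processors, the speedup can never exceed $\frac{1}{T_1} = \frac{1}{0.2} = 5$.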

Efficiency

  • The fraction of time that processes spend doing useful work
    • $E = \frac{S}{p}$
  • What is ideal efficiency?
    • 1
  • What are the ranges of values for E?
    • 0 to 1
  • What is the efficiency of calculating the sum of n array elements on n processes?
    • $E = \frac{\Theta (\frac{n}{\log n})}{n}$
    • $E = \Theta (\frac{1}{\log n})$ (see the derivation below)
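
Where that speedup term comes from, assuming the serial sum takes $T_s = \Theta(n)$ and an $n$-process tree-based reduction takes $T_p = \Theta(\log n)$:

$S = \frac{T_s}{T_p} = \Theta(\frac{n}{\log n})$, so $E = \frac{S}{n} = \Theta(\frac{1}{\log n})$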

Reporting

Examples of how to fool the masses when reporting results from your parallel program

  1. Use 64-bit (double-precision) arithmetic for the baseline/serial version and 32-bit (single-precision) for the parallel numbers:
    • Correct approach: Use the same precision for both the parallel implementation and the serial/baseline
    • This type of “cheating” in speedup reports often happens in GPU parallel programming, where single-precision arithmetic is much faster than double-precision
  2. Use a bad algorithm for the baseline:
    • Correct approach: Always optimize the serial algorithm first and use it as the baseline for speedups
  3. Use a bad implementation for the baseline:
    • Correct approach: If, while optimizing the parallel code, you realize the serial version could have been optimized better, go back, optimize the serial code, and use that as the baseline
  4. Don’t report running times at all:
    • Correct approach: Report running times as well as speedup