Parallel Architectures

CSC367 Week 3

Posted by Dino on October 1, 2021

Models

  • Shared Memory: Pthreads, OpenMP
  • Distributed Memory: MPI
  • SIMD and Vector: CUDA
  • Hybrid: A mix of the above

Parallel Algorithm Design: Outline

  • Tasks: Decomposition, Task Dependency, Granularity, Interaction, Mapping, Balance
  • Decomposition techniques
  • Mapping techniques to reduce parallelism overhead
  • Parallel algorithm models
  • Parallel program performance models

Decomposition and Tasks

  • Decomposition: Dividing the computation in a program into tasks that can be executed in parallel
  • Task: A unit of computation that can be extracted from the main program, assigned to a process, and run concurrently with other tasks
  • How tasks are extracted, and how they are mapped to processes, affects performance

Parallel task decomposition

  • Tasks can range from individual instructions to entire programs
  • Which one is best?
    • The answer is always “it depends” .. on the specific application and the parallel platform
  • e.g. Matrix-vector multiplication
    • Multiply a 4 * 4 dense matrix A with a vector b of size 4 => calculate A x b = c
    • Say that computing each output element is a task (T0 - T3)
      • Consider what each task needs, and whether there are data dependencies (see the sketch below)
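
A minimal sketch of this decomposition in C (the concrete matrix values and the helper name `task_compute_ci` are illustrative, not from the lecture): each call is one task Ti that computes a single element of c, and no task needs another task's output.

```c
#include <stdio.h>

#define N 4

/* One "task" per output element: task Ti computes c[i] as the dot product of
 * row i of A with b.  The four tasks are independent of each other, so they
 * could run concurrently. */
void task_compute_ci(double A[N][N], double b[N], double c[N], int i) {
    c[i] = 0.0;
    for (int j = 0; j < N; j++)
        c[i] += A[i][j] * b[j];
}

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double b[N] = {1, 1, 1, 1};
    double c[N];

    for (int i = 0; i < N; i++)   /* tasks T0..T3: no task reads another's output */
        task_compute_ci(A, b, c, i);

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```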

Task dependencies

  • Tasks are not independent if they have dependencies on other tasks
  • A task might need data produced by other tasks => must wait until input is ready
  • Dependencies create an ordering of task execution => task dependency graph
    • Directed acyclic graph (DAG): tasks as nodes, dependencies as edges
    • Start nodes: no incoming edges;
    • Finish nodes: no outgoing edges

Granularity

  • Granularity: determined by how many tasks there are and what their sizes are
    • Coarse-grained: A small number of large tasks
    • Fine-grained: A large number of small tasks
    • e.g. Matrix-vector multiplication:
      • Fine-grained: Each task = process a single element of c
      • Coarse-grained: Each task = process half the elements of c
  • Communication between tasks may or may not be necessary
  • Ideal parallelism: no communication needed
  • Coarse-grained parallelism: Lots of computation performed before communication is necessary
    • Good match for message-passing environments
  • Fine-grained parallelism: Frequent communication may be necessary
    • More suitable for shared memory environments (Pthreads, OpenMP)
  • Parallelism granularity = How much processing is performed before communication is necessary between processes
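
A rough sketch of the two granularities in OpenMP (the flat row-major layout, N, and the function names are illustrative): the computation is identical, and only how much of c each task is responsible for changes.

```c
#include <omp.h>

#define N 1024

/* Fine-grained decomposition: each task is a single element of c
 * (chunk size 1: one loop iteration handed out at a time). */
void matvec_fine(const double *A, const double *b, double *c) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * b[j];
        c[i] = sum;
    }
}

/* Coarse-grained decomposition: each task is a large contiguous block of c
 * (the default static schedule gives each thread roughly N/num_threads rows). */
void matvec_coarse(const double *A, const double *b, double *c) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * b[j];
        c[i] = sum;
    }
}
```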

Degree of concurrency

  • Maximum degree of concurrency = Max number of tasks that can be executed simultaneously at any given time
    • Typically less than the total number of tasks, if tasks have dependencies
  • Average degree of concurrency = Average number of tasks that can be executed concurrently, during the program’s execution

  • Critical Path = The longest path between any pair of start and finish nodes

  • Critical path length = The sum of node weights along the critical path

  • Average degree of concurrency = $\frac{\text{total amount of work (sum of task weights)}}{\text{critical path length}}$
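  • For example, a hypothetical DAG with 10 unit-weight tasks whose critical path contains 4 nodes has an average degree of concurrency of $\frac{10}{4} = 2.5$, even though its maximum degree of concurrency may be higher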

Granularity and concurrency

  • The finer-grained the decomposition, the more concurrency is available
  • More concurrency => more potential tasks to run in parallel
  • So can we reduce program execution time just by making the decomposition ever more fine-grained?
  • Not quite true
    • Inherent limits to fine-grained decomposition, e.g., hitting indivisible tasks, or tasks which cause slowdown if split up
    • For example, if a task multiplies one element of A with one element of b to produce a partial value of an element of c, then all tasks working on the first row of A have to interact!
    • More tasks => Potentially more dependencies => More overhead

Task Interactions

  • A task dependency graph only captures producer-consumer interactions
    • A task’s output is used as another task’s input
  • Interactions might occur among tasks that are independent in the task dependency graph
    • Tasks on different processors might need to exchange data or synchronize
  • Tasks may share data via task interactions
    • Read-only interactions: Tasks only need to read data shared with other tasks
    • Read-write interactions: Tasks can read or write data shared with other tasks
  • Type of sharing can affect which tasks should get mapped to which processes
    • Read-write interactions should be kept on the same process as much as possible

Mapping tasks to processes

  • Mapping = Assigning tasks to processes
  • The choice of decomposition affects the ability to select a good mapping
  • Goals of a good mapping:
    • Maximize the use of concurrency
    • Minimize the total completion time
    • Minimize interaction among processes
  • Often, the task decomposition and mapping can result in conflicting goals
    • Must find a good balance to optimize for all goals
  • Degree of concurrency is affected by decomposition choice
  • Mapping affects how much of the concurrency can be efficiently utilized

Tasks Size and Balance

  • Task size = Proportional to time needed to complete the task
    • Uniform tasks: Require roughly the same amount of time
    • Non-uniform tasks: Execution times vary widely
  • Size of data associated with tasks = How much data does each task process
    • Impacts whether the tasks are well-balanced
    • Affects performance if a task’s data must be moved from a remote processor
    • Input data might be small but the output data large, or vice versa, etc.

Decomposition Techniques

  • Recursive Decomposition: Primarily decomposes the computation into tasks
  • Data decomposition: Partitions the data to induce task decomposition

Recursive Decomposition

  • Recursive decomposition is primarily based on task decomposition
  • Useful for problems that can be approached using a divide-and-conquer strategy
  • Divide the problem into subproblems, solve the subproblems by recursively subdividing them in the same way, and combine the results
  • Subproblems can be solved concurrently
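
A hedged sketch of recursive decomposition with OpenMP tasks (the array-sum problem and the CUTOFF threshold are illustrative, not from the lecture): each level of the recursion spawns a task for one half of the range, so the two subproblems can be solved concurrently before their results are combined.

```c
#include <omp.h>

#define CUTOFF 1000   /* illustrative: below this size, solve serially */

/* Divide-and-conquer sum: divide the range, solve the halves concurrently,
 * and combine the partial results. */
long recursive_sum(const int *a, long lo, long hi) {
    if (hi - lo <= CUTOFF) {                 /* base case: small subproblem */
        long s = 0;
        for (long i = lo; i < hi; i++) s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2, left, right;
    #pragma omp task shared(left)            /* one subproblem becomes a new task */
    left = recursive_sum(a, lo, mid);
    right = recursive_sum(a, mid, hi);       /* the other runs in the current task */
    #pragma omp taskwait                     /* dependency: wait before combining */
    return left + right;
}
```

The first call would typically be made from inside a `#pragma omp parallel` region with `#pragma omp single`, so the spawned tasks are picked up by the whole thread team.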

Data Decomposition

  • Partition the data on which computations are performed
  • Use the data partitioning to perform the decomposition of computation into tasks
  • Used to exploit concurrency in problems that operate on large data
  • Data decomposition is typically performed in two stages:
    • Step 1: Partition the data
    • Step 2: Induce task decomposition from the partitioned data
  • Data partitioning comes in different flavors:
    • Partition output data
    • Partition input data
    • Partition both input and output data
  • e.g. Matrix-vector example
  • Partition output data
    1. Each element of the output can be computed independently; in this case, this induces a partitioning of the input matrix as well
    2. Decompose the computation into tasks:
      • Option 1: 4 tasks, each computes 2 consecutive elements of the result
      • Option 2: 2 tasks, each computes 4 consecutive elements of the result (coarser-grained)
      • Option 3: 4 tasks, each computes 2 non-consecutive elements of c
    • Output data partitioning: Typically good if parts of the output can be naturally computed as a function of the input data
  • Partition input data
  • Matrix-vector example:
    • Row-wise partitioning of A (partition b similarly)
      • If each task takes one row of A and one item of b, are there any task dependencies?
        • If we want a task to compute one element of c, then tasks must exchange data to get all of b
    • Column-wise partitioning
      • Tasks don’t need to exchange data, but they have to synchronize, because each element of c is computed with contributions from all tasks (see the sketch below)
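
A hedged OpenMP sketch of the two partitionings just described (flat row-major storage of A; N and the function names are illustrative): with row-wise partitioning each task writes only its own elements of c, while with column-wise partitioning every task contributes partial results to every element of c, so a combining/synchronization step is unavoidable.

```c
#include <omp.h>
#include <string.h>

#define N 1024

/* Row-wise partitioning: each task owns a block of rows of A and computes the
 * corresponding elements of c.  Every task reads all of b, but no two tasks
 * write the same element of c, so no synchronization is needed between them. */
void matvec_rowwise(const double *A, const double *b, double *c) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * b[j];
        c[i] = sum;
    }
}

/* Column-wise partitioning: each task owns a block of columns of A and the
 * matching elements of b, and produces a partial contribution to *every*
 * element of c.  The partials must be combined, which is where the tasks
 * synchronize (here via per-thread buffers and a critical section). */
void matvec_colwise(const double *A, const double *b, double *c) {
    memset(c, 0, N * sizeof(double));
    #pragma omp parallel
    {
        double partial[N] = {0};             /* local intermediate results */
        #pragma omp for
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                partial[i] += A[i * N + j] * b[j];
        #pragma omp critical                  /* combining step = synchronization */
        for (int i = 0; i < N; i++)
            c[i] += partial[i];
    }
}
```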

Mapping Techniques

Why care about mapping tasks carefully: what if we just randomly assign tasks to processors?

  • An efficient task mapping is critical to minimize parallel processing overheads:
    • Load imbalance
    • Inter-process communication: the main culprits are synchronization and data sharing

Mapping tasks to processes

  • Mapping goal: All tasks must complete in the shortest possible time
  • To do so, minimize overheads of task execution
    1. Load Balancing: Minimize the time spent idle by some processes
    2. Minimize the time spent in interactions among processes
  • The two goals can be conflicting
    • To optimize 2, put interacting tasks on the same processor
      • Can lead to load imbalance and idling
    • To optimize 1, break down tasks into fine-grained pieces, to ensure good load balance
      • Can lead to a lot more interaction overheads
  • Must carefully balance the two goals in the context of the problem

Static Mapping

  • Static Mapping: Assign tasks to processes before execution starts
  • Static mapping allows for static load balancing
  • Mapping quality depends on knowledge of task sizes, size of data associated with tasks, characteristics of task interactions, and parallel programming paradigm
  • If task sizes not known => Can potentially lead to severe load imbalances
  • Usually done with static and uniform partitioning of data: Data parallel problems!
  • Tasks are tied to chunks of data generated by the partitioning approach
  • Mapping tasks to processes is then essentially tied to mapping data to processes (see the sketch below)
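
A minimal sketch of a static block mapping of uniform data (the helper name and the parameters my_rank and P are illustrative): each process computes, before execution starts, which contiguous block of the N items it owns, so the task-to-process mapping is fixed up front.

```c
/* Static block partitioning: process my_rank (0..P-1) is assigned the
 * contiguous range [*lo, *hi) of the N data items, decided before execution.
 * Mapping tasks to processes is then the same as mapping data blocks. */
void my_block(int my_rank, int P, long N, long *lo, long *hi) {
    long chunk = N / P, rem = N % P;             /* spread the remainder evenly */
    *lo = my_rank * chunk + (my_rank < rem ? my_rank : rem);
    *hi = *lo + chunk + (my_rank < rem ? 1 : 0);
}
```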

Dynamic Mapping

  • Dynamic Mapping: Assign tasks to processes during execution
  • Dynamic mapping allows for dynamic load balancing
    • If task sizes are unknown => dynamic mappings are more effective than static ones
    • If much more data than computation => Large overheads for data movement => static may be preferable
    • Depends on the parallel paradigm and interaction type though (shared address space vs distributed memory, read-only vs read-write interaction)
  • Common scheme for dynamic mapping
    • Keep tasks in a centralized pool of tasks, assign them as processes become idle
      • The process managing the pool of ready tasks => Master process
      • Other processes performing the tasks => Worker Processes
    • Tasks may get added to the pool, concurrently with the workers taking tasks out
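
A hedged Pthreads sketch of the centralized-pool scheme on shared memory (the counter-based "pool" and the task contents are illustrative; in a distributed-memory setting the pool would instead live in a dedicated master process): idle workers repeatedly grab the next available task until the pool is empty.

```c
#include <pthread.h>

#define NUM_TASKS   64
#define NUM_WORKERS 4

static int next_task = 0;                        /* the shared pool: just an index here */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void run_task(int t) { (void)t; /* ... possibly non-uniform work ... */ }

/* Worker: take a task whenever idle, so faster workers naturally do more tasks. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);
        int t = (next_task < NUM_TASKS) ? next_task++ : -1;   /* grab next task, if any */
        pthread_mutex_unlock(&pool_lock);
        if (t < 0) break;                        /* pool exhausted */
        run_task(t);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++) pthread_join(tid[i], NULL);
    return 0;
}
```

Because workers pull work only when idle, a worker that draws large tasks simply ends up taking fewer of them, which is exactly the dynamic load balancing described above.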

Methods for containing interaction overheads

  • Maximizing data locality
  • Minimizing contention and hot spots
  • Overlapping computations with interactions
  • Replicating data or computations

Maximizing data locality

  • Processes share data and/or may require data generated by other processes
  • Goals:
    1. Minimize the volume of interaction overheads (Minimize non-local data accesses and maximize local data utilization)
      • Minimize the volume of shared data and maximize cache reuse
      • Use a suitable decomposition and mapping
      • Use local data to store intermediate results (decreases the amount of data exchange)
    2. Minimize the frequency of interactions
      • Restructure the algorithm to access and use large chunks of shared data (amortize the interaction cost by reducing the frequency of interactions)
      • Shared address space: Spatial locality in a cache line
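
As one hedged illustration of the second goal in a message-passing setting (the function name and N are illustrative): share the data in a single large interaction instead of many small ones, so the per-interaction startup cost is paid once and amortized over the whole chunk.

```c
#include <mpi.h>

#define N 100000

/* Reduce interaction *frequency*: one large transfer instead of N tiny ones. */
void send_chunk(const double *data, int dest) {
    /* Costly alternative: N separate interactions, each paying latency:
     *   for (int i = 0; i < N; i++)
     *       MPI_Send(&data[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
     */
    MPI_Send(data, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);   /* one interaction */
}
```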

Minimizing contention and hotspots

  • Accessing shared data concurrently can generate contention
    • e.g., Concurrent accesses to the same memory block, flooding a specific process with messages, etc
  • Solutions:
    • Restructure the program to reorder accesses in a way that does not lead to contention
    • Decentralize the shared data, to eliminate the single point of contention
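
A short OpenMP sketch of decentralizing a hot spot (the global-sum problem is illustrative): instead of every thread updating one contended shared accumulator, each thread keeps a private partial sum and the partials are combined once at the end.

```c
#include <omp.h>

#define N 1000000

/* Avoiding a hot spot: the reduction clause gives every thread its own private
 * copy of total (no contended shared update inside the loop) and combines the
 * per-thread partial sums once at the end. */
double sum_no_hotspot(const double *a) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < N; i++)
        total += a[i];
    return total;
}
```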

Overlap computations with interactions

  • A process may idle waiting for shared data => do useful computations while waiting
  • Strategies:
    • Initiate an interaction earlier than necessary, so it’s ready when needed
    • Grab more tasks in advance, before current task is completed
  • May be supported in software, or hardware
  • Harder to implement with shared memory models, applies more to distributed and GPU architectures
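
A minimal MPI sketch (buffer names and the doubling "useful work" are illustrative): the receive is initiated early with a nonblocking call, independent local work proceeds while the message is in flight, and the process waits only when the remote data is actually needed.

```c
#include <mpi.h>

/* Overlap computation with interaction: initiate the receive early, compute on
 * local data while the message is in flight, block only when the remote data
 * is really required. */
void compute_with_overlap(double *local, int n_local,
                          double *remote, int n_remote, int src) {
    MPI_Request req;
    MPI_Irecv(remote, n_remote, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < n_local; i++)     /* useful work needing no remote data */
        local[i] *= 2.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* by now the data has likely arrived */
    /* ... continue with computation that uses remote[] ... */
}
```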

Replicate data or computations

  • To reduce contention for shared data, may be useful to replicate it on each process => no interaction overheads
  • Beneficial if the shared data is accessed in a read-only fashion
    • Shared-address space paradigm: cache local copies of the data
    • Message-passing paradigm: keep a local replica of the data on each process
  • Disadvantages:
    • Increases overall memory usage by keeping replicas => use sparingly
    • If shared data is read-write, the copies must be kept coherent => the overheads might dwarf the benefits of local accesses via replication
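
In the message-passing case, replicating read-only data often amounts to a single broadcast at the start of the computation; a sketch that reuses the b vector from the matrix-vector example purely for illustration:

```c
#include <mpi.h>

#define N 1024

/* Replicate the read-only vector b on every process once, so all later reads
 * of b are local and incur no interaction overhead.  The price is N extra
 * doubles of memory per process; with read-write data we would also have to
 * keep the copies coherent. */
void replicate_b(double *b, int root) {
    MPI_Bcast(b, N, MPI_DOUBLE, root, MPI_COMM_WORLD);
}
```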