Models
- Shared Memory: Pthreads, OpenMP
- Distributed Memory: MPI
- SIMD and Vector: CUDA
- Hybrid: A mix of the above
Parallel Algorithm Design: Outline
- Tasks: Decomposition, Task Dependency, Granularity, Interaction, Mapping, Balance
- Decomposition techniques
- Mapping techniques to reduce parallelism overhead
- Parallel algorithm models
- Parallel program performance models
Decomposition and Tasks
- Decomposition: Dividing the computation in a program into tasks that can be executed in parallel
- Task: Unit of computation that can be extracted from the main program and assigned to a process, and which can be run concurrently with other tasks
- The way tasks are extracted, and how they are mapped to processes, affects performance
Parallel task decomposition
- Tasks can range from individual instructions to entire programs
- Which one is best?
- The answer is always “it depends” on the specific application and the parallel platform
- e.g. Matrix-vector multiplication
- Multiply a 4 × 4 dense matrix A with a vector b of size 4 => calculate A × b = c
- Say that computing each output element is a task (T0–T3)
- Consider what each task needs, and if there are data dependencies
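A minimal sketch (not from the notes) of this decomposition in C: each output element c[i] is one task T0..T3, and the loop body shows what data each task reads and writes.

```c
#include <stdio.h>

#define N 4

/* Sketch of the per-output-element decomposition: computing each c[i] is one
 * task. Each task reads row i of A and all of b, and writes only c[i], so the
 * tasks have no dependencies on each other. */
int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double b[N] = {1, 1, 1, 1}, c[N];

    for (int i = 0; i < N; i++) {      /* task T_i: could run concurrently with the others */
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * b[j];     /* reads row i of A and all of b */
        c[i] = sum;                    /* writes only c[i] */
    }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```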
Task dependencies
- Tasks are not independent if they have dependencies on other tasks
- A task might need data produced by other tasks => must wait until input is ready
- Dependencies create an ordering of task execution => task dependency graph
- Directed acyclic graph (DAG): tasks as nodes, dependencies as edges
- Start nodes: no incoming edges;
- Finish nodes: no outgoing edges
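A small illustrative sketch (an assumption, not the lecture's code) of representing a task dependency DAG: tasks are nodes, dependencies are edges, and the in-degree tells us which tasks are start nodes and ready to run.

```c
#include <stdio.h>

#define NTASKS 4

/* Task dependency DAG stored as an edge list plus in-degree counts.
 * A task with in-degree 0 is a start node and is ready to run; finishing a
 * task would decrement the in-degree of its successors. */
int main(void) {
    /* edges[e][0] -> edges[e][1]: task 0 feeds tasks 1 and 2, which feed task 3 */
    int edges[][2] = {{0,1}, {0,2}, {1,3}, {2,3}};
    int nedges = 4, indeg[NTASKS] = {0};

    for (int e = 0; e < nedges; e++)
        indeg[edges[e][1]]++;

    for (int t = 0; t < NTASKS; t++)
        printf("task %d: in-degree %d%s\n", t, indeg[t],
               indeg[t] == 0 ? " (start node, ready)" : "");
    return 0;
}
```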
Granularity
- Granularity: determined by how many tasks and what their sizes are
- Coarse-grained: A small number of large tasks
- Fine-grained: A large number of small tasks
- e.g. Matrix-vector multiplication
- Fine-grained: Each task = process a single element of c
- Coarse-grained: Each task = process half the elements of c
- Communication between tasks may or may not be necessary
- Ideal parallelism: no communication needed
- Coarse-grained parallelism: Lots of computation performed before communication is necessary
- Good match for message-passing environments
- Fine-grained parallelism: Frequent communication may be necessary
- More suitable for shared memory environments (Pthreads, OpenMP)
- Parallelism granularity = How much processing is performed before communication is necessary between processes
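A hedged sketch of the two granularities using OpenMP chunk sizes, assuming one loop chunk corresponds to one task; the computation itself is a placeholder.

```c
#include <stdio.h>

#define N 8

/* Illustrative only: the same N-element computation decomposed at two
 * granularities by varying the OpenMP chunk size (here one chunk = one task). */
int main(void) {
    double c[N];

    #pragma omp parallel for schedule(static, 1)    /* fine-grained: 1 element per task */
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * i;

    #pragma omp parallel for schedule(static, N/2)  /* coarse-grained: half of c per task */
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * i;

    printf("c[0] = %g, c[N-1] = %g\n", c[0], c[N - 1]);
    return 0;
}
```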
Degree of concurrency
- Maximum degree of concurrency = Max number of tasks that can be executed simultaneously at any given time
- Typically less than the total number of tasks, if tasks have dependencies
- Average degree of concurrency = Average number of tasks that can be executed concurrently during the program’s execution
- Critical Path = The longest path between any pair of start and finish nodes
- Critical path length = The sum of node weights along the critical path
- Average degree of concurrency = $\frac{\text{total \# tasks}}{\text{critical path length}}$
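A small worked example (the numbers are illustrative, assuming unit task weights so the critical path length equals the number of tasks on it):

```latex
% Example DAG: 7 unit-weight tasks, longest start-to-finish path contains 3 tasks
\[
\text{critical path length} = 3,
\qquad
\text{average degree of concurrency}
  = \frac{\text{total \# tasks}}{\text{critical path length}}
  = \frac{7}{3} \approx 2.33
\]
```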
Granularity and concurrency
- The finer-grained the decomposition, the more concurrency is available
- More concurrency => more potential tasks to run in parallel
- If so, can we reduce program execution time simply by making the task decomposition ever more fine-grained?
- Not quite true
- Inherent limits to fine-grained decomposition, e.g., hitting indivisible tasks, or tasks which cause slowdown if split up
- For example, if a task multiplies one element of A with one element of b to store a partial value of an element of c, then all tasks working on the first row of A have to interact!
- More tasks => Potentially more dependencies => More overhead
Task Interactions
- A task dependency graph only captures producer-consumer interactions
- A task’s output is used as another task’s input
- Interactions might occur among tasks that are independent in the task dependency graph
- Tasks on different processors might need to exchange data or synchronize
- Tasks may share data via task interactions
- Read-only interactions: Tasks only need to read data shared with other tasks
- Read-write interactions: Tasks can read or write data shared with other tasks
- Type of sharing can affect which tasks should get mapped to which processes
- Read-write interactions should be kept on the same process as much as possible
Mapping tasks to processes
- Mapping = Assigning tasks to processes
- The choice of decomposition affects the ability to select a good mapping
- Goals of a good mapping:
- Maximize the use of concurrency
- Minimize the total completion time
- Minimize interaction among processes
- Often, the task decomposition and mapping can result in conflicting goals
- Must find a good balance to optimize for all goals
- Degree of concurrency is affected by decomposition choice
- Mapping affects how much of the concurrency can be efficiently utilized
Tasks Size and Balance
- Task size = Proportional to time needed to complete the task
- Uniform tasks: Require roughly the same amount of time
- Non-uniform tasks: Execution times vary widely
- Size of data associated with tasks = How much data does each task process
- Impacts whether the tasks are well-balanced
- Affects performance if a task’s data must be moved from a remote processor
- Input data might be small but output data large, or vice-versa, etc.
Decomposition Techniques
- Recursive Decomposition: Primarily decomposes tasks
- Data decomposition: Partitions the data to induce task decomposition
Recursive Decomposition
- Recursive decomposition is primarily based on task decomposition
- Useful for problems that can be approached using a divide-and-conquer strategy
- Divide the problem into subproblems, solve the subproblems by subdividing them recursively in the same way, and combine the results
- Subproblems can be solved concurrently
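A minimal divide-and-conquer sketch using OpenMP tasks, assuming an array-sum problem and a CUTOFF below which subproblems are solved serially; both halves are solved concurrently and the results combined.

```c
#include <stdio.h>

#define CUTOFF 1000   /* below this size, solve serially (granularity limit) */

/* Divide-and-conquer sum: split the range, solve the halves concurrently as
 * tasks, combine the partial results. Illustrative sketch only. */
static long sum(const int *a, int lo, int hi) {
    if (hi - lo <= CUTOFF) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)     /* left subproblem as a task */
    left = sum(a, lo, mid);
    #pragma omp task shared(right)    /* right subproblem as a task */
    right = sum(a, mid, hi);
    #pragma omp taskwait              /* combine step waits for both halves */
    return left + right;
}

int main(void) {
    enum { N = 100000 };
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single                /* one thread starts the recursion */
    total = sum(a, 0, N);

    printf("sum = %ld\n", total);
    return 0;
}
```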
Data Decomposition
- Partition the data on which computations are performed
- Use the data partitioning to perform the decomposition of computation into tasks
- Used to exploit concurrency on problems that operate on large data
- Data decomposition is typically performed in two stages:
- Step 1: Partition the data
- Step 2: Induce task decomposition from the partitioned data
- Data partitioning comes in different flavors:
- Partition output data
- Partition input data
- Partition both input and output data
- e.g. Matrix-vector example
- Partition output data
- Each element of the output can be computed independently; in this case, this induces a partitioning of the input matrix as well
- Then decompose the computation into tasks:
- Option 1: 4 tasks, each computes 2 consecutive elements of the result
- Option 2: 2 tasks, each computes 4 consecutive elements of the result (Coarser-grained)
- Option 3: 4 tasks, each computes 2 non-consecutive elements of c
- Output data partitioning: Typically good if parts of the output can be naturally computed as a function of the input data
- Partition input data
- Matrix-vector example (a sketch contrasting the two partitionings follows this list):
- Row-wise partitioning of A; partition b similarly
- If each task takes one row of A and one item of b, are there any task dependencies?
- If we want a task to compute one element of c, then tasks must exchange data to get all of b
- Column-wise partitioning of A
- Tasks don’t need to exchange data, but they have to synchronize because one element of c is computed with the help of all tasks
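A hedged shared-memory sketch of the two input partitionings above: row-wise tasks produce complete elements of c independently, while column-wise tasks each contribute a partial value to every element of c and must synchronize their updates.

```c
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double b[N] = {1, 1, 1, 1}, c[N] = {0}, c2[N] = {0};

    /* Row-wise partitioning: task i owns row i of A (and all of b); iterations
     * are independent, so no exchange or synchronization is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = 0.0;
        for (int j = 0; j < N; j++) c[i] += A[i][j] * b[j];
    }

    /* Column-wise partitioning: task j owns column j of A and b[j], and adds a
     * partial value to every element of c, so the shared updates need atomics. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            #pragma omp atomic
            c2[i] += A[i][j] * b[j];
        }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g  c2[%d] = %g\n", i, c[i], i, c2[i]);
    return 0;
}
```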
Mapping Techniques
Why care about mapping tasks? What if we just randomly assign tasks to processors?
- An efficient task mapping is critical to minimize parallel processing overheads
- Load imbalance
- Inter-process communication: Culprits are synchronization and data sharing
Mapping tasks to processes
- Mapping goal: All tasks must complete in the shortest possible time
- To do so, minimize overheads of task execution
- 1. Load Balancing: Minimize the time spent idle by some processes
- 2. Minimize the time spent in interactions among processes
- The two goals can be conflicting
- To optimize 2, put interacting tasks on the same processor
- Can lead to load imbalance and idling
- To optimize 1, break down tasks into fine-grained pieces, to ensure good load balance
- Can lead to a lot more interaction overheads
- Must carefully balance the two goals in the context of the problem
Static Mapping
- Static Mapping: Assign tasks to process before execution starts
- Static mapping allows for static load balancing
- Mapping quality depends on knowledge of task sizes, size of data associated with tasks, characteristics of task interactions, and parallel programming paradigm
- If task sizes not known => Can potentially lead to severe load imbalances
- Usually done with static and uniform partitioning of data: Data parallel problems!
- Tasks are tied to chunks of data generated by the partitioning approach
- Mapping tasks to processes is essentially tied to mapping data to processes
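A minimal sketch (an assumption of how such a mapping might be computed) of a static, uniform block partitioning decided before execution starts: process p gets a contiguous range of the N data items.

```c
#include <stdio.h>

/* Static block partitioning of N data items over P processes. Handles N not
 * divisible by P by giving the first (N % P) processes one extra item. */
static void block_range(int N, int P, int p, int *lo, int *hi) {
    int base = N / P, extra = N % P;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);   /* half-open range [lo, hi) */
}

int main(void) {
    int N = 10, P = 4;
    for (int p = 0; p < P; p++) {
        int lo, hi;
        block_range(N, P, p, &lo, &hi);
        printf("process %d: items [%d, %d)\n", p, lo, hi);
    }
    return 0;
}
```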
Dynamic Mapping
- Dynamic Mapping: Assign tasks to processes during execution
- Dynamic mapping allows for dynamic load balancing
- If task sizes are unknown => Dynamic mappings are more effective than static ones
- If much more data than computation => Large overheads for data movement => static may be preferable
- Depends on the parallel paradigm and interaction type though (shared address space vs distributed memory, read-only vs read-write interaction)
- Common scheme for dynamic mapping
- Keep tasks in a centralized pool of tasks, assign them as processes become idle
- The process managing the pool of ready tasks => Master process
- Other processes performing the tasks => Worker Processes
- Tasks may get added to the pool, concurrently with the workers taking tasks out
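A hedged MPI sketch of the centralized task-pool scheme: rank 0 is the master managing the pool, the other ranks are workers that request a task whenever they become idle. NUM_TASKS, the message tags, and do_task() are placeholders, not from the notes.

```c
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS   100
#define TAG_REQUEST 1
#define TAG_ASSIGN  2
#define TAG_STOP    3

static void do_task(int id) { (void)id; /* placeholder for real work */ }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master: owns the task pool */
        int next = 0, done = 0;
        while (done < size - 1) {
            int dummy;
            MPI_Status st;
            /* wait for any idle worker to ask for work */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                         MPI_COMM_WORLD);
                next++;
            } else {                          /* pool empty: tell worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                                  /* workers: ask for tasks until told to stop */
        while (1) {
            int task;
            MPI_Status st;
            MPI_Send(&rank, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            do_task(task);
        }
    }
    MPI_Finalize();
    return 0;
}
```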
Methods for containing interaction overheads
- Maximizing data locality
- Minimizing contention and hot spots
- Overlapping computations with interactions
- Replicating data or computations
Maximizing data locality
- Processes share data and/or may require data generated by other processes
- Goals:
- Minimize the volume of interaction overheads (Minimize non-local data accesses and maximize local data utilization)
- Minimize the volume of shared data and maximize cache reuse
- Use a suitable decomposition and mapping
- Use local data to store intermediate results (decreases the amount of data exchange)
- Minimize the frequency of interactions
- Restructure the algorithm to access and use large chunks of shared data (amortize interaction cost by reducing the frequency of interactions)
- Shared address space: Spatial locality in a cache line
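A small OpenMP sketch of "use local data to store intermediate results": each thread accumulates into a private partial sum and touches the shared total only once, reducing both the volume and the frequency of interactions. Illustrative only.

```c
#include <stdio.h>

#define N 100000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;               /* local intermediate result */
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += a[i];                /* no interaction inside the loop */
        #pragma omp atomic                /* one shared update per thread */
        total += local;
    }
    printf("total = %g\n", total);
    return 0;
}
```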
Minimizing contention and hotspots
- Accessing shared data concurrently can generate contention
- e.g., Concurrent accesses to the same memory block, flooding a specific process with messages, etc
- Solutions:
- Restructure the program to reorder accesses in a way that does not lead to contention
- Decentralize the shared data, to eliminate the single point of contention
Overlap computations with interactions
- Process may idle waiting for shared data => do useful computations while waiting
- Strategies:
- Initiate an interaction earlier than necessary, so it’s ready when needed
- Grab more tasks in advance, before current task is completed
- May be supported in software, or hardware
- Harder to implement with shared memory models, applies more to distributed and GPU architectures
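A hedged MPI sketch of overlapping computation with interaction: a nonblocking ring exchange is initiated early, independent local work proceeds while the messages are in flight, and the process waits only when the received data is actually needed. The ring pattern and buffer sizes are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, size;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    /* start the interaction earlier than necessary (nonblocking ring exchange) */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, (rank - 1 + size) % size, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, (rank + 1) % size, 0,
              MPI_COMM_WORLD, &reqs[1]);

    /* useful computation that does not need the incoming data */
    for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* data is ready from here on */
    local += recvbuf[0];

    printf("rank %d: local = %g\n", rank, local);
    MPI_Finalize();
    return 0;
}
```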
Replicate data or computations
- To reduce contention for shared data, may be useful to replicate it on each process => no interaction overheads
- Beneficial if the shared data is accessed in a read-only fashion
- Shared-address space paradigm: cache local copies of the data
- Message-passing paradigm: replicate the data on each process
- Disadvantages:
- Increases overall memory usage by keeping replicas => use sparingly
- If the shared data is read-write, the copies must be kept coherent => the overheads might dwarf the benefits of local access via replication
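A minimal MPI sketch of replicating read-only data: the vector b from the matrix-vector example is broadcast once so every process keeps its own full copy, after which all accesses are local. Illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv) {
    int rank;
    double b[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (int i = 0; i < N; i++) b[i] = i + 1;   /* root owns the original data */

    /* one-time broadcast: afterwards every rank reads its local replica */
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local_use = b[0] + b[N - 1];             /* read-only access, no interaction */
    printf("rank %d: b[0]+b[%d] = %g\n", rank, N - 1, local_use);

    MPI_Finalize();
    return 0;
}
```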