Models
- Shared Memory: Pthreads, OpenMP
- Distributed Memory: MPI
- SIMD and Vector: CUDA
- Hybrid: A mix of the above
Parallel Algorithm Design: Outline
- Tasks: Decomposition, Task Dependency, Granularity, Interaction, Mapping, Balance
- Decomposition techniques
- Mapping techniques to reduce parallelism overhead
- Parallel algorithm models
- Parallel program performance models
Decomposition and Tasks
- Decomposition: Dividing the computation in a program into tasks that can be executed in parallel
- Task: Unit of computation that can be extracted from the main program and assigned to a process, and which can be run concurrently with other tasks
- The way tasks are extracted, and how they are mapped to processes, affects performance
Parallel task decomposition
- Tasks can range from individual instructions to entire programs
- Which one is best?
- The answer is always “it depends” on the specific application and the parallel platform
- e.g. Matrix-vector multiplication
- Multiply a 4 × 4 dense matrix A with a vector b of size 4 => calculate A × b = c
- Say that computing each output element is a task (T0–T3)
- Consider what each task needs, and if there are data dependencies
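A minimal sketch (not from the notes) of this decomposition in C: each output element c[i] is one task T0..T3, and the loop body shows what data each task reads and writes.

```c
#include <stdio.h>

#define N 4

/* Sketch of the per-output-element decomposition: computing each c[i] is one
 * task. Each task reads row i of A and all of b, and writes only c[i], so the
 * tasks have no dependencies on each other. */
int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double b[N] = {1, 1, 1, 1}, c[N];

    for (int i = 0; i < N; i++) {      /* task T_i: could run concurrently with the others */
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * b[j];     /* reads row i of A and all of b */
        c[i] = sum;                    /* writes only c[i] */
    }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```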
Task dependencies
- Tasks are not independent if they have dependencies on other tasks
- A task might need data produced by other tasks => must wait until input is ready
- Dependencies create an ordering of task execution => task dependency graph
- Directed acyclic graph (DAG): tasks as nodes, dependencies as edges
- Start nodes: no incoming edges;
- Finish nodes: no outgoing edges
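A small illustrative sketch (an assumption, not the lecture's code) of representing a task dependency DAG: tasks are nodes, dependencies are edges, and the in-degree tells us which tasks are start nodes and ready to run.

```c
#include <stdio.h>

#define NTASKS 4

/* Task dependency DAG stored as an edge list plus in-degree counts.
 * A task with in-degree 0 is a start node and is ready to run; finishing a
 * task would decrement the in-degree of its successors. */
int main(void) {
    /* edges[e][0] -> edges[e][1]: task 0 feeds tasks 1 and 2, which feed task 3 */
    int edges[][2] = {{0,1}, {0,2}, {1,3}, {2,3}};
    int nedges = 4, indeg[NTASKS] = {0};

    for (int e = 0; e < nedges; e++)
        indeg[edges[e][1]]++;

    for (int t = 0; t < NTASKS; t++)
        printf("task %d: in-degree %d%s\n", t, indeg[t],
               indeg[t] == 0 ? " (start node, ready)" : "");
    return 0;
}
```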
Granularity
- Granularity: determined by how many tasks and what their sizes are
- Coarse-grained: A small number of large tasks
- Fine-grained: A large number of small tasks
- e.g. Matrix-vector multiplication
- Fine-grained: Each task = process a single element of c
- Coarse-grained: Each task = process half the elements of c
- Communication between tasks may or may not be necessary
- Ideal parallelism: no communication needed
- Coarse-grained parallelism: Lots of computation performed before communication is necessary
- Good match for message-passing environments
- Fine-grained parallelism: Frequent communication may be necessary
- More suitable for shared memory environments (Pthreads, OpenMP)
- Parallelism granularity = How much processing is performed before communication is necessary between processes
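A hedged sketch of the two granularities using OpenMP chunk sizes, assuming one loop chunk corresponds to one task; the computation itself is a placeholder.

```c
#include <stdio.h>

#define N 8

/* Illustrative only: the same N-element computation decomposed at two
 * granularities by varying the OpenMP chunk size (here one chunk = one task). */
int main(void) {
    double c[N];

    #pragma omp parallel for schedule(static, 1)    /* fine-grained: 1 element per task */
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * i;

    #pragma omp parallel for schedule(static, N/2)  /* coarse-grained: half of c per task */
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * i;

    printf("c[0] = %g, c[N-1] = %g\n", c[0], c[N - 1]);
    return 0;
}
```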
Degree of concurrency
- Maximum degree of concurrency = Max number of tasks that can be executed simultaneously at any given time
- Typically less than the total number of tasks, if tasks have dependencies
- Average degree of concurrency = Average number of tasks that can be executed concurrently during the program’s execution
- Critical Path = The longest path between any pair of start and finish nodes
- Critical path length = The sum of node weights along the critical path
- Average degree of concurrency = $\frac{\text{total \# tasks}}{\text{critical path length}}$
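A small worked example (the numbers are illustrative, assuming unit task weights so the critical path length equals the number of tasks on it):

```latex
% Example DAG: 7 unit-weight tasks, longest start-to-finish path contains 3 tasks
\[
\text{critical path length} = 3,
\qquad
\text{average degree of concurrency}
  = \frac{\text{total \# tasks}}{\text{critical path length}}
  = \frac{7}{3} \approx 2.33
\]
```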
Granularity and concurrency
- The finer-grained the decomposition, the more concurrency is available
- More concurrency => more potential tasks to run in parallel
- If so, can we reduce program execution time simply by making the task decomposition ever more fine-grained?
- Not quite true
- Inherent limits to fine-grained decomposition, e.g., hitting indivisible tasks, or tasks which cause slowdown if split up
- For example, if a task multiplies one element of A with one element of b to store a partial value of an element of c, then all tasks working on the first row of A have to interact!
- More tasks => Potentially more dependencies => More overhead
Task Interactions
- A task dependency graph only captures producer-consumer interactions
- A task’s output is used as another task’s input
- Interactions might occur among tasks that are independent in the task dependency graph
- Tasks on different processors might need to exchange data or synchronize
- Tasks may share data via task interactions
- Read-only interactions: Tasks only need to read data shared with other tasks
- Read-write interactions: Tasks can read or write data shared with other tasks
- Type of sharing can affect which tasks should get mapped to which processes
- Read-write interactions should be kept on the same process as much as possible
Mapping tasks to processes
- Mapping = Assigning tasks to processes
- The choice of decomposition affects the ability to select a good mapping
- Goals of a good mapping:
- Maximize the use of concurrency
- Minimize the total completion time
- Minimize interaction among processes
- Often, the task decomposition and mapping can result in conflicting goals
- Must find a good balance to optimize for all goals
- Degree of concurrency is affected by decomposition choice
- Mapping affects how much of the concurrency can be efficiently utilized
Tasks Size and Balance
- Task size = Proportional to time needed to complete the task
- Uniform tasks: Require roughly the same amount of time
- Non-uniform tasks: Execution times vary widely
- Size of data associated with tasks = How much data does each task process
- Impacts whether the tasks are well-balanced
- Affects performance if a task’s data must be moved from a remote processor
- Input data might be small but output data large, or vice-versa, etc.
Decomposition Techniques
- Recursive Decomposition: Primarily decomposes tasks
- Data decomposition: Partitions the data to induce task decomposition
Recursive Decomposition
- Recursive decomposition is primarily based on task decomposition
- Useful for problems that can be approached using a divide-and-conquer strategy
- Divide the problem into subproblems, solve the subproblems by subdividing them recursively in the same way, and combine the results
- Subproblems can be solved concurrently
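A minimal divide-and-conquer sketch using OpenMP tasks, assuming an array-sum problem and a CUTOFF below which subproblems are solved serially; both halves are solved concurrently and the results combined.

```c
#include <stdio.h>

#define CUTOFF 1000   /* below this size, solve serially (granularity limit) */

/* Divide-and-conquer sum: split the range, solve the halves concurrently as
 * tasks, combine the partial results. Illustrative sketch only. */
static long sum(const int *a, int lo, int hi) {
    if (hi - lo <= CUTOFF) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)     /* left subproblem as a task */
    left = sum(a, lo, mid);
    #pragma omp task shared(right)    /* right subproblem as a task */
    right = sum(a, mid, hi);
    #pragma omp taskwait              /* combine step waits for both halves */
    return left + right;
}

int main(void) {
    enum { N = 100000 };
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = 1;

    long total;
    #pragma omp parallel
    #pragma omp single                /* one thread starts the recursion */
    total = sum(a, 0, N);

    printf("sum = %ld\n", total);
    return 0;
}
```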
Data Decomposition
- Partition the data on which computations are performed
- Use the data partitioning to perform the decomposition of computation into tasks
- Used to exploit concurrency on problems that operate on large data
- Data decomposition is typically performed in two stages:
- Step 1: Partition the data
- Step 2: Induce task decomposition from the partitioned data
- Data partitioning comes in different flavors:
- Partition output data
- Partition input data
- Partition both input and output data
- e.g. Matrix-vector example
- Partition output data
- Each element of the output can be computed independently; in this case, this induces a partitioning of the input matrix as well
- Then decompose the computation into tasks:
- Option 1: 4 tasks, each computes 2 consecutive elements of the result
- Option 2: 2 tasks, each computes 4 consecutive elements of the result (Coarser-grained)
- Option 3: 4 tasks, each computes 2 non-consecutive elements of c
- Output data partitioning: Typically good if parts of the output can be naturally computed as a function of the input data
- Partition input data
- Matrix-vector example (a sketch contrasting the two partitionings follows this list):
- Row-wise partitioning of A; partition b similarly
- If each task takes one row of A and one item of b, are there any task dependencies?
- If we want a task to compute one element of c, then tasks must exchange data to get all of b
- Column-wise partitioning of A
- Tasks don’t need to exchange data, but they have to synchronize because one element of c is computed with the help of all tasks
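A hedged shared-memory sketch of the two input partitionings above: row-wise tasks produce complete elements of c independently, while column-wise tasks each contribute a partial value to every element of c and must synchronize their updates.

```c
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double b[N] = {1, 1, 1, 1}, c[N] = {0}, c2[N] = {0};

    /* Row-wise partitioning: task i owns row i of A (and all of b); iterations
     * are independent, so no exchange or synchronization is needed. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = 0.0;
        for (int j = 0; j < N; j++) c[i] += A[i][j] * b[j];
    }

    /* Column-wise partitioning: task j owns column j of A and b[j], and adds a
     * partial value to every element of c, so the shared updates need atomics. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            #pragma omp atomic
            c2[i] += A[i][j] * b[j];
        }

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g  c2[%d] = %g\n", i, c[i], i, c2[i]);
    return 0;
}
```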
Mapping Techniques
Why care about mapping tasks? What if we just randomly assign tasks to processors?
- An efficient task mapping is critical to minimize parallel processing overheads
- Load imbalance
- Inter-process communication: Culprits are synchronization and data sharing
Mapping tasks to processes
- Mapping goal: All tasks must complete in the shortest possible time
- To do so, minimize overheads of task execution
- 1. Load Balancing: Minimize the time spent idle by some processes
- 2. Minimize the time spent in interactions among processes
- The two goals can be conflicting
- To optimize 2, put interacting tasks on the same processor
- Can lead to load imbalance and idling
- To optimize 1, break down tasks into fine-grained pieces, to ensure good load balance
- Can lead to a lot more interaction overheads
- Must carefully balance the two goals in the context of the problem
Static Mapping
- Static Mapping: Assign tasks to process before execution starts
- Static mapping allows for static load balancing
- Mapping quality depends on knowledge of task sizes, size of data associated with tasks, characteristics of task interactions, and parallel programming paradigm
- If task sizes not known => Can potentially lead to severe load imbalances
- Usually done with static and uniform partitioning of data: Data parallel problems!
- Tasks are tied to chunks of data generated by the partitioning approach
- Mapping tasks to processes is essentially tied to mapping data to processes
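A minimal sketch (an assumption of how such a mapping might be computed) of a static, uniform block partitioning decided before execution starts: process p gets a contiguous range of the N data items.

```c
#include <stdio.h>

/* Static block partitioning of N data items over P processes. Handles N not
 * divisible by P by giving the first (N % P) processes one extra item. */
static void block_range(int N, int P, int p, int *lo, int *hi) {
    int base = N / P, extra = N % P;
    *lo = p * base + (p < extra ? p : extra);
    *hi = *lo + base + (p < extra ? 1 : 0);   /* half-open range [lo, hi) */
}

int main(void) {
    int N = 10, P = 4;
    for (int p = 0; p < P; p++) {
        int lo, hi;
        block_range(N, P, p, &lo, &hi);
        printf("process %d: items [%d, %d)\n", p, lo, hi);
    }
    return 0;
}
```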
Dynamic Mapping
- Dynamic Mapping: Assign tasks to processes during execution
- Dynamic mapping allows for dynamic load balancing
- If task sizes are unknown => Dynamic mappings are more effective than static ones
- If much more data than computation => Large overheads for data movement => static may be preferable
- Depends on the parallel paradigm and interaction type though (shared address space vs distributed memory, read-only vs read-write interaction)
- Common scheme for dynamic mapping
- Keep tasks in a centralized pool of tasks, assign them as processes become idle
- The process managing the pool of ready tasks => Master process
- Other processes performing the tasks => Worker Processes
- Tasks may get added to the pool, concurrently with the workers taking tasks out
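A hedged MPI sketch of the centralized task-pool scheme: rank 0 is the master managing the pool, the other ranks are workers that request a task whenever they become idle. NUM_TASKS, the message tags, and do_task() are placeholders, not from the notes.

```c
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS   100
#define TAG_REQUEST 1
#define TAG_ASSIGN  2
#define TAG_STOP    3

static void do_task(int id) { (void)id; /* placeholder for real work */ }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master: owns the task pool */
        int next = 0, done = 0;
        while (done < size - 1) {
            int dummy;
            MPI_Status st;
            /* wait for any idle worker to ask for work */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                         MPI_COMM_WORLD);
                next++;
            } else {                          /* pool empty: tell worker to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                done++;
            }
        }
    } else {                                  /* workers: ask for tasks until told to stop */
        while (1) {
            int task;
            MPI_Status st;
            MPI_Send(&rank, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            do_task(task);
        }
    }
    MPI_Finalize();
    return 0;
}
```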
Methods for containing interaction overheads
- Maximizing data locality
- Minimizing contention and hot spots
- Overlapping computations with interactions
- Replicating data or computations
Maximizing data locality
- Processes share data and/or may require data generated by other processes
- Goals:
- Minimize the volume of interaction overheads (Minimize non-local data accesses and maximize local data utilization)
- Minimize the volume of shared data and maximize cache reuse
- Use a suitable decomposition and mapping
- Use local data to store intermediate results (decreases the amount of data exchange)
- Minimize the frequency of interactions
- Restructure the algorithm to access and use large chunks of shared data (amortize interaction cost by reducing the frequency of interactions)
- Shared address space: Spatial locality in a cache line
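A small OpenMP sketch of "use local data to store intermediate results": each thread accumulates into a private partial sum and touches the shared total only once, reducing both the volume and the frequency of interactions. Illustrative only.

```c
#include <stdio.h>

#define N 100000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;               /* local intermediate result */
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += a[i];                /* no interaction inside the loop */
        #pragma omp atomic                /* one shared update per thread */
        total += local;
    }
    printf("total = %g\n", total);
    return 0;
}
```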
Minimizing contention and hotspots
- Accessing shared data concurrently can generate contention
- e.g., Concurrent accesses to the same memory block, flooding a specific process with messages, etc
- Solutions:
- Restructure the program to reorder accesses in a way that does not lead to contention
- Decentralize the shared data, to eliminate the single point of contention
Overlap computations with interactions
- Process may idle waiting for shared data => do useful computations while waiting
- Strategies:
- Initiate an interaction earlier than necessary, so it’s ready when needed
- Grab more tasks in advance, before current task is completed
- May be supported in software, or hardware
- Harder to implement with shared memory models, applies more to distributed and GPU architectures
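A hedged MPI sketch of overlapping computation with interaction: a nonblocking ring exchange is initiated early, independent local work proceeds while the messages are in flight, and the process waits only when the received data is actually needed. The ring pattern and buffer sizes are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    int rank, size;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) sendbuf[i] = rank;

    /* start the interaction earlier than necessary (nonblocking ring exchange) */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, (rank - 1 + size) % size, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, (rank + 1) % size, 0,
              MPI_COMM_WORLD, &reqs[1]);

    /* useful computation that does not need the incoming data */
    for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* data is ready from here on */
    local += recvbuf[0];

    printf("rank %d: local = %g\n", rank, local);
    MPI_Finalize();
    return 0;
}
```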
Replicate data or computations
- To reduce contention for shared data, may be useful to replicate it on each process => no interaction overheads
- Beneficial if the shared data is accessed in a read-only fashion
- Shared-address space paradigm: cache local copies of the data
- Message-passing paradigm: replicate the data on each process
- Disadvantages:
- Increases overall memory usage by keeping replicas => use sparingly
- If the shared data is read-write, the copies must be kept coherent => the overheads might dwarf the benefits of local access via replication
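A minimal MPI sketch of replicating read-only data: the vector b from the matrix-vector example is broadcast once so every process keeps its own full copy, after which all accesses are local. Illustrative only.

```c
#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv) {
    int rank;
    double b[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (int i = 0; i < N; i++) b[i] = i + 1;   /* root owns the original data */

    /* one-time broadcast: afterwards every rank reads its local replica */
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local_use = b[0] + b[N - 1];             /* read-only access, no interaction */
    printf("rank %d: b[0]+b[%d] = %g\n", rank, N - 1, local_use);

    MPI_Finalize();
    return 0;
}
```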