# Mapping skeleton workflows onto heterogeneous platforms

### Anne Benoit and Yves Robert

GRAAL team, LIP, École Normale Supérieure de Lyon

July 2007



POP'07


### Introduction and motivation

- Mapping applications onto parallel platforms: difficult challenge
- Heterogeneous clusters, fully heterogeneous platforms: even more difficult!
- Structured programming approach
  - Easier to program (deadlocks, process starvation)
  - Range of well-known paradigms (pipeline, farm)
  - Algorithmic skeleton: help for mapping

Mapping skeletons (pipeline, fork) onto heterogeneous platforms



# Rule of the game

- Map each pipeline stage on a single processor (extended later)
- Goal: minimize execution time (extended later)
- Several mapping strategies

Figure: the pipeline application, stages $S_1, S_2, \ldots, S_k, \ldots, S_n$, under a One-to-one Mapping.

### Major contributions

- **Theory**
  - Formal approach to the problem, definition of replication and data-parallelism
  - Problem complexity for several cases
  - Integer linear program for exact resolution
- **Practice**
  - Heuristics for INTERVAL MAPPING on clusters
  - Experiments to compare heuristics and evaluate their absolute performance

### Outline

- Framework
- Working out an example
- Part 1: Communications, monolithic stages, mono-criterion
- Part 2: Simpler model with no communications, but with replication/DP and bi-criteria
- Conclusion



### The application: pipeline graphs

- $n$ stages $S_k$, $1 \le k \le n$
- Stage $S_k$:
  - receives input of size $\delta_{k-1}$ from $S_{k-1}$
  - performs $w_k$ computations
  - outputs data of size $\delta_k$ to $S_{k+1}$

### The application: fork graphs



- n+1 stages  $S_k$ ,  $0 \le k \le n$ 
  - $S_0$ : root stage
  - $S_1$  to  $S_n$ : independent stages
- A data-set goes through stage  $S_0$ , then it can be executed simultaneously for all other stages





### The platform

- $p$ processors $P_u$, $1 \le u \le p$, fully interconnected
- $s_u$: speed of processor $P_u$
- bidirectional link $\text{link}_{u,v}: P_u \to P_v$, bandwidth $b_{u,v}$
- one-port model: each processor can either send, receive or compute at any time-step



# Different platforms

- **Fully Homogeneous**: identical processors ($s_u = s$) and links ($b_{u,v} = b$): typical parallel machines
- **Communication Homogeneous**: different-speed processors ($s_u \neq s_v$), identical links ($b_{u,v} = b$): networks of workstations, clusters
- **Fully Heterogeneous**: fully heterogeneous architectures, $s_u \neq s_v$ and $b_{u,v} \neq b_{u',v'}$: hierarchical platforms, grids



- Consecutive data-sets fed into the workflow
- Period $T_{\text{period}}$ = time interval between the beginning of execution of two consecutive data sets (throughput = $1/T_{\text{period}}$)
- Latency $T_{\text{latency}}(x)$ = time elapsed between the beginning and the end of execution for a given data set $x$, and $T_{\text{latency}} = \max_{x} T_{\text{latency}}(x)$
- Map each pipeline/fork stage on one or several processors
- Goal: minimize  $T_{period}$  or  $T_{latency}$  or bi-criteria minimization






### Stage types

- Monolithic stages: must be mapped on one single processor since computation for a data-set may depend on result of previous computation
- Replicable stages: can be replicated on several processors, but not parallel, i.e. a data-set must be entirely processed on a single processor
- Data-parallel stages: inherently parallel stages, one data-set can be computed in parallel by several processors



### Replication

**Replicate** stage $S_k$ on $P_1, \ldots, P_q$

Example with three processors, between $S_{k-1}$ and $S_{k+1}$:

- $S_k$ on $P_1$: data sets 1, 4, 7, ...
- $S_k$ on $P_2$: data sets 2, 5, 8, ...
- $S_k$ on $P_3$: data sets 3, 6, 9, ...

Properties:

- $S_{k+1}$ may be monolithic: output order must be respected
- Round-robin rule to ensure output order
- Cannot feed fast processors more than slow ones
- Most efficient with similar-speed processors
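The round-robin rule can be illustrated with a minimal Python sketch. The function name `round_robin` and the 1-based numbering of data sets and processors are assumptions made for illustration; they are not taken from the slides.

```python
def round_robin(num_data_sets, q):
    """Assign data sets 1..num_data_sets to q replicas in round-robin order."""
    assignment = {u: [] for u in range(1, q + 1)}
    for x in range(1, num_data_sets + 1):
        # Data set x goes to processor ((x - 1) mod q) + 1.
        assignment[(x - 1) % q + 1].append(x)
    return assignment


print(round_robin(9, 3))  # {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
```

Every replica receives the same number of data sets whatever its speed, which is why the slowest processor bounds the throughput of a replicated stage.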

### Data-parallelism

**Data-parallelize** stage $S_k$ on $P_1, \ldots, P_q$

Example: $S_k$ ($w = 16$) shared among $P_1$ ($s_1 = 2$), $P_2$ ($s_2 = 1$) and $P_3$ ($s_3 = 1$)

- Perfect sharing of the work
- Data-parallelize single stage only
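A small Python sketch of what "perfect sharing" could mean on the example above. It assumes that each processor receives a share of the work proportional to its speed, which is consistent with the cost $w_k / \sum_i s_i$ used later in the talk; the function name `data_parallel_shares` is hypothetical.

```python
def data_parallel_shares(w, speeds):
    """Split w computations so that all processors finish at the same time."""
    total_speed = sum(speeds)
    # Assumption: the share of each processor is proportional to its speed.
    shares = [w * s / total_speed for s in speeds]
    time = w / total_speed
    return shares, time


shares, time = data_parallel_shares(16, [2, 1, 1])
print(shares, time)  # [8.0, 4.0, 4.0] 4.0
```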

### INTERVAL MAPPING for pipeline graphs

- Several consecutive stages onto the same processor
- Increase computational load, reduce communications
- Partition of $[1..n]$ into $m$ intervals $I_j = [d_j, e_j]$ (with $d_j \le e_j$ for $1 \le j \le m$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le m - 1$ and $e_m = n$)
- Interval $I_j$ mapped onto processor $P_{\mathsf{alloc}(j)}$

$$T_{\text{period}} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{\mathsf{b}_{\mathsf{alloc}(j - 1), \mathsf{alloc}(j)}} + \frac{\sum_{i = d_j}^{e_j} \mathsf{w}_i}{\mathsf{s}_{\mathsf{alloc}(j)}} + \frac{\delta_{e_j}}{\mathsf{b}_{\mathsf{alloc}(j), \mathsf{alloc}(j + 1)}} \right\}$$

$$T_{\mathsf{latency}} = \sum_{1 \leq j \leq m} \left\{ \frac{\delta_{d_j - 1}}{\mathsf{b}_{\mathsf{alloc}(j - 1), \mathsf{alloc}(j)}} + \frac{\sum_{i = d_j}^{\mathsf{e}_j} \mathsf{w}_i}{\mathsf{s}_{\mathsf{alloc}(j)}} \right\} + \frac{\delta_\mathsf{n}}{\mathsf{b}_{\mathsf{alloc}(m), \mathsf{alloc}(m + 1)}}$$
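The two formulas above can be evaluated with a few lines of Python. This is a sketch under one simplifying assumption: all links, including those to the outside world ($\mathsf{alloc}(0)$ and $\mathsf{alloc}(m+1)$), share a single bandwidth `b`. The function name and the argument layout are hypothetical.

```python
def interval_mapping_metrics(w, delta, speeds, intervals, alloc, b):
    """Return (period, latency) of an interval mapping.

    w[k-1]    : computations of stage S_k, for k = 1..n
    delta[k]  : size of the data output by S_k (delta[0] is the initial input)
    intervals : list of (d_j, e_j), 1-based and inclusive
    alloc     : alloc[j] is the index in `speeds` of the processor of interval j
    b         : bandwidth of every link (assumption of this sketch)
    """
    n = len(w)
    cycle_times = []
    latency = 0.0
    for (d, e), u in zip(intervals, alloc):
        t_in = delta[d - 1] / b
        t_comp = sum(w[d - 1:e]) / speeds[u]
        t_out = delta[e] / b
        cycle_times.append(t_in + t_comp + t_out)  # term of the max in T_period
        latency += t_in + t_comp                   # term of the sum in T_latency
    latency += delta[n] / b                        # final output communication
    return max(cycle_times), latency


print(interval_mapping_metrics(
    w=[2, 4, 6], delta=[1, 2, 2, 1], speeds=[2, 1],
    intervals=[(1, 2), (3, 3)], alloc=[0, 1], b=1))  # (9.0, 13.0)
```

In this run the second interval dominates the period ($2 + 6 + 1 = 9$), while the latency adds the input and computation times of both intervals plus the last output ($4 + 8 + 1 = 13$).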

Anne.Benoit@ens-lyon.fr Tokyo, July 2007 Mapping skeleton workflows POP'07 15/44


### Simpler problem, replication and data-parallelism

- No communication costs nor overheads
- Cost to execute $S_i$ on $P_u$ alone: $\frac{w_i}{s_u}$
- Cost to data-parallelize $[S_i, S_j]$ ($i = j$ for pipeline; $0 < i \le j$ or $i = j = 0$ for fork) on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{\sum_{u=1}^{k} s_{q_u}}$$

Cost = $T_{\text{period}}$ of assigned processors. Cost = delay to traverse the interval.

- Cost to replicate $[S_i, S_j]$ on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{k \times \min_{1 \leq u \leq k} s_{q_u}}$$

Cost = $T_{\text{period}}$ of assigned processors. Delay to traverse the interval = time needed by slowest processor:

$$t_{\max} = \frac{\sum_{\ell=i}^{j} w_{\ell}}{\min_{1 \leq u \leq k} s_{q_u}}$$

- With these formulas: easy to compute $T_{\text{period}}$ and $T_{\text{latency}}$ for pipeline graphs






# Working out an example

Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

#### Optimal period?

$$T_{\text{period}} = 7, \ \mathcal{S}_1 \to P_1, \ \mathcal{S}_2 \mathcal{S}_3 \to P_2, \ \mathcal{S}_4 \to P_3 \ (T_{\text{latency}} = 17)$$

#### Optimal latency?

$$T_{\text{latency}} = 12, \ \mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \to P_1 \ (T_{\text{period}} = 12)$$

#### Min. latency if $T_{\text{period}} \leq 10$?

$$T_{\text{latency}} = 14, \ \mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \rightarrow P_1, \ \mathcal{S}_4 \rightarrow P_2$$
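These three answers can be checked by exhaustive enumeration. The sketch below rests on assumptions: the pipeline figure is missing from this text, so the stage workloads 14, 4, 2, 4 are inferred from the terms of the period formulas that appear later in the example, and communication costs are taken as zero. All function names are hypothetical.

```python
from itertools import combinations, permutations


def interval_mappings(n, p):
    """Yield every (intervals, processors) pair; intervals are half-open (start, end)."""
    for m in range(1, min(n, p) + 1):
        for cuts in combinations(range(1, n), m - 1):
            bounds = [0, *cuts, n]
            intervals = list(zip(bounds, bounds[1:]))
            # Each interval goes to a distinct processor.
            for procs in permutations(range(p), m):
                yield intervals, procs


def evaluate(w, speeds, intervals, procs):
    """Period and latency of a mapping without communications."""
    times = [sum(w[a:e]) / speeds[u] for (a, e), u in zip(intervals, procs)]
    return max(times), sum(times)


w = [14, 4, 2, 4]        # inferred stage workloads (assumption)
speeds = [2, 1, 1, 1]
results = [evaluate(w, speeds, iv, pr)
           for iv, pr in interval_mappings(len(w), len(speeds))]

best_period = min(period for period, _ in results)
best_latency = min(latency for _, latency in results)
best_latency_bounded = min(lat for period, lat in results if period <= 10)
print(best_period, best_latency, best_latency_bounded)  # 7.0 12.0 14.0
```

Brute force is only practical for tiny instances such as this one; it is used here purely as a check of the worked values.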



Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

**Replicate** interval $[S_u..S_v]$ on $P_1, \ldots, P_q$

- $S_u \ldots S_v$ on $P_1$: data sets 1, 4, 7, ...
- $S_u \ldots S_v$ on $P_2$: data sets 2, 5, 8, ...
- $S_u \ldots S_v$ on $P_3$: data sets 3, 6, 9, ...

$$T_{\text{period}} = \frac{\sum_{k=u}^{v} w_k}{q \times \min_i(s_i)} \quad \text{and} \quad T_{\text{latency}} = q \times T_{\text{period}}$$

**Data-parallelize** single stage $S_k$ on $P_1, \ldots, P_q$

$$T_{\text{period}} = \frac{w_k}{\sum_{i=1}^{q} s_i} \quad \text{and} \quad T_{\text{latency}} = T_{\text{period}}$$

#### Optimal period?

$$\mathcal{S}_1 \xrightarrow{\mathrm{DP}} P_1 P_2, \ \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \xrightarrow{\mathrm{REP}} P_3 P_4$$

$$T_{\text{period}} = \max\left(\frac{14}{2+1}, \frac{4+2+4}{2\times 1}\right) = 5, \ T_{\text{latency}} = 14.67$$

$$\mathcal{S}_1 \xrightarrow{\mathrm{DP}} P_2 P_3 P_4, \ \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \rightarrow P_1$$

$$T_{\text{period}} = \max\left(\frac{14}{1+1+1}, \frac{4+2+4}{2}\right) = 5, \ T_{\text{latency}} = 9.67 \text{ (optimal)}$$
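The two mappings can be re-evaluated numerically with the replication and data-parallelism cost formulas of the simplified model (no communications). The helper names below are hypothetical, and a stage interval mapped on a single processor is treated as a replication on one processor.

```python
def replicate(work, speeds):
    """(period, delay) of an interval of total work `work` replicated on `speeds`."""
    slowest = min(speeds)
    return work / (len(speeds) * slowest), work / slowest


def data_parallelize(work, speeds):
    """(period, delay) of a single stage data-parallelized on `speeds`."""
    t = work / sum(speeds)
    return t, t


def pipeline_metrics(parts):
    """Period is the max of the interval periods, latency the sum of the delays."""
    period = max(p for p, _ in parts)
    latency = sum(d for _, d in parts)
    return period, round(latency, 2)


# S1 data-parallel on P1, P2; S2 S3 S4 replicated on P3, P4
first = pipeline_metrics([data_parallelize(14, [2, 1]), replicate(4 + 2 + 4, [1, 1])])
# S1 data-parallel on P2, P3, P4; S2 S3 S4 on P1 alone
second = pipeline_metrics([data_parallelize(14, [1, 1, 1]), replicate(4 + 2 + 4, [2])])
print(first)   # (5.0, 14.67)
print(second)  # (5.0, 9.67)
```

Both mappings reach the same period; the second one wins on latency because the replicated interval no longer pays the delay of a slow processor.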




# Part 1

- Pipeline graph
- Different platforms, with communications
- Different mapping strategies
- Only monolithic stages: no replication nor data-parallelism
- Mono-criterion: period minimization
- Complexity results, heuristics and experiments





# Complexity results

|                    | Fully Hom.                  | Comm. Hom.                  |
|--------------------|-----------------------------|-----------------------------|
| One-to-one Mapping | polynomial                  | polynomial                  |
| Interval Mapping   | polynomial                  | NP-complete                 |
| General Mapping    | same complexity as Interval | same complexity as Interval |

- Binary search polynomial algorithm for ONE-TO-ONE Mapping
- Dynamic programming algorithm for INTERVAL MAPPING on Hom. platforms (NP-hard otherwise)
- General mapping: same complexity as INTERVAL MAPPING
- All problem instances NP-complete on Fully Heterogeneous platforms




# One-to-one/Comm. Hom.: binary search algorithm

- Work with fastest $n$ processors, numbered $P_1$ to $P_n$, where $s_1 \le s_2 \le \ldots \le s_n$
- Binary search on $T_{\text{period}}$; for each candidate value, run the greedy test below
- Mark all stages $S_1$ to $S_n$ as free
- For $u = 1$ to $n$
  - Pick any free stage $S_k$ s.t. $\delta_{k-1}/b + w_k/s_u + \delta_k/b \leq T_{\text{period}}$
  - Assign $S_k$ to $P_u$, and mark $S_k$ as already assigned
  - If no stage found return "failure"
- Proof: exchange argument
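The greedy test and the surrounding binary search might be sketched as follows in Python. Several points are assumptions of this sketch rather than statements from the slides: the search runs over the finite set of candidate values $\delta_{k-1}/b + w_k/s_u + \delta_k/b$, indices are 0-based, `delta[k]` is the input size of stage `k`, and the platform has at least as many processors as stages. Function names are hypothetical.

```python
def greedy_assignment(period, w, delta, speeds, b):
    """Greedy test: return {stage: processor} if `period` is achievable, else None."""
    n = len(w)
    # Keep the n fastest processors and scan them from slowest to fastest.
    procs = sorted(range(len(speeds)), key=lambda u: speeds[u])[-n:]
    free = list(range(n))
    mapping = {}
    for u in procs:
        fit = [k for k in free
               if delta[k] / b + w[k] / speeds[u] + delta[k + 1] / b <= period]
        if not fit:
            return None            # no free stage fits on this processor
        mapping[fit[0]] = u        # any fitting stage will do
        free.remove(fit[0])
    return mapping


def one_to_one_min_period(w, delta, speeds, b):
    """Binary search over candidate periods (assumes len(speeds) >= len(w))."""
    n = len(w)
    fastest = sorted(speeds)[-n:]
    candidates = sorted({delta[k] / b + w[k] / s + delta[k + 1] / b
                         for k in range(n) for s in fastest})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if greedy_assignment(candidates[mid], w, delta, speeds, b) is not None:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo], greedy_assignment(candidates[lo], w, delta, speeds, b)


period, mapping = one_to_one_min_period([14, 4, 2, 4], [0] * 5, [2, 1, 1, 1], b=1)
print(period, mapping)
```

Scanning processors from slowest to fastest is what makes "pick any free stage" safe: with identical links, a stage that fits on a slow processor also fits on every faster one.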




Target clusters: Comm. hom. platforms and INTERVAL MAPPING

- H1a-GR: random, fixed intervals
- H1b-GRIL: random interval length
- H2-GSW: biggest $\sum w$. Place interval with most computations on fastest processor.
- H3-GSD: biggest $\delta_{in} + \delta_{out}$. Intervals are sorted by communications ($\delta_{in} + \delta_{out}$), where in is the first stage of the interval and (out - 1) the last one.
- H4-GP: biggest period on fastest processor. Balancing computation and communication: processors sorted by decreasing speed $s_u$; for current processor $u$, choose interval with biggest period $(\delta_{in} + \delta_{out})/b + \sum_{i \in Interval} w_i/s_u$.
- H5-BS121: binary search for ONE-TO-ONE MAPPING. Optimal algorithm for ONE-TO-ONE MAPPING; when $p < n$, application cut in fixed intervals of length $L$.
- H6-SPL: splitting intervals. Processors sorted by decreasing speed, all stages to first processor. At each step, select used processor $j$ with largest period, split its interval (give fraction of stages to $j'$): minimize $\max(period(j), period(j'))$ and split if maximum period improved.
- H7a-BSL and H7b-BSC: binary search (longest/closest). Binary search on period $P$: start with stage $s = 1$, build intervals $(s, s')$ fitting on processors. For each $u$, and each $s' \geq s$, compute period $(s..s', u)$ and check whether it is smaller than $P$. **H7a**: maximizes $s'$; **H7b**: chooses the closest period.

#### Plan of experiments

- Assess performance of polynomial heuristics
- Random applications, $n = 1$ to 50 stages
- Random platforms, $p = 10$ and $p = 100$ processors
- $b = 10$ (comm. hom.), proc. speed between 1 and 20
- Relevant parameters: ratios $\frac{\delta}{b}$ and $\frac{w}{s}$
- Average over 100 similar random appli/platform pairs



#### Experiment 1 - balanced comm/comp, hom comm

- $\delta_i = 10$, computation time between 1 and 20
- 10 processors, then 100 processors

#### Experiment 2 - balanced comm/comp, het comm

- communication time between 1 and 100
- computation time between 1 and 20

#### Experiment 3 - large computations

- communication time between 1 and 20
- computation time between 10 and 1000

#### Experiment 4 - small computations

- communication time between 1 and 20
- computation time between 0.01 and 10




#### Summary of experiments

- Heuristics much more efficient than random mappings
- Three dominant heuristics for different cases:
  - Insignificant communications (hom. or small) and many processors: H5-BS121 (ONE-TO-ONE MAPPING)
  - Insignificant communications (hom. or small) and few processors: H7b-BSC (binary search: clever choice where to split)
  - Important communications (het. or big): H6-SPL (splitting; choice relevant for any number of processors)

# Part 2

- Pipeline and fork graphs
- Different platforms, without communications
- INTERVAL MAPPING only
- Replicable stages, and either data-parallelism or not
- Bi-criteria optimization
- Complexity results only

Without data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline | -          |           |             |
| Het. pipeline | Poly (str) |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

With data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline | -          |           |             |
| Het. pipeline | Poly (DP)  |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

Without data-parallelism, Heterogeneous platforms

| Objective     | period       | latency    | bi-criteria |
|---------------|--------------|------------|-------------|
| Hom. pipeline | Poly (*)     | -          | Poly (*)    |
| Het. pipeline | NP-hard (**) | Poly (str) | NP-hard     |
| Hom. fork     | Poly (*)     |            |             |
| Het. fork     | NP-hard      | -          | -           |

With data-parallelism, Heterogeneous platforms

| Objective     | period  | latency | bi-criteria |
|---------------|---------|---------|-------------|
| Hom. pipeline | NP-hard |         |             |
| Het. pipeline |         | -       |             |
| Hom. fork     |         | NP-hard |             |
| Het. fork     |         | -       |             |

### Most interesting case: Without data-parallelism, Heterogeneous platforms

| Objective     | period       | latency    | bi-criteria |
|---------------|--------------|------------|-------------|
| Hom. pipeline | Poly (*)     | -          | Poly (*)    |
| Het. pipeline | NP-hard (**) | Poly (str) | NP-hard     |
| Hom. fork     |              | Poly (*)   |             |
| Het. fork     | NP-hard      | -          | -           |

### No data-parallelism, *Heterogeneous* platforms

- For pipeline, minimizing the latency is straightforward: map all stages on the fastest processor
- Minimizing the period is NP-hard for a general pipeline (involved reduction, similar to the heterogeneous chain-to-chain one)
- Homogeneous pipeline (all stages have the same workload $w$): polynomial complexity
- Polynomial bi-criteria algorithm for homogeneous pipeline




### Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform

#### Lemma

If an optimal solution which minimizes the pipeline period uses $q$ processors, consider the $q$ fastest processors $P_1, ..., P_q$, ordered by non-decreasing speeds: $s_1 \leq ... \leq s_q$.

There exists an optimal solution which replicates intervals of stages onto $k$ intervals of processors $I_r = [P_{d_r}, P_{e_r}]$, with $1 \le r \le k \le q$, $d_1 = 1$, $e_k = q$, and $e_r + 1 = d_{r+1}$ for $1 \le r < k$.

Proof: exchange argument, which does not increase latency



### Binary-search/Dynamic programming algorithm

- Given latency L, given period K
- Loop on number of processors q
- Dynamic programming algorithm to minimize latency
- Success if L is obtained
- Binary search on L to minimize latency for fixed period
- Binary search on K to minimize period for fixed latency
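The last step above (binary search on K for a fixed latency) might be sketched as follows. The sketch makes several assumptions that the slides do not state: the period is searched over a finite, explicitly supplied list of candidate values, the oracle `min_latency(K)` (a hypothetical callable standing for the dynamic programming step) is non-increasing in K, and the name `smallest_feasible_period` is invented for illustration.

```python
def smallest_feasible_period(candidates, min_latency, latency_bound):
    """Binary search for the smallest candidate period K such that
    min_latency(K) <= latency_bound.

    Assumptions (not from the slides): `candidates` is a finite list of
    possible periods, and min_latency is non-increasing in K, so that
    feasibility is monotone. Returns None if no candidate is feasible.
    """
    cands = sorted(set(candidates))
    lo, hi = 0, len(cands) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if min_latency(cands[mid]) <= latency_bound:
            best = cands[mid]   # feasible: try a smaller period
            hi = mid - 1
        else:
            lo = mid + 1        # infeasible: a larger period is needed
    return best
```

The symmetric search on L for a fixed period follows the same pattern with the roles of the two criteria exchanged.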



### Dynamic programming algorithm

- Compute L(n, 1, q), where L(m, i, j) = minimum latency to map m pipeline stages on processors  $P_i$  to  $P_j$ , while fitting in period K.

$$L(m, i, j) = \min_{\substack{1 \le m' < m \\ i \le k < j}} \begin{cases} \frac{m.w}{s_i} \quad \text{if } \frac{m.w}{(j-i).s_i} \le K & (1) \\ L(m', i, k) + L(m - m', k + 1, j) & (2) \end{cases}$$

- Case (1): replicating m stages onto processors  $P_i, ..., P_j$
- Case (2): splitting the interval

#### Initialization:

$$L(1, i, j) = \begin{cases} \frac{w}{s_i} & \text{if } \frac{w}{(j-i).s_i} \leq K \\ +\infty & \text{otherwise} \end{cases}$$

$$L(m, i, i) = \begin{cases} \frac{m.w}{s_i} & \text{if } \frac{m.w}{s_i} \leq K \\ +\infty & \text{otherwise} \end{cases}$$
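The recurrence and its initialization can be sketched as a memoized function. This is a hedged illustration, not the authors' implementation: the name `min_latency` and its argument layout are assumptions. It also counts the processors in $P_i, ..., P_j$ as j - i + 1, which agrees with the L(m, i, i) initialization but differs from the (j - i) factor written on the slide; that choice is an assumption of the sketch.

```python
import math
from functools import lru_cache


def min_latency(n, w, speeds, K):
    """Return L(n, 1, q): minimum latency to map n stages of workload w on
    processors P_1..P_q (speeds sorted in non-decreasing order) under period K.

    Returns math.inf when no mapping fits in period K.
    """
    q = len(speeds)

    @lru_cache(maxsize=None)
    def L(m, i, j):
        s = speeds[i - 1]      # P_i is the slowest processor of P_i..P_j
        procs = j - i + 1      # ASSUMPTION: slide writes (j - i) here
        # Case (1): replicate the m stages onto P_i..P_j.
        best = m * w / s if m * w / (procs * s) <= K else math.inf
        # Case (2): split stages and processors into two sub-intervals.
        # Both loops are empty when m == 1 or i == j (the initialization).
        for k in range(i, j):
            for m_left in range(1, m):
                best = min(best, L(m_left, i, k) + L(m - m_left, k + 1, j))
        return best

    return L(n, 1, q)
```

With two stages of workload 6, speeds `[1, 2]` and K = 6, the best option is to put one stage on each processor, for a latency of 6/1 + 6/2 = 9; with K = 3 no mapping fits and the result is infinite.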



- Complexity of the dynamic programming:  $O(n^2.p^4)$
- The number of binary-search iterations is formally bounded, and very small in practice.



- Framework
- Working out an example
- Part 1: Communications, monolithic stages, mono-criterion
- Part 2: Simpler model with no communications, but with replication/DP and bi-criteria
- Conclusion



### Related work

- Subhlok and Vondran: our work extends theirs (pipeline on homogeneous platforms)
- Chains-to-chains: our work adds the possibility to replicate or data-parallelize
- Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.]
- Energy-aware mapping of pipelined computations [Melhem et al.], three-criteria optimization
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.]. Fault-tolerance for embedded systems [Zhu et al.]
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.]



### Conclusion

#### Theoretical side

- Complexity results for several cases
- Solid theoretical foundation for study of single/bi-criteria mappings, with possibility to replicate and data-parallelize application stages

#### Practical side

- Optimal polynomial algorithms, heuristics for NP-hard instances of the problem
- Experiments: Comparison of heuristics performance
- Linear program to assess the absolute performance of the heuristics, which turns out to be quite good



### Future work

#### Short term

- Heuristics for Fully Heterogeneous platforms and other NP-hard instances of the problem
- Extension to DAG-trees (a DAG which is a tree when un-oriented)

#### Longer term

- Heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels
- Real experiments on heterogeneous clusters, using an already-implemented skeleton library and MPI
- Comparison of effective performance against theoretical performance



### Open problems

- Replication for fault-tolerance vs replication for parallelism
  - compute the same data set several times in case of failure
  - uses more resources and does not decrease period or latency
  - increases robustness
- Energy savings
  - processors that can run at different frequencies
  - trade-off between energy consumption and speed
- Simultaneous execution of several (concurrent) workflows
  - competition for CPU and network resources
  - fairness between applications (stretch)
  - sensitivity to application/platform parameter changes

