# Mapping skeleton workflows onto heterogeneous platforms

#### Anne Benoit, Veronika Rehn, Yves Robert

#### GRAAL team, LIP École Normale Supérieure de Lyon

December 2007

- Mapping applications onto parallel platforms: a difficult challenge
- Heterogeneous clusters, fully heterogeneous platforms: even more difficult!
- Structured programming approach
  - Easier to program (avoids deadlocks and process starvation)
  - Range of well-known paradigms (pipeline, farm)
  - Algorithmic skeleton: help for mapping


#### The pipeline application

- Workflow: several consecutive data sets enter the pipeline
- Map each pipeline stage on a single processor (extended later)
- Goal: maximize throughput (extended later)
- Several mapping strategies



#### Major contributions

**Theory**

- Formal approach to the problem, definition of replication and data-parallelism
- Problem complexity for several cases
- Integer linear program for exact resolution

**Practice**

- Heuristics for INTERVAL MAPPING on clusters
- Experiments to compare heuristics and evaluate their absolute performance


## Outline

1. Framework
2. Working out an example
3. Part 1 - Communications, monolithic stages, mono-criterion
4. Part 2 - Simpler model with no communications, but with replication/DP and bi-criteria
5. Conclusion

#### The application: pipeline graphs



- $n$ stages $\mathcal{S}_k$, $1 \leq k \leq n$
- $\mathcal{S}_k$:
  - receives input of size $\delta_{k-1}$ from $\mathcal{S}_{k-1}$
  - performs $w_k$ computations
  - outputs data of size $\delta_k$ to $\mathcal{S}_{k+1}$

## The application: fork graphs



- $n + 1$ stages $\mathcal{S}_k$, $0 \leq k \leq n$
  - $\mathcal{S}_0$: root stage
  - $\mathcal{S}_1$ to $\mathcal{S}_n$: independent stages
- A data set first goes through stage $\mathcal{S}_0$; the other stages can then process it simultaneously

# The platform



- $p$ processors $P_u$, $1 \le u \le p$, fully interconnected
- $s_u$: speed of processor $P_u$
- bidirectional link $\text{link}_{u,v}$: $P_u \rightarrow P_v$, bandwidth $b_{u,v}$
- one-port model: at any time-step, each processor can either send, receive or compute

## Different platforms

- *Fully Homogeneous*: identical processors ($s_u = s$) and links ($b_{u,v} = b$): typical parallel machines
- *Communication Homogeneous*: different-speed processors ($s_u \neq s_v$), identical links ($b_{u,v} = b$): networks of workstations, clusters
- *Fully Heterogeneous*: fully heterogeneous architectures, $s_u \neq s_v$ and $b_{u,v} \neq b_{u',v'}$: hierarchical platforms, grids

- Consecutive data sets fed into the workflow
- Period $T_{\text{period}}$ = time interval between the beginning of execution of two consecutive data sets (throughput $= 1/T_{\text{period}}$)
- Latency $T_{\text{latency}}(x)$ = time elapsed between the beginning and the end of execution for a given data set $x$, and $T_{\text{latency}} = \max_x T_{\text{latency}}(x)$
- Map each pipeline/fork stage on one or several processors
- Goal: minimize $T_{\text{period}}$ or $T_{\text{latency}}$, or bi-criteria minimization


- Monolithic stages: must be mapped on one single processor since computation for a data-set may depend on result of previous computation
- Replicable stages: can be replicated on several processors, but not parallel, *i.e.* a data-set must be entirely processed on a single processor
- Data-parallel stages: inherently parallel stages, one data-set can be computed in parallel by several processors

#### Replication

**Replicate** stage  $S_k$  on  $P_1, \ldots, P_q$ 

- $S_{k+1}$  may be monolithic: output order must be respected
- Round-robin rule to ensure output order
- Round-robin feeds every replica at the same rate, so fast processors cannot receive more data sets than slow ones
- Most efficient with similar-speed processors


#### Data-parallelism

**Data-parallelize** stage $\mathcal{S}_k$ on $P_1, \ldots, P_q$

- Perfect sharing of the work
- Data-parallelize single stage only


#### INTERVAL MAPPING for pipeline graphs

- Several consecutive stages onto the same processor
- Increase computational load, reduce communications
- Partition of $[1..n]$ into $m$ intervals $I_j = [d_j, e_j]$ (with $d_j \le e_j$ for $1 \le j \le m$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le m - 1$ and $e_m = n$)
- Interval $I_j$ mapped onto processor $P_{\text{alloc}(j)}$

$$T_{\text{period}} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1), \text{alloc}(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} + \frac{\delta_{e_j}}{b_{\text{alloc}(j), \text{alloc}(j+1)}} \right\}$$

$$T_{\text{latency}} = \sum_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1), \text{alloc}(j)}} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} \right\} + \frac{\delta_n}{b_{\text{alloc}(m), \text{alloc}(m+1)}}$$
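The two formulas can be evaluated directly for a given mapping. The following Python sketch is only an illustration under assumptions the slide does not state: `alloc` has $m + 2$ entries, so that `alloc[0]` and `alloc[m + 1]` name hypothetical processors holding the input and collecting the output, and `bandwidth(u, v)` is an illustrative callable returning $b_{u,v}$. The numeric values are made up.

```python
def interval_mapping_costs(w, delta, intervals, alloc, speed, bandwidth):
    """Evaluate (T_period, T_latency) for one interval mapping.

    w[k - 1]   : computations of stage k (stages are numbered 1..n)
    delta[k]   : size of the data output by stage k (delta[0] is the input)
    intervals  : list of (d_j, e_j) pairs, 1-indexed and inclusive
    alloc      : alloc[j] is the processor of interval j (1..m); alloc[0] and
                 alloc[m + 1] are assumed source and sink processors
    speed      : mapping processor -> speed s_u
    bandwidth  : callable (u, v) -> b_{u,v}
    """
    n, m = len(w), len(intervals)
    period, latency = 0.0, 0.0
    for j, (d, e) in enumerate(intervals, start=1):
        t_in = delta[d - 1] / bandwidth(alloc[j - 1], alloc[j])
        t_comp = sum(w[d - 1:e]) / speed[alloc[j]]
        t_out = delta[e] / bandwidth(alloc[j], alloc[j + 1])
        # The period is the largest cycle-time over all intervals.
        period = max(period, t_in + t_comp + t_out)
        # The latency counts each incoming communication and computation once.
        latency += t_in + t_comp
    # Final output communication, counted once at the end.
    latency += delta[n] / bandwidth(alloc[m], alloc[m + 1])
    return period, latency


# Illustrative values only (not taken from the slides).
w = [14, 4, 2, 4]
delta = [10, 0, 0, 0, 10]
speed = {"P1": 2, "P2": 1, "P3": 1}
alloc = ["in", "P1", "P2", "P3", "out"]
period, latency = interval_mapping_costs(
    w, delta, [(1, 1), (2, 3), (4, 4)], alloc, speed, lambda u, v: 10
)
```

Passing the bandwidth as a callable keeps the sketch usable for both *Communication Homogeneous* platforms (a constant) and *Fully Heterogeneous* ones (a lookup).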


#### Simpler problem, replication and data-parallelism

- No communication costs nor overheads
- Cost to execute $\mathcal{S}_i$ on $P_u$ alone: $\frac{w_i}{s_u}$
- Cost to data-parallelize $[\mathcal{S}_i, \mathcal{S}_j]$ ($i = j$ for pipeline; $0 < i \le j$ or $i = j = 0$ for fork) on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{\sum_{u=1}^{k} s_{q_{u}}}$$

This cost is both the $T_{\text{period}}$ of the assigned processors and the delay to traverse the interval.


- Cost to replicate $[\mathcal{S}_i, \mathcal{S}_j]$ on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{k \times \min_{1 \le u \le k} s_{q_u}}$$

This cost is the $T_{\text{period}}$ of the assigned processors. The delay to traverse the interval is the time needed by the slowest processor:

$$t_{\max} = \frac{\sum_{\ell=i}^{j} w_{\ell}}{\min_{1 \le u \le k} s_{q_u}}$$

With these formulas, it is easy to compute $T_{\text{period}}$ and $T_{\text{latency}}$ for pipeline graphs.
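As a small illustration, the cost formulas of the simplified model translate into three one-line functions. The function names and the sample workload below are hypothetical; the comparison at the end only illustrates why replication is most efficient with similar-speed processors.

```python
def data_parallel_cost(work, speeds):
    """Period (and traversal delay) of an interval data-parallelized on `speeds`."""
    return sum(work) / sum(speeds)


def replication_period(work, speeds):
    """Period of an interval replicated round-robin on `speeds`."""
    return sum(work) / (len(speeds) * min(speeds))


def replication_delay(work, speeds):
    """Traversal delay t_max of a replicated interval (slowest processor)."""
    return sum(work) / min(speeds)


work = [4, 2, 4]                               # illustrative workload
similar = replication_period(work, [1, 1])     # two equal-speed replicas
mixed = replication_period(work, [2, 1])       # fast replica paired with a slow one
alone = replication_period(work, [2])          # the fast processor on its own
```

With these values, `mixed` equals `alone`: adding the speed-1 replica to the speed-2 processor does not improve the period, because round-robin is paced by the slowest replica.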


#### Mono-criterion

- Minimize $T_{\text{period}}$
- Minimize $T_{\text{latency}}$

#### Bi-criteria

- How to define it? Minimize $\alpha \cdot T_{\text{period}} + \beta \cdot T_{\text{latency}}$?
- Problem: the two values are not comparable
- Instead, minimize $T_{\text{period}}$ for a fixed latency
- Or minimize $T_{\text{latency}}$ for a fixed period




### Working out an example

Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

- **Optimal period?** $T_{\text{period}} = 7$, $\mathcal{S}_1 \rightarrow P_1$, $\mathcal{S}_2 \mathcal{S}_3 \rightarrow P_2$, $\mathcal{S}_4 \rightarrow P_3$ ($T_{\text{latency}} = 17$)
- **Optimal latency?** $T_{\text{latency}} = 12$, $\mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \rightarrow P_1$ ($T_{\text{period}} = 12$)
- **Min. latency if $T_{\text{period}} \leq 10$?** $T_{\text{latency}} = 14$, $\mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \rightarrow P_1$, $\mathcal{S}_4 \rightarrow P_2$
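These three answers can be checked by exhaustive search. The pipeline figure is not reproduced here, so the sketch below assumes stage costs $w = (14, 4, 2, 4)$, which is what the later replication/data-parallelism formulas use, and it ignores communications, which is consistent with the values quoted above. It is a brute-force check, not the method used by the authors.

```python
from itertools import combinations, permutations


def interval_mappings(n, p):
    """Yield (intervals, processors) for every interval mapping of n stages."""
    for m in range(1, min(n, p) + 1):
        # Choose m - 1 cut points to split stages 0..n-1 into m intervals.
        for cuts in combinations(range(1, n), m - 1):
            bounds = (0, *cuts, n)
            intervals = [(bounds[i], bounds[i + 1]) for i in range(m)]
            # Each interval gets its own distinct processor.
            for procs in permutations(range(p), m):
                yield intervals, procs


def evaluate(w, s, intervals, procs):
    """Return (period, latency) without communication costs."""
    times = [sum(w[d:e]) / s[u] for (d, e), u in zip(intervals, procs)]
    return max(times), sum(times)


w = [14, 4, 2, 4]   # assumed stage costs (see lead-in)
s = [2, 1, 1, 1]    # processor speeds from the slide
results = [evaluate(w, s, iv, pr) for iv, pr in interval_mappings(len(w), len(s))]

best_period = min(period for period, _ in results)
best_latency = min(latency for _, latency in results)
best_latency_bounded = min(lat for per, lat in results if per <= 10)
```

Under these assumptions the search reproduces the slide: period 7, latency 12, and latency 14 when the period is bounded by 10.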

**Replicate** interval $[\mathcal{S}_u..\mathcal{S}_v]$ on $P_1, \ldots, P_q$

$$T_{\text{period}} = \frac{\sum_{k=u}^{v} w_k}{q \times \min_i(s_i)} \quad \text{and} \quad T_{\text{latency}} = q \times T_{\text{period}}$$

Anne.Benoit@ens-lyon.fr

**Data-parallelize** single stage $\mathcal{S}_k$ on $P_1, \ldots, P_q$


**Optimal period?**

$$\mathcal{S}_1 \xrightarrow{\mathrm{DP}} P_1 P_2, \quad \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \xrightarrow{\mathrm{REP}} P_3 P_4$$

$$T_{\text{period}} = \max\left(\frac{14}{2+1}, \frac{4+2+4}{2 \times 1}\right) = 5, \quad T_{\text{latency}} = 14.67$$

$$\mathcal{S}_1 \xrightarrow{\mathrm{DP}} P_2 P_3 P_4, \quad \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \rightarrow P_1$$

$$T_{\text{period}} = \max\left(\frac{14}{1+1+1}, \frac{4+2+4}{2}\right) = 5, \quad T_{\text{latency}} = 9.67 \ (\text{optimal})$$
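Both mappings can be re-evaluated with a few lines of Python. The representation of a mapping as `(stages, processors, mode)` triples is a hypothetical choice made for this sketch, and the stage costs $w = (14, 4, 2, 4)$ are read off the formulas above.

```python
def evaluate_mapping(w, s, mapping):
    """Return (period, latency) of a mapping without communication costs.

    mapping: list of (stages, processors, mode) with mode "DP" or "REP";
    stages and processors are 1-indexed keys into the dicts w and s.
    """
    period, latency = 0.0, 0.0
    for stages, procs, mode in mapping:
        work = sum(w[k] for k in stages)
        speeds = [s[u] for u in procs]
        if mode == "DP":
            # Perfect sharing: cost and traversal delay coincide.
            cost = delay = work / sum(speeds)
        elif mode == "REP":
            # Round-robin: the slowest replica sets the delay.
            delay = work / min(speeds)
            cost = delay / len(speeds)
        else:
            raise ValueError(f"unknown mode: {mode}")
        period = max(period, cost)
        latency += delay
    return period, latency


w = {1: 14, 2: 4, 3: 2, 4: 4}
s = {1: 2, 2: 1, 3: 1, 4: 1}

first = evaluate_mapping(w, s, [((1,), (1, 2), "DP"), ((2, 3, 4), (3, 4), "REP")])
second = evaluate_mapping(w, s, [((1,), (2, 3, 4), "DP"), ((2, 3, 4), (1,), "REP")])
```

A single processor is treated as a replication of size one, which gives the plain cost $w/s$. Both mappings reach period 5; only the second one brings the latency down to about 9.67.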

### Part 1 - Communications, monolithic stages, mono-criterion



- Pipeline graph
- Different platforms, with communications
- Different mapping strategies
- Only monolithic stages: no replication nor data-parallelism
- Mono-criterion: period minimization
- Complexity results, heuristics and experiments





|                    | Fully Hom.                  | Comm. Hom.                  |
|--------------------|-----------------------------|-----------------------------|
| One-to-one Mapping | polynomial                  | polynomial                  |
| Interval Mapping   | polynomial                  | NP-complete                 |
| General Mapping    | same complexity as Interval | same complexity as Interval |

- Binary search polynomial algorithm for ONE-TO-ONE MAPPING
- Dynamic programming algorithm for INTERVAL MAPPING on *Fully Homogeneous* platforms (NP-hard otherwise)
- GENERAL MAPPING: same complexity as INTERVAL MAPPING
- All problem instances NP-complete on *Fully Heterogeneous* platforms

#### One-to-one/Comm. Hom.: binary search algorithm

For a candidate period $T_{\text{period}}$ (the outer binary search iterates over candidate values), test feasibility greedily:

- Work with the fastest $n$ processors, numbered $P_1$ to $P_n$, where $s_1 \leq s_2 \leq \ldots \leq s_n$
- Mark all stages $\mathcal{S}_1$ to $\mathcal{S}_n$ as free
- **For** $u = 1$ **to** $n$
  - Pick any free stage $\mathcal{S}_k$ s.t. $\delta_{k-1}/b + w_k/s_u + \delta_k/b \le T_{\text{period}}$
  - Assign $\mathcal{S}_k$ to $P_u$, and mark $\mathcal{S}_k$ as already assigned
  - If no stage is found, return "failure"
- Proof: exchange argument
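The greedy test and the surrounding binary search can be sketched as follows. The slide does not say which values the binary search iterates over; this sketch assumes the candidate periods are the $n \times n$ cycle-times of a stage placed on a processor, since the optimal period must be one of them. Function names are illustrative, stages are 0-indexed, and `delta[k]` is taken as the input size of stage `k`.

```python
def stage_cost(k, s, w, delta, b):
    """Cycle-time of stage k (0-indexed) alone on a processor of speed s."""
    return delta[k] / b + w[k] / s + delta[k + 1] / b


def greedy_assignment(w, delta, procs, b, period):
    """Greedy feasibility test; `procs` holds the n fastest speeds, slowest first."""
    free = list(range(len(w)))
    assignment = []
    for s in procs:
        # Any free stage that fits will do (exchange argument): take the first.
        fit = next((k for k in free if stage_cost(k, s, w, delta, b) <= period), None)
        if fit is None:
            return None  # "failure"
        free.remove(fit)
        assignment.append((fit, s))
    return assignment


def one_to_one_min_period(w, delta, speeds, b):
    """Binary search over candidate periods; requires len(speeds) >= len(w)."""
    n = len(w)
    procs = sorted(speeds)[-n:]
    candidates = sorted({stage_cost(k, s, w, delta, b) for k in range(n) for s in procs})
    lo, hi = 0, len(candidates) - 1  # the largest candidate is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if greedy_assignment(w, delta, procs, b, candidates[mid]) is not None:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo], greedy_assignment(w, delta, procs, b, candidates[lo])


# Illustrative instance: no communications, speeds as in the worked example.
best, assignment = one_to_one_min_period([14, 4, 2, 4], [0, 0, 0, 0, 0], [2, 1, 1, 1], b=10)
```

Feasibility is monotone in the period, which is what makes the binary search valid: any assignment that meets a period also meets every larger one.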


#### Greedy heuristics

Target clusters: *Comm. Hom.* platforms and INTERVAL MAPPING

- **H1a-GR**: random, fixed intervals
- **H1b-GRIL**: random interval length
- **H2-GSW**: biggest $\sum w$. Place the interval with the most computations on the fastest processor
- **H3-GSD**: biggest $\delta_{in} + \delta_{out}$. Intervals are sorted by communications ($\delta_{in} + \delta_{out}$), where $in$ is the first stage of the interval and $(out - 1)$ the last one
- **H4-GP**: biggest period on fastest processor. Balances computation and communication: processors are sorted by decreasing speed $s_u$; for the current processor $u$, choose the interval with the biggest period $(\delta_{in} + \delta_{out})/b + \sum_{i \in \text{Interval}} w_i/s_u$
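To make the H4-GP rule concrete, here is a rough Python sketch. It rests on assumptions the slide leaves open: the intervals are already fixed (as in H1a), there are at most as many intervals as processors, and $\delta_{in}$ / $\delta_{out}$ are taken as the input size of the first stage and the output size of the last stage of the interval. The function name and the sample data are illustrative.

```python
def h4_gp(intervals, w, delta, speeds, b):
    """Greedy H4-GP-style allocation of fixed intervals to processors.

    intervals : list of (d, e), 0-indexed inclusive stage ranges
    delta[k]  : input size of stage k; delta[k + 1] is its output size
    Returns {interval: processor index}.
    """
    if len(intervals) > len(speeds):
        raise ValueError("need at least one processor per interval")
    remaining = list(intervals)
    alloc = {}
    # Visit processors by decreasing speed (ties keep their original order).
    for u in sorted(range(len(speeds)), key=lambda u: -speeds[u]):
        if not remaining:
            break

        def period(interval):
            d, e = interval
            return (delta[d] + delta[e + 1]) / b + sum(w[d:e + 1]) / speeds[u]

        # The interval with the biggest period goes to the current processor.
        chosen = max(remaining, key=period)
        remaining.remove(chosen)
        alloc[chosen] = u
    return alloc


alloc = h4_gp([(0, 0), (1, 2), (3, 3)], [14, 4, 2, 4], [0, 0, 0, 0, 0], [1, 2, 1, 1], b=10)
```

In this example the heaviest interval lands on the only speed-2 processor, and the remaining intervals are spread over the speed-1 processors in order.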

#### Sophisticated heuristics

- **H5-BS121**: binary search for ONE-TO-ONE MAPPING. Optimal algorithm for ONE-TO-ONE MAPPING; when $p < n$, the application is cut into fixed intervals of length $L$.
- **H6-SPL**: splitting intervals. Processors are sorted by decreasing speed, and all stages go to the first processor. At each step, select the used processor $j$ with the largest period and split its interval, giving a fraction of its stages to $j'$, so as to minimize $\max(\text{period}(j), \text{period}(j'))$; keep the split if the maximum period improves.
- **H7a-BSL** and **H7b-BSC**: binary search (longest/closest). Binary search on the period $P$: start with stage $s = 1$ and build intervals $(s, s')$ fitting on processors. For each processor $u$ and each $s' \ge s$, compute $\text{period}(s..s', u)$ and check whether it is smaller than $P$. H7a maximizes $s'$; H7b chooses the closest period.

#### Plan of experiments

- Assess performance of polynomial heuristics
- Random applications, $n = 1$ to $50$ stages
- Random platforms, $p = 10$ and $p = 100$ processors
- $b = 10$ (comm. hom.), processor speed between 1 and 20
- Relevant parameters: ratios $\frac{\delta}{b}$ and $\frac{w}{s}$
- Average over 100 similar random application/platform pairs


# Experiment 1 - balanced comm/comp, hom comm

- $\delta_i = 10$, computation time between 1 and 20
- Two platform sizes: 10 processors and 100 processors



# Experiment 2 - balanced comm/comp, het comm

- communication time between 1 and 100
- computation time between 1 and 20




### Experiment 3 - large computations

- communication time between 1 and 20
- computation time between 10 and 1000




### Experiment 4 - small computations

- communication time between 1 and 20
- computation time between 0.01 and 10




# Summary of experiments

- The heuristics are much more efficient than random mappings
- Three dominant heuristics for different cases:
  - Insignificant communications (hom. or small) and many processors: H5-BS121 (ONE-TO-ONE MAPPING)
  - Insignificant communications (hom. or small) and few processors: H7b-BSC (binary search with a clever choice of where to split)
  - Important communications (het. or big): H6-SPL (splitting choice relevant for any number of processors)




- set of heuristics and experiments
- balanced comm/comp, het comm (Exp. 2)



### Part 2 - Simpler model with no communications, but with replication/DP and bi-criteria






- Pipeline and fork graphs
- Different platforms, without communications
- INTERVAL MAPPING only
- Replicable stages, and either data-parallelism or not
- Bi-criteria optimization
- Complexity results only

#### Without data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline |            | -         |             |
| Het. pipeline | Poly (str) |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

#### With data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline |            | -         |             |
| Het. pipeline | Poly (DP)  |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

#### Without data-parallelism, Heterogeneous platforms

| Objective     | period       | latency    | bi-criteria |
|---------------|--------------|------------|-------------|
| Hom. pipeline | Poly (*)     | -          | Poly (*)    |
| Het. pipeline | NP-hard (**) | Poly (str) | NP-hard     |
| Hom. fork     |              | Poly (*)   |             |
| Het. fork     | NP-hard      | -          | -           |

#### With data-parallelism, Heterogeneous platforms

| Objective     | period | latency | bi-criteria |
|---------------|--------|---------|-------------|
| Hom. pipeline |        | NP-hard |             |
| Het. pipeline |        | -       |             |
| Hom. fork     |        | NP-hard |             |
| Het. fork     |        | -       |             |

Most interesting case: without data-parallelism, Heterogeneous platforms.

## No data-parallelism, Heterogeneous platforms

- For pipeline, minimizing the latency is straightforward: map all stages on the fastest processor
- Minimizing the period is NP-hard for a general pipeline (involved reduction, similar to the heterogeneous chain-to-chain one)
- Homogeneous pipeline (all stages have the same workload $w$): polynomial complexity
- Polynomial bi-criteria algorithm for homogeneous pipeline


## Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform

#### Lemma

If an optimal solution that minimizes the pipeline period uses $q$ processors, consider the $q$ fastest processors $P_1, \ldots, P_q$, ordered by non-decreasing speeds: $s_1 \leq \ldots \leq s_q$. Then there exists an optimal solution which replicates intervals of stages onto $k$ intervals of processors $I_r = [P_{d_r}, P_{e_r}]$, with $1 \leq r \leq k \leq q$, $d_1 = 1$, $e_k = q$, and $e_r + 1 = d_{r+1}$ for $1 \leq r < k$.

Proof: exchange argument, which does not increase the latency.
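
As an illustration of the notation only (the values $q = 5$ and $k = 3$ are chosen arbitrarily and are not from the talk), one admissible family of processor intervals is:

$$I_1 = [P_1, P_2], \quad I_2 = [P_3, P_3], \quad I_3 = [P_4, P_5]$$

Here $d_1 = 1$, $e_3 = 5 = q$, $e_1 + 1 = 3 = d_2$ and $e_2 + 1 = 4 = d_3$: the intervals are consecutive and together cover $P_1, \ldots, P_5$, so every one of the $q$ fastest processors is used and the slowest processors are grouped together.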

## Binary-search/Dynamic programming algorithm

- Given a latency L and a period K
- Loop on the number of processors q
- Dynamic programming algorithm to minimize the latency while fitting in period K
- Success if a latency of at most L is obtained
- Binary search on L to minimize the latency for a fixed period
- Binary search on K to minimize the period for a fixed latency
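
The outer search on K can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes integer workload and speeds, assumes that the optimal period is one of the finitely many values m.w/(k.s) (m stages replicated on k processors whose slowest has speed s), and takes the feasibility test as a callable `feasible(K)` standing in for the dynamic-programming check. All names are hypothetical.

```python
from fractions import Fraction


def candidate_periods(n, w, speeds):
    """Sorted distinct values m*w/(k*s): assumed set of possible periods."""
    p = len(speeds)
    return sorted({Fraction(m * w, k * s)
                   for m in range(1, n + 1)
                   for k in range(1, p + 1)
                   for s in speeds})


def smallest_feasible(candidates, feasible):
    """Binary search for the smallest candidate K with feasible(K) true.

    Relies on monotonicity: if period K is feasible, so is any larger period.
    Returns None when even the largest candidate is infeasible.
    """
    if not candidates or not feasible(candidates[-1]):
        return None
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            hi = mid          # mid works: the answer is mid or to its left
        else:
            lo = mid + 1      # mid fails: the answer is strictly to its right
    return candidates[lo]


if __name__ == "__main__":
    cands = candidate_periods(2, 1, [1, 2])
    # Toy oracle in place of the dynamic-programming test.
    print(smallest_feasible(cands, lambda K: K >= 1))
```

Searching over a sorted finite candidate set keeps the number of iterations logarithmic in the number of candidates and avoids floating-point tolerances; the symmetric search on L would follow the same pattern with a different candidate set.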

## Dynamic programming algorithm

Compute $L(n, 1, q)$, where $L(m, i, j)$ is the minimum latency to map $m$ pipeline stages onto processors $P_i$ to $P_j$, while fitting in period $K$.

$$L(m,i,j) = \min \begin{cases} \frac{m.w}{s_i} & \text{if } \frac{m.w}{(j-i+1).s_i} \le K \quad (1) \\ \min_{\substack{1 \le m' < m \\ i \le k < j}} \bigl( L(m',i,k) + L(m-m',k+1,j) \bigr) & \quad (2) \end{cases}$$

Case (1): replicating the $m$ stages onto the $j-i+1$ processors $P_i, \ldots, P_j$; the slowest one, $P_i$, determines both the latency and the period.

Case (2): splitting the interval.

Initialization:

$$L(1, i, j) = \begin{cases} \frac{w}{s_i} & \text{if } \frac{w}{(j-i+1).s_i} \le K \\ +\infty & \text{otherwise} \end{cases}$$

$$L(m, i, i) = \begin{cases} \frac{m.w}{s_i} & \text{if } \frac{m.w}{s_i} \le K \\ +\infty & \text{otherwise} \end{cases}$$

- Complexity of the dynamic programming:  $O(n^2.p^4)$
- The number of iterations of the binary search is formally bounded, and very small in practice.
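
The recurrence and the loop on q can be sketched in a few lines. This is an illustrative, memoized version under stated assumptions rather than the authors' code: processors are 0-indexed, the interval P_i..P_j is counted as j-i+1 processors (consistent with the initialization of L(m, i, i)), and the function name is hypothetical.

```python
import math
from functools import lru_cache


def min_latency_within_period(n, w, speeds, K):
    """Minimum latency for n stages of workload w under period bound K.

    Tries q = 1..p, each time using the q fastest processors sorted by
    non-decreasing speed. Returns math.inf if no mapping fits in K.
    """
    ordered = sorted(speeds)
    best = math.inf
    for q in range(1, len(ordered) + 1):
        s = ordered[-q:]  # the q fastest processors, slowest first

        @lru_cache(maxsize=None)
        def L(m, i, j):
            value = math.inf
            # Case (1): replicate the m stages onto s[i..j]; s[i] is slowest.
            if m * w / ((j - i + 1) * s[i]) <= K:
                value = m * w / s[i]
            # Case (2): split both the stages and the processor interval.
            for m1 in range(1, m):
                for k in range(i, j):
                    value = min(value, L(m1, i, k) + L(m - m1, k + 1, j))
            return value

        best = min(best, L(n, 0, q - 1))
    return best


if __name__ == "__main__":
    print(min_latency_within_period(2, 1, [1, 2], 1))
```

Each of the O(n.p^2) table entries is computed in O(n.p) time, and the table is rebuilt for each of the p values of q, which matches the stated bound.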

## Outline

- 2 Working out an example
- 3 Part 1: Communications, monolithic stages, mono-criterion
- 4 Part 2: Simpler model with no communications, but with replication/DP and bi-criteria

## Related work

- Subhlok and Vondran: extension of their work (pipeline on homogeneous platforms)
- Chains-to-chains: in our work, possibility to replicate or data-parallelize
- Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.]
- Energy-aware mapping of pipelined computations [Melhem et al.]: three-criteria optimization
- Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.], fault-tolerance for embedded systems [Zhu et al.]
- Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.]

## Conclusion

Theoretical side

- Complexity results for several cases
- Solid theoretical foundation for the study of single/bi-criteria mappings, with the possibility to replicate and data-parallelize application stages

Practical side

- Optimal polynomial algorithms, heuristics for NP-hard instances of the problem
- Experiments: Comparison of heuristics performance
- Linear program to assess the absolute performance of the heuristics, which turns out to be quite good

## Also in the pipeline

#### Bi-criteria

- Several heuristics and experiments not detailed in this talk
- Bi-criteria linear program
- Real experiments on a JPEG encoder pipeline application

#### Three-criteria

- Introduction of failure probabilities to the model
- Replication for fault-tolerance vs replication for parallelism
  - compute the same data set several times in case of failure
  - uses more resources and does not decrease period or latency
  - three objectives: min latency and period, max reliability
- Complexity analysis

## Future work

#### Short term

- Heuristics for *Fully Heterogeneous* platforms and other NP-hard instances of the problem
- Extension to DAG-trees (a DAG which is a tree when un-oriented)

#### Longer term

- Heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels
- Real experiments on heterogeneous clusters, using an already-implemented skeleton library and MPI
- Comparison of effective performance against theoretical performance

## Open problems

- Energy savings
  - processors that can run at different frequencies
  - trade-off between energy consumption and speed
- Simultaneous execution of several (concurrent) workflows
  - competition for CPU and network resources
  - fairness between applications (stretch)
  - sensitivity to application/platform parameter changes