# Complexity results for throughput and latency optimization of replicated and data-parallel workflows

#### Anne Benoit and Yves Robert

GRAAL team, LIP École Normale Supérieure de Lyon

June 2007



#### Introduction and motivation

- Mapping workflow applications onto parallel platforms: a difficult challenge
- Heterogeneous clusters, fully heterogeneous platforms: even more difficult!
- Structured programming approach
  - Easier to program (deadlocks, process starvation)
  - Range of well-known paradigms (pipeline, farm)
  - Algorithmic skeleton: help for mapping



# Rule of the game

- Consecutive data sets fed into the workflow
- Period $T_{\text{period}}$ = time interval between the beginning of execution of two consecutive data sets (throughput = $1/T_{\text{period}}$)
- Latency $T_{\text{latency}}(x)$ = time elapsed between the beginning and the end of execution for a given data set $x$, and $T_{\text{latency}} = \max_{x} T_{\text{latency}}(x)$
- Map each pipeline/fork stage on one or several processors
- Goal: minimize $T_{\text{period}}$ or $T_{\text{latency}}$, or bi-criteria minimization



## Replication and data-parallelism

**Replicate** stage $\mathcal{S}_k$ on $P_1, \ldots, P_q$ (here $q = 3$): consecutive data sets are dealt out round-robin

- $\mathcal{S}_k$ on $P_1$: data sets 1, 4, 7, ...
- $\mathcal{S}_k$ on $P_2$: data sets 2, 5, 8, ...
- $\mathcal{S}_k$ on $P_3$: data sets 3, 6, 9, ...

$\mathcal{S}_{k-1}$ feeds every replica, and every replica feeds $\mathcal{S}_{k+1}$.

**Data-parallelize** stage $\mathcal{S}_k$ on $P_1, \ldots, P_q$: the work of each data set is shared among the processors

- $\mathcal{S}_k$ ($w = 16$) split across $P_1$ ($s_1 = 2$), $P_2$ ($s_2 = 1$) and $P_3$ ($s_3 = 1$)
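The two strategies can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the proportional split used for data-parallelism is an assumption (it is the split under which all processors finish together, which matches a cost of $w / \sum_i s_i$).

```python
def replicate_round_robin(num_data_sets, q):
    """Replication: data set x (1-based) is handled by processor ((x - 1) mod q) + 1."""
    assignment = {u: [] for u in range(1, q + 1)}
    for x in range(1, num_data_sets + 1):
        assignment[(x - 1) % q + 1].append(x)
    return assignment


def data_parallel_shares(w, speeds):
    """Data-parallelism (assumed split): share the work w of ONE data set
    proportionally to processor speeds, so that all processors finish together."""
    total_speed = sum(speeds)
    return [w * s / total_speed for s in speeds]


print(replicate_round_robin(9, 3))
# {1: [1, 4, 7], 2: [2, 5, 8], 3: [3, 6, 9]}
print(data_parallel_shares(16, [2, 1, 1]))
# [8.0, 4.0, 4.0]
```

With $w = 16$ and speeds $2, 1, 1$, every processor is busy for $16 / (2 + 1 + 1) = 4$ time units under this split.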



# Major contributions

- Complexity results for throughput and latency optimization of replicated and data-parallel workflows
- Theoretical approach to the problem
  - definition of replication and data-parallelism
  - formal definition of $T_{\text{period}}$ and $T_{\text{latency}}$ in each case
- Problem complexity: focus on pipeline and fork workflows

#### Outline

- Framework
- Working out an example
- The problem
- Complexity results
- Conclusion



# Pipeline graphs



- $n$ stages $\mathcal{S}_k$, $1 \le k \le n$
- $\mathcal{S}_k$:
  - receives input of size $\delta_{k-1}$ from $\mathcal{S}_{k-1}$
  - performs $w_k$ computations
  - outputs data of size $\delta_k$ to $\mathcal{S}_{k+1}$

## Fork graphs



- n+1 stages  $S_k$ ,  $0 \le k \le n$ 
  - $S_0$ : root stage
  - $S_1$  to  $S_n$ : independent stages
- A data set goes through stage  $S_0$ , then it can be executed simultaneously for all other stages



## The platform



- p processors  $P_u$ ,  $1 \le u \le p$ , fully interconnected
- $s_u$ : speed of processor  $P_u$
- bidirectional link $\text{link}_{u,v}: P_u \to P_v$, bandwidth $b_{u,v}$
- one-port model: each processor can either send, receive or compute at any time-step



# Different platforms

Fully Homogeneous – Identical processors ( $s_u = s$ ) and links ( $b_{u,v} = b$ ): typical parallel machines

Communication Homogeneous – Different-speed processors  $(s_u \neq s_v)$ , identical links  $(b_{u,v} = b)$ : networks of workstations, clusters

Fully Heterogeneous – Fully heterogeneous architectures,  $s_u \neq s_v$  and  $b_{u,v} \neq b_{u',v'}$ : hierarchical platforms, grids











## Back to pipeline: mapping strategies



In this work, Interval Mapping

#### Chains-on-chains

Load-balance contiguous tasks

With p = 4 identical processors?

5 7 3 4 | 8 1 3 8 | 2 9 7 | 3 5 2 3 6

$$T_{\text{period}} = 20$$

NP-hard for different-speed processors, even without communications
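For identical processors, the chains-on-chains instance above can be checked with a short sketch. The approach shown here (binary search on the bottleneck value with a greedy feasibility probe) is one standard technique for integer task weights, not necessarily the one the authors have in mind; the function name is hypothetical.

```python
def min_bottleneck(tasks, p):
    """Smallest achievable maximum interval load when the task chain is cut
    into at most p contiguous intervals (identical processors, integer weights)."""

    def fits(cap):
        # Greedy probe: fill each interval as much as possible without exceeding cap.
        parts, load = 1, 0
        for t in tasks:
            if load + t > cap:
                parts, load = parts + 1, t
            else:
                load += t
        return parts <= p

    lo, hi = max(tasks), sum(tasks)  # the optimum lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


tasks = [5, 7, 3, 4, 8, 1, 3, 8, 2, 9, 7, 3, 5, 2, 3, 6]
print(min_bottleneck(tasks, 4))  # 20
```

The total load is 76, so no partition into 4 intervals can go below 19; the probe shows that 19 is not reachable with contiguous intervals, which confirms the value 20.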



Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

#### Optimal period?

$T_{\text{period}} = 7$, $\mathcal{S}_1 \to P_1$, $\mathcal{S}_2 \mathcal{S}_3 \to P_2$, $\mathcal{S}_4 \to P_3$ ($T_{\text{latency}} = 17$)

#### Optimal latency?

$T_{\text{latency}} = 12$, $\mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \to P_1$ ($T_{\text{period}} = 12$)

#### Min. latency if $T_{\text{period}} \le 10$?

$T_{\text{latency}} = 14$, $\mathcal{S}_1 \mathcal{S}_2 \mathcal{S}_3 \to P_1$, $\mathcal{S}_4 \to P_2$
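These three answers can be reproduced by brute force. The sketch below assumes stage weights $w = (14, 4, 2, 4)$, which are the values that appear in the replication and data-parallelism formulas of this example, and it ignores communication costs; the function name is hypothetical.

```python
from itertools import combinations, permutations


def interval_mappings(weights, speeds):
    """Yield (period, latency, bounds, procs) for every interval mapping:
    contiguous intervals of stages, each on a distinct processor,
    no replication, no data-parallelism, no communication cost."""
    n, p = len(weights), len(speeds)
    for m in range(1, min(n, p) + 1):
        for cuts in combinations(range(1, n), m - 1):
            bounds = (0, *cuts, n)
            loads = [sum(weights[bounds[j]:bounds[j + 1]]) for j in range(m)]
            for procs in permutations(range(p), m):
                times = [loads[j] / speeds[procs[j]] for j in range(m)]
                yield max(times), sum(times), bounds, procs


weights = [14, 4, 2, 4]  # assumed stage weights
speeds = [2, 1, 1, 1]    # s_1 = 2, s_2 = s_3 = s_4 = 1

mappings = list(interval_mappings(weights, speeds))
best_period = min(period for period, _, _, _ in mappings)
best_latency = min(latency for _, latency, _, _ in mappings)
best_latency_capped = min(latency for period, latency, _, _ in mappings if period <= 10)
print(best_period, best_latency, best_latency_capped)  # 7.0 12.0 14.0
```

Exhaustive enumeration is only practical for toy instances like this one; it is meant as a check of the numbers, not as a mapping algorithm.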



## Example with replication and data-parallelism

Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

**Replicate** interval $[\mathcal{S}_u..\mathcal{S}_v]$ on $P_1, \ldots, P_q$ (here $q = 3$)

- $\mathcal{S}_u \ldots \mathcal{S}_v$ on $P_1$: data sets 1, 4, 7, ...
- $\mathcal{S}_u \ldots \mathcal{S}_v$ on $P_2$: data sets 2, 5, 8, ...
- $\mathcal{S}_u \ldots \mathcal{S}_v$ on $P_3$: data sets 3, 6, 9, ...

$$T_{\text{period}} = \frac{\sum_{k=u}^{v} w_k}{q \times \min_i (s_i)} \quad \text{and} \quad T_{\text{latency}} = q \times T_{\text{period}}$$

**Data-parallelize** single stage $\mathcal{S}_k$ on $P_1, \ldots, P_q$

$$T_{\text{period}} = \frac{w_k}{\sum_{i=1}^{q} s_i} \quad \text{and} \quad T_{\text{latency}} = T_{\text{period}}$$

## Example with replication and data-parallelism

Interval mapping, 4 processors, $s_1 = 2$ and $s_2 = s_3 = s_4 = 1$

#### Optimal period?

- $\mathcal{S}_1 \overset{\mathrm{DP}}{\to} P_1 P_2$, $\mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \overset{\mathrm{REP}}{\to} P_3 P_4$:
  $T_{\text{period}} = \max\left(\frac{14}{2+1}, \frac{4+2+4}{2 \times 1}\right) = 5$, $T_{\text{latency}} = 14.67$
- $\mathcal{S}_1 \overset{\mathrm{DP}}{\to} P_2 P_3 P_4$, $\mathcal{S}_2 \mathcal{S}_3 \mathcal{S}_4 \to P_1$:
  $T_{\text{period}} = \max\left(\frac{14}{1+1+1}, \frac{4+2+4}{2}\right) = 5$, $T_{\text{latency}} = 9.67$ (optimal)
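The two mappings can be evaluated with a small sketch built on the replication and data-parallelism costs (no communication). The helper names are hypothetical; each helper returns a pair (period, delay to traverse the interval).

```python
def data_parallel(work, speeds):
    """Data-parallelized interval: period and delay are both work / sum of speeds."""
    t = work / sum(speeds)
    return t, t


def replicate(work, speeds):
    """Replicated interval: the delay is set by the slowest processor,
    and the period is that delay divided by the number of replicas."""
    delay = work / min(speeds)
    return delay / len(speeds), delay


def evaluate(intervals):
    """Pipeline period = max of interval periods; latency = sum of interval delays."""
    period = max(p for p, _ in intervals)
    latency = sum(d for _, d in intervals)
    return period, latency


# S1 data-parallelized on P1, P2; S2 S3 S4 replicated on P3, P4
first = evaluate([data_parallel(14, [2, 1]), replicate(4 + 2 + 4, [1, 1])])
# S1 data-parallelized on P2, P3, P4; S2 S3 S4 on P1 alone
second = evaluate([data_parallel(14, [1, 1, 1]), replicate(4 + 2 + 4, [2])])

print(first[0], round(first[1], 2))    # 5.0 14.67
print(second[0], round(second[1], 2))  # 5.0 9.67
```

Both mappings reach the same period, but the second one avoids the replication delay on slow processors, hence its lower latency.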



## Interval Mapping for pipeline graphs

- Several consecutive stages onto the same processor
- Increase computational load, reduce communications
- Partition of $[1..n]$ into $m$ intervals $I_j = [d_j, e_j]$ (with $d_j \le e_j$ for $1 \le j \le m$, $d_1 = 1$, $d_{j+1} = e_j + 1$ for $1 \le j \le m-1$ and $e_m = n$)
- Interval $I_j$ mapped onto processor $P_{\text{alloc}(j)}$

$$T_{\text{period}} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1), \text{alloc}(j)}} + \frac{\sum_{i = d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} + \frac{\delta_{e_j}}{b_{\text{alloc}(j), \text{alloc}(j+1)}} \right\}$$

$$T_{\text{latency}} = \sum_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b_{\text{alloc}(j-1), \text{alloc}(j)}} + \frac{\sum_{i = d_j}^{e_j} w_i}{s_{\text{alloc}(j)}} + \frac{\delta_{e_j}}{b_{\text{alloc}(j), \text{alloc}(j+1)}} \right\}$$
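A minimal sketch of these two formulas follows, under a simplifying assumption: a Communication Homogeneous platform where every link has the same bandwidth $b$, which avoids having to name the processors $\text{alloc}(0)$ and $\text{alloc}(m+1)$ that hold the pipeline input and output. The data sizes in the example are made up, and the function name is hypothetical. The latency is computed exactly as the sum written above.

```python
def period_and_latency(delta, w, speeds, b, intervals, alloc):
    """Evaluate an interval mapping with communication costs.

    delta[k]  : size of the data sent from stage k to stage k + 1 (delta[0] = pipeline input)
    w[k - 1]  : computation weight of stage k (stages are numbered 1..n)
    speeds[u] : speed of processor u
    b         : common link bandwidth (assumption: communication-homogeneous platform)
    intervals : list of (d_j, e_j), 1-based and inclusive
    alloc     : alloc[j] = processor that runs the j-th interval
    """
    costs = []
    for (d, e), u in zip(intervals, alloc):
        comm_in = delta[d - 1] / b
        compute = sum(w[d - 1:e]) / speeds[u]
        comm_out = delta[e] / b
        costs.append(comm_in + compute + comm_out)
    return max(costs), sum(costs)


delta = [1, 2, 1, 2, 1]  # hypothetical data sizes delta_0 .. delta_4
w = [14, 4, 2, 4]
speeds = [2, 1, 1, 1]
result = period_and_latency(delta, w, speeds, b=1,
                            intervals=[(1, 1), (2, 3), (4, 4)], alloc=[0, 1, 2])
print(result)  # (10.0, 27.0)
```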



## Fork graphs

- map any partition of the graph onto the processors
- $q$ intervals, $q \le p$
- first interval: $\mathcal{S}_0$ and possibly $\mathcal{S}_1$ to $\mathcal{S}_k$
- next intervals of independent stages
- $T_{\text{period}} = ?$ It depends on the communication model: is it possible to start communications as soon as $\mathcal{S}_0$ is done? In which order are communications performed?
- Informally: $T_{\text{period}}$ = maximum time needed by a processor to receive data, compute, and output its result
- $T_{\text{latency}}$ = time elapsed between the input of a data set to $\mathcal{S}_0$ and the completion of the last computation for this data set
- Simpler model for formal analysis



## Back to a simpler problem

- No communication costs nor overheads
- Cost to execute $\mathcal{S}_i$ on $P_u$ alone: $\frac{w_i}{s_u}$
- Cost to data-parallelize $[\mathcal{S}_i, \mathcal{S}_j]$ ($i = j$ for pipeline; $0 < i \le j$ or $i = j = 0$ for fork) on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{\sum_{u=1}^{k} s_{q_u}}$$

  Cost = $T_{\text{period}}$ of assigned processors = delay to traverse the interval

- Cost to replicate $[\mathcal{S}_i, \mathcal{S}_j]$ on $k$ processors $P_{q_1}, \ldots, P_{q_k}$:

$$\frac{\sum_{\ell=i}^{j} w_{\ell}}{k \times \min_{1 \le u \le k} s_{q_u}}$$

  Cost = $T_{\text{period}}$ of assigned processors. Delay to traverse the interval = time needed by the slowest processor:

$$t_{\max} = \frac{\sum_{\ell=i}^{j} w_{\ell}}{\min_{1 \le u \le k} s_{q_u}}$$

- With these formulas: easy to compute $T_{\text{period}}$ for both graphs, and $T_{\text{latency}}$ for pipeline graphs



- partition of stages into $q$ sets $\mathcal{I}_r$ ($1 \le r \le q \le p$)
- $\mathcal{S}_0 \in \mathcal{I}_1$, with $\mathcal{I}_1$ mapped to $k$ processors $P_{q_1}, \ldots, P_{q_k}$
- $t_{\max}(r)$ = delay of the $r$-th set ($1 \le r \le q$), computed as before
- flexible communication model: computations of $\mathcal{I}_r$, $r \ge 2$, start as soon as the computation of $\mathcal{S}_0$ is completed
- $s_0$ = speed at which $\mathcal{S}_0$ is processed:
  - $s_0 = \sum_{u=1}^{k} s_{q_u}$ if $\mathcal{I}_1$ is data-parallelized
  - $s_0 = \min_{1 \le u \le k} s_{q_u}$ if $\mathcal{I}_1$ is replicated

$$T_{\text{latency}} = \max \left( t_{\max}(1), \frac{w_0}{s_0} + \max_{2 \le r \le q} t_{\max}(r) \right)$$
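The fork latency formula can be sketched as follows. The representation of a set as a tuple (total work, processor speeds, mode) and the function names are assumptions made for the illustration, and the numeric values are made up.

```python
def t_max(work, speeds, mode):
    """Delay to traverse a set: work / sum of speeds if data-parallelized ("dp"),
    work / slowest speed if replicated ("rep")."""
    return work / (sum(speeds) if mode == "dp" else min(speeds))


def fork_latency(w0, sets):
    """sets[0] is the set containing S_0; each set is (total_work, speeds, mode).
    The other sets start as soon as S_0 is computed (flexible communication model)."""
    first, others = sets[0], sets[1:]
    s0 = sum(first[1]) if first[2] == "dp" else min(first[1])
    tail = max((t_max(*s) for s in others), default=0.0)
    return max(t_max(*first), w0 / s0 + tail)


# Hypothetical instance: S_0 (w_0 = 4) and S_1 (w_1 = 2) replicated on two processors
# of speed 2; one set of work 6 data-parallelized on three processors of speed 1;
# one set of work 3 on a single processor of speed 1.
latency = fork_latency(4, [(6, [2, 2], "rep"), (6, [1, 1, 1], "dp"), (3, [1], "rep")])
print(latency)  # 5.0
```

Here $t_{\max}(1) = 3$, $w_0 / s_0 = 2$ and the slowest of the other sets takes 3, so the latency is $\max(3, 2 + 3) = 5$.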


## Optimization problem

#### Given

- an application graph (n-stage pipeline or (n + 1)-stage fork),
- a target platform (Homogeneous with p identical processors or Heterogeneous with p different-speed processors),
- a mapping strategy with replication, and either with data-parallelization or without,
- an objective (period $T_{\text{period}}$ or latency $T_{\text{latency}}$),

determine an interval-based mapping that minimizes the objective.

This gives 16 optimization problems (2 graphs × 2 platforms × 2 mapping strategies × 2 objectives).

## Bi-criteria optimization problem

- given threshold period $\mathcal{P}_{\text{threshold}}$, determine mapping whose period does not exceed $\mathcal{P}_{\text{threshold}}$ and that minimizes $T_{\text{latency}}$
- given threshold latency  $\mathcal{L}_{\text{threshold}}$ , determine mapping whose latency does not exceed  $\mathcal{L}_{\text{threshold}}$  and that minimizes  $T_{\text{period}}$

# Complexity results

Without data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline |            | -         |             |
| Het. pipeline | Poly (str) |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

# Complexity results

With data-parallelism, Homogeneous platforms

| Objective     | period     | latency   | bi-criteria |
|---------------|------------|-----------|-------------|
| Hom. pipeline |            | -         |             |
| Het. pipeline | Poly (DP)  |           |             |
| Hom. fork     | -          | Poly (DP) |             |
| Het. fork     | Poly (str) | NP-hard   |             |

# Complexity results

Without data-parallelism, Heterogeneous platforms

| Objective     | period       | latency    | bi-criteria |
|---------------|--------------|------------|-------------|
| Hom. pipeline | Poly (*)     | -          | Poly (*)    |
| Het. pipeline | NP-hard (**) | Poly (str) | NP-hard     |
| Hom. fork     |              | Poly (*)   |             |
| Het. fork     | NP-hard      | -          | -           |

# Complexity results

With data-parallelism, Heterogeneous platforms

| Objective     | period | latency | bi-criteria |
|---------------|--------|---------|-------------|
| Hom. pipeline |        | NP-hard |             |
| Het. pipeline |        | -       |             |
| Hom. fork     |        | NP-hard |             |
| Het. fork     |        | -       |             |

Most interesting case: without data-parallelism, *Heterogeneous* platforms.

## No data-parallelism, Heterogeneous platforms

- For pipeline, minimizing the latency is straightforward: map all stages onto the fastest processor
- Minimizing the period is NP-hard for a general pipeline (involved reduction, similar to the heterogeneous chains-to-chains one)
- Homogeneous pipeline (all stages have the same workload $w$): polynomial complexity
- Polynomial bi-criteria algorithm for homogeneous pipeline



### Lemma: form of the solution

Pipeline, no data-parallelism, Heterogeneous platform

#### Lemma

If an optimal solution minimizing the pipeline period uses $q$ processors, consider the $q$ fastest processors $P_1, \ldots, P_q$, ordered by non-decreasing speeds: $s_1 \le \ldots \le s_q$.

There exists an optimal solution which replicates intervals of stages onto $k$ intervals of processors $I_r = [P_{d_r}, P_{e_r}]$, with $1 \le r \le k \le q$, $d_1 = 1$, $e_k = q$, and $e_r + 1 = d_{r+1}$ for $1 \le r < k$.

Proof: exchange argument, which does not increase the latency




## Binary-search/Dynamic programming algorithm

- Given latency L, given period K
- Loop on number of processors q
- Dynamic programming algorithm to minimize latency
- Success if L is obtained
- Binary search on L to minimize latency for fixed period
- Binary search on K to minimize period for fixed latency

## Dynamic programming algorithm

- Compute $L(n, 1, q)$, where $L(m, i, j)$ = minimum latency to map $m$ pipeline stages on processors $P_i$ to $P_j$, while fitting in period $K$.

$$L(m,i,j) = \min \begin{cases} \dfrac{m \cdot w}{s_i} & \text{if } \dfrac{m \cdot w}{(j-i+1) \cdot s_i} \le K \quad (1) \\[2ex] \min\limits_{\substack{1 \le m' < m \\ i \le k < j}} \bigl( L(m',i,k) + L(m-m',k+1,j) \bigr) & (2) \end{cases}$$

- Case (1): replicating $m$ stages onto the $j-i+1$ processors $P_i, \ldots, P_j$
- Case (2): splitting the interval

#### Initialization:

$$L(1, i, j) = \begin{cases} \dfrac{w}{s_i} & \text{if } \dfrac{w}{(j-i+1) \cdot s_i} \le K \\ +\infty & \text{otherwise} \end{cases}$$

$$L(m, i, i) = \begin{cases} \dfrac{m \cdot w}{s_i} & \text{if } \dfrac{m \cdot w}{s_i} \le K \\ +\infty & \text{otherwise} \end{cases}$$

- Complexity of the dynamic programming: $O(n^2 \cdot p^4)$
- Number of iterations of the binary search formally bounded, very small number of iterations in practice.
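The recurrence can be sketched with memoization as below. This is an illustration under stated assumptions, not the authors' implementation: indices are 0-based, the replication test counts the processors $P_i, \ldots, P_j$ as $j - i + 1$, the loop on $q$ always takes the $q$ fastest processors (as the lemma allows), and the function name is hypothetical. The binary search on $K$ or $L$ is not shown.

```python
import math
from functools import lru_cache


def min_latency(n, w, speeds, K):
    """Minimum latency for a homogeneous n-stage pipeline (each stage of weight w)
    with replication only, subject to period <= K. Returns math.inf if infeasible."""
    fastest_first = sorted(speeds, reverse=True)
    best = math.inf
    for q in range(1, len(speeds) + 1):
        s = sorted(fastest_first[:q])  # the q fastest, by non-decreasing speed

        @lru_cache(maxsize=None)
        def L(m, i, j):
            value = math.inf
            # Case (1): replicate the m stages onto processors i..j (slowest is s[i]).
            if m * w / ((j - i + 1) * s[i]) <= K:
                value = m * w / s[i]
            # Case (2): split both the stages and the processor interval.
            for m_left in range(1, m):
                for k in range(i, j):
                    value = min(value, L(m_left, i, k) + L(m - m_left, k + 1, j))
            return value

        best = min(best, L(n, 0, q - 1))
    return best


# Hypothetical instance: 4 stages of weight 6, speeds 2, 1, 1, 1
print(min_latency(4, 6, [2, 1, 1, 1], K=12))  # 12.0
print(min_latency(4, 6, [2, 1, 1, 1], K=6))   # 18.0
print(min_latency(4, 6, [2, 1, 1, 1], K=4))   # inf
```

With $K = 12$ everything fits on the fastest processor. With $K = 6$ the fastest processor can hold only two stages on its own, and the two remaining stages run at speed 1, which gives $2 \times 3 + 2 \times 6 = 18$.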



## Related work

Subhlok and Vondran: extension of their work (pipeline on homogeneous platforms)

Chains-to-chains: in our work, possibility to replicate or data-parallelize

Mapping pipelined computations onto clusters and grids: DAG [Taura et al.], DataCutter [Saltz et al.]

Energy-aware mapping of pipelined computations [Melhem et al.], three-criteria optimization

Mapping pipelined computations onto special-purpose architectures: FPGA arrays [Fabiani et al.], fault-tolerance for embedded systems [Zhu et al.]

Mapping skeletons onto clusters and grids: use of stochastic process algebra [Benoit et al.]



## Conclusion

- Mapping structured workflow applications onto computational platforms, with replication and data-parallelism
- Complexity of the most tractable instances → insight into the combinatorial nature of the problem
- Pipeline and fork graphs, extension to fork-join
- Homogeneous and Heterogeneous platforms with no communications
- Minimizing period or latency, and bi-criteria optimization problems
- Solid theoretical foundation for study of single/bi-criteria mappings, with possibility to replicate and data-parallelize application stages



## Future work

#### Short term

- Select polynomial instances of the problem and assess complexity when adding communication
- Design heuristics to solve combinatorial instances of the problem

#### Longer term

- Heuristics based on our polynomial algorithms for general application graphs structured as combinations of pipeline and fork kernels
- Real experiments on heterogeneous clusters
- Comparison of effective performance against theoretical performance

