#use wml::std::lang
#use wml::fmt::isolatin

#include <menu-item.wml>
#include <bibstyle.wml>
#include <banner.wml>  title="Research" path="research"
<br>
<center>


<table border=0 cellspacing=0 cellpadding=5 width="98%">
<tr><td>

<frame title="Documents">
<table border=0 cellspacing=0 cellpadding=10 width="100%">
<tr valign=top><td valign=top width="100%">

<ul>
<li> a list of my <a href="publications.html">publications</a>
(journal and conference articles, and freely available research reports)
<li> some material from the <a href="talks.html">talks</a> I have given
</ul>

</td></tr></table>
</frame>

</td></tr>
</table>






<table border=0 cellspacing=0 cellpadding=5 width="98%">
<tr valign=top><td valign=top width="50%">

<frame title="Pipelining Collective Communications">
<table border=0 cellspacing=0 cellpadding=10 width="100%">
<tr valign=top><td valign=top width="100%">

<h3>Motivations and Principles</h3>
<p align="justify">
Recent advances in high-performance networks and in communication
libraries dedicated to distributed computing make it possible to use
heterogeneous networks of workstations instead of expensive dedicated
supercomputers such as massively parallel machines. However, even if
it is now possible to obtain excellent performance on a given network,
it is difficult to sustain this performance as soon as several
networks are gathered (as in Grid computing). This is mainly due to
the heterogeneity of the resulting platform and to the effect of
bottlenecks. Moreover, discovering the topology of the computing
platform is often challenging, although it is crucial for the
efficient deployment of grid applications.

</p>

<p align="justify">
The communications involved in such distributed applications can be
gathered into macro-communication schemes, such as broadcast, scatter,
all-to-all or reduce operations. These schemes have usually been
studied with the goal of minimizing their makespan, i.e. the time
elapsed between the emission of the first message by the source and
the last reception. But in many cases, the application has to perform
a large number of instances of the same operation (for example when
data parallelism is used). When dealing with such a series of
macro-communications, pipelining is mandatory to achieve good
performance, and the relevant objective becomes the optimization of
the throughput, that is, the average number of macro-communications
executed per time unit. </p>
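<p align="justify">
To illustrate the difference between the two metrics, here is a small
self-contained sketch (a toy model, not taken from the publications
below): a series of messages is pipelined along a chain of links. The
first message still pays the sum of the link times, but the asymptotic
throughput is dictated by the slowest link alone.</p>

```python
# Toy model: forward a series of messages along a chain of links.
# Link i takes t[i] time units per message; different links can work
# on different messages at the same time (pipelining).
def chain_times(t, n_messages):
    """Return the time at which each message leaves the last link."""
    n_links = len(t)
    finish = [0.0] * n_links     # time at which each link becomes free
    out = []
    for _ in range(n_messages):
        ready = 0.0              # message available at the source
        for i in range(n_links):
            # link i starts when the message has arrived AND it is free
            start = max(ready, finish[i])
            finish[i] = start + t[i]
            ready = finish[i]
        out.append(ready)
    return out

t = [1.0, 3.0, 2.0]              # heterogeneous link times
times = chain_times(t, 100)
print(times[0])                  # 6.0 = 1 + 3 + 2 (makespan of one message)
print(round(100 / times[-1], 2)) # ~0.33 = 1 / max(t): the slowest link
```

The steady-state regime appears after a short start-up phase: every `max(t)` time units, one more message leaves the chain.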

<h3>Results</h3>
<p align="justify">
We have studied these collective communications in the context of
throughput maximization, and designed scheduling strategies with
optimal throughput (leading to asymptotically optimal schedules with
respect to the classical makespan metric) for most of the classical
collective communication primitives. Here is a summary of our work in
this field:</p>

<ul>

<li> <p align="justify">
We first studied the Scatter and Reduce operations, and developed a
new framework for this problem, based on a linear-programming
relaxation. This resulted in an <a
href="publications.html#apdcm04-scatter">article in APDCM'04</a>; an <a
href="publications.html#jpdc05">extended journal version (in
JPDC)</a> will soon be available.</p>

<li><p align="justify">
We then moved to the study of the most famous collective communication
schemes: Broadcast and Multicast. As message replication is allowed
for these operations (and necessary to reach good performance), these
problems are more challenging than the previous ones. However, using
graph techniques, we were able to design the same kind of schedules
(optimal with respect to the throughput metric) for the broadcast
problem (see the <a
href="publications.html#ipdps04-broadcast">article in IPDPS'04</a> and
the <a href="publications.html#tpds04">extended version in the IEEE
TPDS journal</a>). The Multicast operation turns out to be harder: we
have shown that computing an optimal multicast schedule is in fact an
NP-complete problem (see the <a
href="publications.html#multicast04-icpp">article in ICPP'04</a>).</p>

<li> <p align="justify"> 
All the previous studies assume a bidirectional one-port model: at
each time step, each machine can perform at most one send and one
receive operation. Although this model is widely accepted, we adapted
our results to the unidirectional one-port model, in which a machine
cannot send and receive simultaneously. Surprisingly, even the Scatter
problem turns out to be much more complicated than in the
bidirectional case, but we designed an approach based on the ellipsoid
method for solving large (non-polynomial size) linear programs (this
technical material is available as a <a
href="publications.html#rr2004-32">research report</a> and in our
<a href="publications.html#ijhpca06-1">IJHPCA paper</a>).
</p>


<li><p align="justify">
As the previous approach for broadcasting a series of messages may
lead to a large control overhead (since it uses several concurrent
broadcast trees), we have designed heuristics that achieve good
performance while keeping the control complexity low (that is, using a
single broadcast tree). We present this work, over a large range of
platform models, in the <a
href="publications.html#ipdps05-broadcast">article in
IPDPS'05</a>.</p>

</ul>
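<p align="justify">
The linear-programming flavour of the steady-state approach can be
sketched in a few lines (a deliberately simplified toy, not the actual
formulation of the papers above): on a star platform under the
bidirectional one-port model, every steady-state constraint is linear
in the throughput, and for this simple topology the LP even collapses
to a closed form.</p>

```python
# Simplified steady-state analysis of a series of Scatter operations
# on a star platform (bidirectional one-port model). Toy model, not
# the formulation used in the papers.
def scatter_throughput(c):
    """c[i] = time for the master to send one message to worker i.
    Each steady-state constraint reads rho * occupation <= 1, where
    `occupation` is the time a port stays busy per scatter operation."""
    occupations = [sum(c)]           # master's send port serves everyone
    occupations += list(c)           # worker i's receive port
    return 1.0 / max(occupations)    # the tightest constraint wins

print(scatter_throughput([2.0, 1.0, 4.0]))  # 1/7: master port saturated
```

On general (multi-hop) platforms the constraints no longer collapse this way, which is where an actual LP solver becomes necessary.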


</td></tr></table>
</frame>

</td>
<td valign=top width="50%">

<frame title="Other scheduling issues">
<table border=0 cellspacing=0 cellpadding=10 width="100%">
<tr valign=top><td valign=top width="100%">

<h3> Steady-State Relaxation</h3>

<p align="justify">
The idea of relaxing makespan minimization problems using
linear-programming techniques was also applied to some classical
scheduling problems. <br>
We studied the problem of scheduling a series (or stream) of identical
DAGs with different input data on heterogeneous platforms, and found a
way to compute a periodic schedule reaching the optimal throughput for
simple DAGs (that is, DAGs with bounded dependency depth). This work
is available as a <a href="publications.html#ppl03">journal article in
PPL</a>. The limitation to bounded dependencies is not artificial: we
have shown that the general problem is NP-complete; see the
corresponding <a href="publications.html#rr2004-20">research
report</a> for a complete (and rather technical) treatment, or the <a
href="publications.html#heteropar04-steady-state">article in
HeteroPar'04</a>, which summarizes the impact and limits of this
steady-state approach. </p>
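<p align="justify">
The shape of such a steady-state linear program can be sketched as
follows (a deliberately simplified version: communication constraints
are omitted, and the notation is ours, not that of the papers). Let
$\alpha_{p,t}$ be the average number of tasks of type $t$ processed by
machine $p$ per time unit, and $w_{p,t}$ the corresponding processing
time:</p>

```latex
\begin{align*}
\text{maximize } \rho \text{ subject to}\quad
 & \textstyle\sum_{p} \alpha_{p,t} = \rho
   && \text{for each task type } t \text{ (every instance is processed)}\\
 & \textstyle\sum_{t} \alpha_{p,t}\, w_{p,t} \le 1
   && \text{for each machine } p \text{ (no machine is overloaded)}\\
 & \alpha_{p,t} \ge 0
\end{align*}
```

<p align="justify">
An optimal rational solution of such an LP is then turned into an
actual periodic schedule, which is where the bounded-dependency-depth
assumption comes into play.</p>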




<h3> From steady state theory to concrete scheduling problems </h3>

<p align="justify">
Our study of steady-state scheduling has been extended to several
problems, briefly described here:</p>




 <p align="justify"> <b>Independent tasks scheduling</b>
 In an <a href="publications.html#pdp05-buffer">article presented at
PDP'05</a>, we studied the impact of introducing memory constraints
into the classical problem of scheduling independent tasks on a
master-slave platform. </p>

 <p align="justify"> <b> Scheduling divisible load applications on a large-scale network</b>
Using a new model for the communications
between sites in a Grid computing environment, we have investigated a
new framework for scheduling divisible load on such a platform. This
work is available as a <a href="publications.html#rr2004-21">research
report</a> and an <a
href="publications.html#ipdps05-divisible">IPDPS'05 article</a>.
</p>

<p align="justify"> <b>Scheduling multiple bag-of-tasks
applications</b> We have studied the problem of scheduling multiple
applications, each consisting of a large number of same-size
independent tasks. The steady-state approach gives an optimal
schedule, but requires all the information on the platform to be
gathered at the master. We have therefore investigated heuristics that
give a distributed answer to this problem. This work was presented at
<a href="publications.html#ipdps06_msma">IPDPS 2006</a> (slides
available in the <a href="talks.html">talks</a> section).</p>



<h3> Simulation </h3>

<p align="justify">
When dealing with scheduling issues on distributed platforms, the
justification of the proposed algorithms and heuristics is a key
problem.<br>

One approach is to perform experiments with real applications on real
resources. However, modern computing platforms are increasingly
distributed and often span multiple administrative domains. Resource
availability fluctuations therefore make it impossible to conduct
repeatable experiments for relatively long-running applications.
Another problem is that the number of platform configurations that can
be explored is limited. As a result of these difficulties with real
experiments, most researchers have resorted to discrete-event
simulation.<br>

The main advantage of simulation is the ability to compare several
algorithms on a wide variety of heterogeneous platforms (thanks to the
fast execution of simulations) while ensuring the reproducibility of
the measured data, which is not possible on a distributed platform
that is not dedicated to the experiments.<br>

These are the motivations for the development of the <a
href="http://gcl.ucsd.edu/simgrid/">SimGrid</a> toolkit, which allows
users to simulate distributed algorithms in a heterogeneous
distributed environment. I have participated in the design of a new
network model for this simulator. The technical results of this work
can be found in the corresponding <a
href="publications.html#rr2002-40">research report</a>, while an <a
href="publications.html#ccgrid03-simgrid">article in CCGrid'03</a>
presents the second version of the simulator (which takes the new
model into account).
</p>
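<p align="justify">
As a flavour of the methodology (a self-contained toy in plain Python,
not SimGrid's actual interface), the sketch below compares two
task-distribution heuristics on a simulated heterogeneous platform;
seeding the random generator is what makes the experiment
reproducible.</p>

```python
import random

# Toy simulation of a master distributing tasks to heterogeneous
# workers; an illustration of simulation-based comparison, not
# SimGrid's API.
def simulate(speeds, n_tasks, policy, seed=0):
    rng = random.Random(seed)            # fixed seed => reproducible run
    sizes = [rng.uniform(1.0, 2.0) for _ in range(n_tasks)]
    free_at = [0.0] * len(speeds)        # time each worker becomes idle
    for k, size in enumerate(sizes):
        if policy == "earliest-finish":  # greedy list scheduling
            w = min(range(len(speeds)),
                    key=lambda i: free_at[i] + size / speeds[i])
        else:                            # blind round-robin
            w = k % len(speeds)
        free_at[w] += size / speeds[w]
    return max(free_at)                  # simulated makespan

speeds = [1.0, 2.0, 4.0]                 # heterogeneous worker speeds
print(simulate(speeds, 200, "round-robin"))
print(simulate(speeds, 200, "earliest-finish"))  # noticeably smaller
```

Rerunning with the same seed yields the exact same figures, and varying `speeds` explores platform configurations that would be costly to reproduce on real resources.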




</td></tr></table>
</frame>

</td></tr>

</table>






</center>