Master 2 course for 2016-2017: CR02 - Resilient and energy-aware scheduling algorithms
Anne Benoit; courses on Monday, 13h15-15h15, room B1
Course notes and slides:
Sept. 15 at 10h15: The great scheduling zoo
(see the slides). If you were intrigued by
the problem of scheduling unrelated parallel machines, you can have
a look at this paper (but beware,
it is difficult).
Sept. 19: Divisible load scheduling
(see the slides). Todo: Prove the theorem on Slide 61, showing that a
multi-round schedule cannot improve on an optimal single-round
schedule by more than a factor of 2, before looking at the solution.
If you want to see another
beautiful relaxation technique that makes scheduling problems
tractable, you can have a look at steady-state scheduling,
for instance paper1.
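If you want to experiment with the single-round model, here is a minimal sketch in Python. The bus network and the linear cost model are assumptions of this example: sending a chunk of size a costs c*a, computing it costs w*a, and the classical optimality principle that all workers finish at the same time yields a geometric distribution of chunk sizes.

```python
# Sketch of single-round divisible load scheduling on a bus network.
# Assumptions: the master sends chunks to workers one after the other
# (the bus serializes communications), and each worker starts computing
# as soon as its chunk is fully received.

def chunk_sizes(W, p, c, w):
    # Equal-finish-time condition: c*a_{i+1} + w*a_{i+1} = w*a_i,
    # hence a_{i+1} = a_i * w / (c + w) -- a geometric sequence.
    r = w / (c + w)
    alpha1 = W * (1 - r) / (1 - r ** p)   # normalize so chunks sum to W
    return [alpha1 * r ** i for i in range(p)]

def finish_times(alphas, c, w):
    t, finishes = 0.0, []
    for a in alphas:
        t += c * a                  # bus is busy sending chunk a
        finishes.append(t + w * a)  # worker computes right after receiving
    return finishes

alphas = chunk_sizes(100.0, 4, 1.0, 5.0)   # illustrative parameters
f = finish_times(alphas, 1.0, 5.0)
assert max(f) - min(f) < 1e-9   # all workers finish simultaneously
```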
Sept. 26: Introduction to resilience: Faults and
MTBF, and checkpointing protocols (see the slides). Todo: Compute the value of T to minimize the waste. *** Homework for October 17 ***
Oct. 3: Probabilistic models to determine
the optimal checkpointing period, both in the coordinated protocol
and in the hierarchical one, and brief presentation of the double
checkpointing algorithm (see the slides). More details on the
failures, protocols and probabilistic models can be found in the first chapter of the
monograph "Fault tolerance techniques for high-performance
computing" (see here), and a
full study of hierarchical protocols is available here.
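As a pointer for the coordinated case, the first-order waste analysis leads to the classical Young/Daly period T = sqrt(2*C*mu), where C is the checkpoint cost and mu the platform MTBF. A minimal sketch (the numeric values are illustrative assumptions):

```python
import math

# First-order waste of periodic coordinated checkpointing:
# waste(T) ~ C/T + T/(2*mu); minimizing over T gives the
# Young/Daly period T_opt = sqrt(2*C*mu).

def waste(T, C, mu):
    return C / T + T / (2 * mu)

def optimal_period(C, mu):
    return math.sqrt(2 * C * mu)

C, mu = 600.0, 86400.0          # illustrative: 10-min checkpoint, 1-day MTBF
T_opt = optimal_period(C, mu)   # about 10182 s
assert waste(T_opt, C, mu) < min(waste(0.5 * T_opt, C, mu),
                                 waste(2 * T_opt, C, mu))
```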
Oct. 10: Advanced probabilistic models: analysis of the double
checkpointing algorithm, and of the impact of prediction
and replication (see the slides). More details in the
monograph, and details about the buddy algorithm here, analysis of prediction here, and prediction with intervals here. Todo: Figure out how to obtain the formula in the last
theorem, and don't forget to finish your homework for next week!
Oct. 11 at 13h15
Oct. 17: Wrapping up replication, and presentation of algorithm-based
fault tolerance (ABFT) and a composite approach (see the slides).
Checkpointing a DAG: computing the expected execution
time and complexity results (see the slides). Todo: Think of the "join" case for DAGs: where does the
complexity come from? Does the ordering of tasks matter?
In which order should we proceed if we have both checkpointed
and non-checkpointed tasks?
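For the chain case, a useful building block is the expected time to execute a checkpointed task under exponentially distributed failures; the classical closed form (derived in the monograph cited above) can be sketched as follows, where lam is the failure rate, D the downtime, and R the recovery cost (the numeric values are illustrative assumptions):

```python
import math

def expected_time(w, C, R, D, lam):
    """Expected time to execute work w followed by a checkpoint of cost C,
    under exponential failures of rate lam, with downtime D and recovery R:
    E[T(w)] = exp(lam*R) * (1/lam + D) * (exp(lam*(w + C)) - 1)."""
    return math.exp(lam * R) * (1.0 / lam + D) * math.expm1(lam * (w + C))

# Sanity check: when failures are very rare, the expected time tends
# to the failure-free time w + C.
assert abs(expected_time(100.0, 10.0, 5.0, 1.0, 1e-9) - 110.0) < 1e-3
```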
Oct. 24: Back on DAGs (NP-completeness for joins), and dealing with silent
errors (see the slides). *** Start thinking about your choice of paper(s) for the final
evaluation (see below) ***
Nov. 7: Energy-aware scheduling
algorithms (see the slides). Todo: Prove the lemma for slack reclaiming with continuous
speeds for fork and join graphs (see last slide).
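The cubic power model from the slides can be checked numerically: assuming dynamic power s^3, executing w units of work at speed s takes w/s time and consumes w*s^2 energy, and by convexity a single constant speed meeting the deadline is optimal among continuous speeds. A small sketch (workload, deadline, and speeds are illustrative assumptions):

```python
def energy(work, speed):
    # assumed cubic power model: power = speed**3, time = work/speed,
    # hence energy = power * time = work * speed**2
    return work * speed ** 2

w, D = 10.0, 5.0                 # illustrative workload and deadline
e_const = energy(w, w / D)       # run everything at the single speed w/D = 2.0
# Any two-speed schedule meeting the same deadline consumes more energy:
s1 = 3.0
t1 = (w / 2) / s1                # first half of the work at speed 3.0
s2 = (w / 2) / (D - t1)          # second half at the speed that meets D
e_two = energy(w / 2, s1) + energy(w / 2, s2)
assert e_const < e_two           # constant speed wins, by convexity of s**2
```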
The final evaluation for this course is based on a bibliographical
study of one or two research articles to be picked in the list below
(two articles for topics partially covered during the course). Each
student must choose a topic and the corresponding article(s) and notify the
teacher by mail (to Anne.Benoit at ens-lyon.fr, with a first-come
first-serve algorithm). As soon as a topic has been picked, it will be
indicated on this page. Each student then
has to write a report (4 to 8 pages) presenting and commenting on the
article(s), and to present it in class (a presentation of 15 minutes
followed by 5 minutes of questions). The report must be sent by
Sunday, December 11 at the latest, and presentations will be spread over
December 12 and 13.
The final grade will build upon the report, the presentation, and the
participation in the questions. Also, each student will be asked to write
a short review (less than one page) on one of the other presentations,
once all presentations are done.
Large-scale distributed systems correspond to a wide range of platforms, such as
high-performance supercomputers, sensor networks, volunteer computing grids, etc.
All these systems have a very large scale, and include millions of components.
Such a large scale induces two major challenges: resilience and energy consumption.
Resilience is (loosely) defined as the ability to survive failures.
For instance, a failure will occur every 50 minutes in a system with one million components,
even if the MTBF (mean time between failures) of a single component is as large as 100 years.
Obviously, failure handling is critical for highly parallel applications that use a large number of components
for a significant amount of time, because such applications are likely to experience at least one failure
during execution. But failure handling cannot be ignored for other applications.
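The 50-minute figure follows from a simple computation: assuming independent, identically distributed failures, the MTBF of a platform with N components is the individual MTBF divided by N. A quick check in Python:

```python
# Platform MTBF for N i.i.d. components (assumption: failures are
# independent, so the platform failure rate is N times the component rate).

def platform_mtbf_minutes(component_mtbf_years, n_components):
    minutes_per_year = 365.25 * 24 * 60
    return component_mtbf_years * minutes_per_year / n_components

# 100-year component MTBF, one million components -> about 52.6 minutes
print(platform_mtbf_minutes(100, 10**6))
```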
Large-scale distributed systems face a second important challenge: power consumption.
Power management is necessary due to both monetary and environmental constraints.
Presently, large computing centers are among the largest consumers of energy.
Energy is needed to provide power to the individual cores and also to provide cooling for the system.
In future distributed systems, it is anticipated that the power dissipated to perform communications
and I/O transfers will make up a much larger share of the overall power consumption.
In fact, the relative cost of communication is expected to increase dramatically,
both in terms of latency/overhead and of consumed energy.
Outline of the class:
Techniques and scheduling algorithms for resilience
Introduction to resilience
Handling failures by adding redundancy (replicating some work)
Checkpointing and recovery techniques
Handling silent errors with verifications
Adding energy into the picture
Why is energy consumption an important topic?
Scheduling algorithms using dynamic voltage and frequency scaling (DVFS)
Static power and switching off to reduce energy consumption
Energy consumption and resilience
Prerequisites: Knowledge on classical algorithmic techniques (dynamic
programming, greedy algorithms) and on complexity analysis
(NP-completeness, approximation algorithms).
Each student will be given two research articles to synthesize,
compare, and criticize. This work will be evaluated through a written
report and an oral presentation. Furthermore, there will be some
homework during the semester.