Master 2 course for 2016-2017: CR02 - Resilient and energy-aware scheduling algorithms

Lecturer: Anne Benoit; Courses Monday 13h15-15h15 room B1

Course notes and slides:

Sept. 15 at 10h15: The great scheduling zoo (see the slides). If you were intrigued by the problem of scheduling unrelated parallel machines, you can have a look at this paper (but beware, it is difficult).
Sept. 19: Divisible load scheduling (see the slides).
Todo: Prove the theorem on Slide 61, showing that a multi-round schedule cannot improve an optimal single-round schedule by more than a factor 2, before looking at the solution (Slide 62).
If you want to see another beautiful technique of relaxation so that scheduling problems become tractable, you can have a look at steady-state scheduling, for instance paper1 and paper2.
Sept. 26: Introduction to resilience: Faults and MTBF, and checkpointing protocols (see the slides).
Todo: Compute the value of T to minimize the waste.
*** Homework for October 17 ***
Oct. 3: Probabilistic models to determine the optimal checkpointing period, both in the coordinated protocol and in the hierarchical one, and brief presentation of the double checkpointing algorithm (see the slides). More details on the failures, protocols and probabilistic models can be found in the first chapter of the monograph "Fault tolerance techniques for high-performance computing" (see here), and a full study of hierachical protocols is available here.
Oct. 10: Advanced probabilistic models: analysis of the buddy algorithm, and of the impact of prediction and replication (see the slides). More details in the monograph, and details about the buddy algorithm here, analysis of prediction here, and prediction with intervals here.
Todo: Figure out how to obtain the formula in the last theorem, and don't forget to finish your homework for next week!
~~Oct. 11 at 13h15~~
Oct. 17: Wrapping up replication, and presentation of algorithm-based fault tolerance (ABFT) and a composite approach. (see the slides).
Checkpointing a DAG: computing the expected execution time and complexity results (see the slides).
Todo: Think of the "join" case for DAGs: where does the complexity comes from? Does the ordering of tasks matter? In which order should we proceed if we have both checkpointed and non-checkpointed tasks?
Oct. 24: Back on DAGs (NP-completeness for joins), and dealing with silent errors (see the slides).
*** Start thinking about your choice of paper(s) for the final evaluation (see below) ***
Autumn break
Nov. 7: Energy-aware scheduling algorithms (see the slides).
Todo: Prove the lemma for slack reclaiming with continuous speeds for fork and join graphs (see last slide).
Nov. 21: Wrapping up last course (for slack reclaiming with discrete speeds) and pipelined applications: tri-criteria optimization problems involving power minimization (see the slides).
Reference papers for the last two courses: greedy algorithms, energy models, streaming applications.
Dec. 5: Combining resilience and energy (see the slides).
*** Recall that reports for final evaluation are due Dec. 11 (to be sent by email in pdf format) ***
Dec. 12: Student presentations: Edin, Xuan, Johanna, Octave, Etienne, Valentin.
Dec. 13 at 13h30, room B1: Student presentations: Victor, Maverick, Laureline, Jean-Yves, Rémi.

Final evaluation: The final evaluation for this course is based on a bibliographical study of one or two research articles to be picked in the list below (two articles for topics partially covered during the course). Each student must choose a topic and the corresponding article(s) and notify the teacher by mail (to Anne.Benoit at ens-lyon.fr, with a first-come first-serve algorithm). As soon as a topic has been picked, it will be indicated on this page. Each student then has to write a report (4 to 8 pages) presenting and commenting the chosen article(s), and present it in class (presentation of 15 minutes followed by 5 minutes of questions). The report must be sent at the latest on Sunday December 11, and presentations will be spread between December 12 and December 13. The final grade will build upon the report, the presentation, and the participation to questions. Also, each student will be asked to write a short review (less than one page) on one of the other presentations, once all presentations are done.

Steady-state scheduling: paper1 and paper2.
** Edin Husic ** Scheduling on unrelated machines: paper1 and paper2.
** Thi Xuan Vu ** Multi-organization scheduling: paper.
Communication-aware processor allocation: paper.
Original Young and Daly papers: paper1 and paper2.
Prediction: paper1 and paper2.
ABFT: paper1 and paper2.
Scheduling a chain of tasks: paper1, and one between paper2 or paper3.
** Victor Mollimard ** Multi-level checkpointing: paper.
** Maverick Chardet ** Checkpointing multi-threaded applications: paper.
Soft error resiliency: paper.
Proactive memory error avoidance: paper.
** Laureline Pinault ** Non-memoryless failure distributions: paper.
Fault-tolerant MapReduce-MPI: paper.
** Jean-Yves Franceschi ** Fault modeling with machine learning: paper.
** Rémi Grunblatt ** Failure detection and propagation: paper.
** Johanna Seif ** Fondamental paper on energy: paper.
Energy-efficient kernels: paper.
Maximizing throughput under power budget: paper.
** Octave Mariotti ** Dynamic power sharing: paper.
Data-driven scheduling approach for power management: paper.
** Etienne Moutot ** Cooling and energy: paper.
** Valentin Honoré ** Energy and resilience: paper.
Adaptive and power-aware resilience: paper.

Motivation: Large-scale distributed systems correspond to wide range of platforms, such as high-performance supercomputers, sensor networks, volunteer computing grids, etc. All these systems have a very large scale, and include millions of components. Such a large scale induces two major problems: resilience and energy-consumption.

Resilience is (loosely) defined as surviving to failures. For instance, a failure will occur every 50 minutes in a system with one million components, even if the MTBF (mean time between failures) of a single component is as large as 100 years. Obviously, failure handling is critical for highly parallel applications that use a large number of components for a significant amount of time, because such applications are likely to experience at least one failure during execution. But failure handling cannot be ignored for other applications.

Large-scale distributed systems face a second important challenge: power consumption. Power management is necessary due to both monetary and environmental constraints. Presently large computing centers are among the largest consumers of energy. Energy is needed to provide power to the individual cores and also to provide cooling for the system. In future distributed systems, it is anticipated that the power dissipated to perform communications and I/O transfers will make up a much larger share of the overall power consumption. In fact, the relative cost of communication is expected to increase dramatically, both in terms of latency/overhead and of consumed energy.

Outline of the class:

Techniques and scheduling algorithms for resilience
- Introduction to resilience
- Handling failures by adding redundancy (replicating some work)
- Checkpointing and recovery techniques
- Predicting failures
- Handling silent errors with verifications
Adding energy into the picture
- Why energy consumption is an important topic?
- Scheduling algorithms using dynamic voltage and frequency scaling (DVFS)
- Static power and switching off to reduce energy consumption
- Energy consumption and resilience

Prerequisites: Knowledge on classical algorithmic techniques (dynamic programming, greedy algorithms) and on complexity analysis (NP-completeness, approximation algorithms).

Evaluation: Each student will be given two research articles to synthesize, compare, and criticize. This work will be evaluated through a written report and an oral presentation. Furthermore, there will be some homework during the semester.

E-Mail: Anne.Benoit at ens-lyon.fr

Last modified: Mon Dec 5 23:59:50 CET 2016