Master 2 course for 2015-2016: Resilient and energy-aware scheduling algorithms

Lecturer: Anne Benoit

Course homepage: http://graal.ens-lyon.fr/~abenoit/CR02/

Motivation: Large-scale distributed systems correspond to a wide range of platforms, such as high-performance supercomputers, sensor networks, volunteer computing grids, etc. All these systems have a very large scale and include millions of components. Such a large scale induces two major problems: resilience and energy consumption.

Resilience is (loosely) defined as surviving failures. For instance, a failure will occur roughly every 50 minutes in a system with one million components, even if the MTBF (mean time between failures) of a single component is as large as 100 years. Obviously, failure handling is critical for highly parallel applications that use a large number of components for a significant amount of time, because such applications are likely to experience at least one failure during execution. But failure handling cannot be ignored for other applications either.
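The 50-minute figure can be checked with a short calculation. The sketch below assumes that component failures are independent and identically distributed, so that the platform MTBF is the individual MTBF divided by the number of components; the function name and the 365.25-day year are illustrative choices, not part of the course material.

```python
# Assumption: a year is counted as 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def platform_mtbf_minutes(component_mtbf_years: float, num_components: int) -> float:
    """Return the platform MTBF in minutes.

    Assumes independent, identically distributed component failures,
    in which case the platform MTBF is the component MTBF divided by
    the number of components.
    """
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mtbf_years * MINUTES_PER_YEAR / num_components


if __name__ == "__main__":
    mtbf = platform_mtbf_minutes(100, 1_000_000)
    print(f"Platform MTBF: {mtbf:.1f} minutes")  # prints "Platform MTBF: 52.6 minutes"
```

Under these assumptions, one million components with an individual MTBF of 100 years give a platform MTBF of about 52.6 minutes, consistent with the order of magnitude quoted above.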

Large-scale distributed systems face a second important challenge: power consumption. Power management is necessary due to both monetary and environmental constraints. Presently, large computing centers are among the largest consumers of energy. Energy is needed to power the individual cores and also to cool the system. In future distributed systems, it is anticipated that the power dissipated to perform communications and I/O transfers will make up a much larger share of the overall power consumption. In fact, the relative cost of communication is expected to increase dramatically, in terms of both latency/overhead and consumed energy.

Outline of the class:

Techniques and scheduling algorithms for resilience
- Introduction to resilience
- Handling failures by adding redundancy (replicating some work)
- Checkpointing and recovery techniques
- Predicting failures
- Handling silent errors with verifications

Adding energy into the picture
- Why is energy consumption an important topic?
- Scheduling algorithms using dynamic voltage and frequency scaling (DVFS)
- Static power and switching off to reduce energy consumption
- Energy consumption and resilience

Prerequisites: Knowledge of classical algorithmic techniques (dynamic programming, greedy algorithms) and of complexity analysis (NP-completeness, approximation algorithms).

Evaluation: Each student will be given two research articles to synthesize, compare, and critique. This work will be evaluated through a written report and an oral presentation.

E-Mail: Anne.Benoit at ens-lyon.fr

Last modified: Mon Sep 21 15:48:34 CEST 2015