Master 2 course for 2016-2017: CR02 - Resilient and energy-aware scheduling algorithms

Lecturer: Anne Benoit; Classes: Mondays, 13h15-15h15, room B1

Course notes and slides:

Final evaluation: The final evaluation for this course is based on a bibliographical study of one or two research articles chosen from the list below (two articles for topics partially covered during the course). Each student must choose a topic and the corresponding article(s) and notify the teacher by e-mail (Anne.Benoit at ens-lyon.fr); topics are assigned on a first-come, first-served basis. As soon as a topic has been picked, it will be indicated on this page. Each student then writes a report (4 to 8 pages) presenting and commenting on the chosen article(s), and presents it in class (a 15-minute presentation followed by 5 minutes of questions). The report must be sent by Sunday, December 11 at the latest, and presentations will be held on December 12 and December 13. The final grade will be based on the report, the presentation, and participation in the question sessions. In addition, once all presentations are done, each student will be asked to write a short review (less than one page) of one of the other presentations.

Motivation: Large-scale distributed systems cover a wide range of platforms, such as high-performance supercomputers, sensor networks, volunteer computing grids, etc. All these systems operate at a very large scale and can include millions of components. Such a large scale induces two major problems: resilience and energy consumption.

Resilience is (loosely) defined as surviving failures. For instance, a failure will occur about every 50 minutes in a system with one million components, even if the MTBF (mean time between failures) of a single component is as large as 100 years. Obviously, failure handling is critical for highly parallel applications that use a large number of components for a significant amount of time, because such applications are likely to experience at least one failure during execution. But failure handling cannot be ignored for other applications either.
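The 50-minute figure can be checked with a short calculation. The sketch below assumes that components fail independently and identically, so that the MTBF of the whole platform is the MTBF of one component divided by the number of components; the function name and the 365-day year are illustrative choices, not part of the course material.

```python
# Assumption: independent, identical components, so the platform MTBF
# is the single-component MTBF divided by the number of components.
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def platform_mtbf_minutes(component_mtbf_years, num_components):
    """Return the mean time between failures of the whole platform, in minutes."""
    return component_mtbf_years * MINUTES_PER_YEAR / num_components


if __name__ == "__main__":
    # One million components, each with an MTBF of 100 years.
    print(platform_mtbf_minutes(100, 1_000_000))
```

Under these assumptions the result is a little under 53 minutes, which matches the order of magnitude quoted above.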

Large-scale distributed systems face a second important challenge: power consumption. Power management is necessary due to both monetary and environmental constraints. Presently, large computing centers are among the largest consumers of energy. Energy is needed to power the individual cores and also to cool the system. In future distributed systems, it is anticipated that the power dissipated to perform communications and I/O transfers will make up a much larger share of the overall power consumption. In fact, the relative cost of communication is expected to increase dramatically, both in terms of latency/overhead and of consumed energy.

Outline of the class:

Prerequisites: Knowledge of classical algorithmic techniques (dynamic programming, greedy algorithms) and of complexity analysis (NP-completeness, approximation algorithms).

Evaluation: Each student will be given two research articles to synthesize, compare, and critique. This work will be evaluated through a written report and an oral presentation. In addition, there will be some homework during the semester.

E-Mail: Anne.Benoit at ens-lyon.fr
Last modified: Mon Dec 5 23:59:50 CET 2016