Motivation:
Large-scale distributed systems span a wide range of platforms, such as
high-performance supercomputers, sensor networks, and volunteer computing grids.
All these systems operate at very large scale and can include millions of components.
Such a scale induces two major problems: resilience and energy consumption.
Resilience is (loosely) defined as the ability to survive failures.
For instance, a failure will occur roughly every 50 minutes on average in a system with one million components,
even if the MTBF (mean time between failures) of a single component is as large as 100 years.
Failure handling is therefore critical for highly parallel applications that use a large number of components
for a significant amount of time, because such applications are likely to experience at least one failure
during execution. But failure handling cannot be ignored for other applications either.
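The 50-minute figure above can be checked with a short calculation. The sketch below assumes that components fail independently and at the same rate, so that the MTBF of the whole platform is the MTBF of one component divided by the number of components; the function name and the 365-day year are illustrative choices, not part of the course material.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def platform_mtbf_minutes(component_mtbf_years: float, num_components: int) -> float:
    """Platform MTBF in minutes, assuming independent, identical components.

    Under that assumption, the platform MTBF is the MTBF of a single
    component divided by the number of components.
    """
    return component_mtbf_years * MINUTES_PER_YEAR / num_components


if __name__ == "__main__":
    mtbf = platform_mtbf_minutes(component_mtbf_years=100, num_components=1_000_000)
    print(f"Platform MTBF: {mtbf:.1f} minutes")  # about 52.6 minutes
```

With a component MTBF of 100 years and one million components, this gives about 52.6 minutes, consistent with the rounded value of 50 minutes quoted above.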
Large-scale distributed systems face a second important challenge: power consumption.
Power management is necessary due to both monetary and environmental constraints.
At present, large computing centers are among the largest consumers of energy.
Energy is needed to provide power to the individual cores and also to provide cooling for the system.
In future distributed systems, it is anticipated that the power dissipated to perform communications
and I/O transfers will make up a much larger share of the overall power consumption.
In fact, the relative cost of communication is expected to increase dramatically,
both in terms of latency/overhead and of consumed energy.
Outline of the class:
Techniques and scheduling algorithms for resilience
  Introduction to resilience
  Handling failures by adding redundancy (replicating some work)
  Checkpointing and recovery techniques
  Predicting failures
  Handling silent errors with verifications
Adding energy into the picture
  Why is energy consumption an important topic?
  Scheduling algorithms using dynamic voltage and frequency scaling (DVFS)
  Static power and switching off to reduce energy consumption
  Energy consumption and resilience
Prerequisites: Knowledge of classical algorithmic techniques (dynamic
programming, greedy algorithms) and of complexity analysis
(NP-completeness, approximation algorithms).
Evaluation:
Each student will be given two research articles to synthesize, compare, and criticize. This work will be evaluated through a written report and an oral presentation.