An overview of fault-tolerant techniques for HPC
Thomas Hérault & Yves Robert
ICS Tutorial -- Eugene, OR -- June 10, 2013
Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey on
fault-tolerant techniques for high-performance computing. It is organized along four main topics:
(i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions
(Exponential, Weibull, Log-Normal);
(ii) Application-specific techniques, such as
ABFT for grid-based algorithms or fixed-point convergence for iterative applications;
(iii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication; and
(iv) Relevant execution scenarios, evaluated and compared through quantitative models (from Young's approximation
to Daly's formulas and recent work).
The half-day tutorial is open to all ICS 2013 attendees who are interested in the current status and expected promise of fault-tolerant approaches
for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models.
Only the last part of the tutorial, devoted to assessing the future of the methods, will involve more advanced analysis tools.
Preliminary version of tutorial slides
This tutorial was given at SC'12. See the
slides of the SC'12 tutorial.
This tutorial will present a comprehensive survey on the techniques
proposed to deal with failures in high performance systems. The main goal is to provide the attendees
with a clear picture of this important topic: what are the techniques, how do they work, and how can they
be evaluated? The tutorial is organized in four parts:
- Overview of failure types: we will start the tutorial by
discussing the failure types: how failures happen, what kinds of
faults must be handled, and how they can be characterized.
- Application-specific techniques: we will discuss how applications
can be modified to take advantage of internal properties and tolerate
failures. We will introduce techniques and tools to extend the limited
fault tolerance capability of programming middleware, and multiple
application examples will be presented, ranging from simple bag of
tasks to ABFT linear algebra.
- General-purpose techniques: when application-specific techniques
are too complex to introduce, or inefficient, system-level general
techniques that tolerate failures in any kind of
application can be considered. These approaches, however, are diverse
and come with various costs. We will present the major system-level
approaches (replication, and many techniques for rollback/recovery).
- Probabilistic models and execution scenarios: in the last part, we
will present a performance model for these approaches and discuss
relevant execution scenarios. These
scenarios will be evaluated and compared through quantitative models
(from Young's approximation to Daly's formulas and recent work).
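As a rough illustration of the quantitative models mentioned in the last item, the sketch below computes the checkpointing period suggested by Young's first-order approximation and by Daly's higher-order estimate. The function names, the parameter names, and the numerical values (a 10-minute checkpoint, a 1-day platform MTBF) are assumptions chosen for the example, not figures from the tutorial. Conventions also differ between papers on whether the period includes the checkpoint itself; here it denotes the compute time between two checkpoints.

```python
import math


def young_period(checkpoint_cost, mtbf):
    """Young's first-order approximation: T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def daly_period(checkpoint_cost, mtbf):
    """Daly's higher-order estimate of the compute time between checkpoints.

    For C < 2M: T = sqrt(2CM) * (1 + sqrt(C/(2M))/3 + (C/(2M))/9) - C.
    For C >= 2M the estimate degenerates to T = M.
    """
    c, m = checkpoint_cost, mtbf
    if c >= 2.0 * m:
        return m
    ratio = c / (2.0 * m)
    return math.sqrt(2.0 * c * m) * (1.0 + math.sqrt(ratio) / 3.0 + ratio / 9.0) - c


if __name__ == "__main__":
    # Hypothetical values, in seconds: 10-minute checkpoint, 1-day MTBF.
    C, M = 600.0, 86400.0
    print("Young:", young_period(C, M))
    print("Daly: ", daly_period(C, M))
```

Both estimates stay close as long as the checkpoint cost is small compared to the MTBF; they drift apart when failures become frequent, which is the regime where the more recent models matter.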
Relevance to ICS 2013 attendees
Reliability is one of the major concerns when envisioning the future
exascale platforms. The IESP (see http://www.exascale.org)
projects an increase in node
performance and node concurrency by one or two orders of magnitude,
which translates into a mechanical decrease of the mean time to
interruption of at least one order of magnitude. Because of this
tendency, platform providers, software implementors, and
high-performance application users who target capability runs on such
machines cannot disregard the occurrence of interruption due to a
failure as a rare dramatic event, but must take them into account by
In this tutorial, we will present a
comprehensive survey on the techniques proposed to deal with failures
in high performance systems. At the end of the tutorial, each attendee will know
the available techniques, and will be able to select the one best suited to his or her application.
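The link between platform scale and mean time to interruption can be sketched with a simple model. The sketch assumes that the platform is made of identical components that fail independently with exponentially distributed inter-arrival times, in which case the platform MTBF is the component MTBF divided by the number of components. The 10-year component MTBF and the component counts are hypothetical values picked for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def platform_mtbf(component_mtbf, n_components):
    """Platform MTBF under independent, identical, exponential failures.

    The failure rates of the components add up, so the platform-level
    MTBF is the component MTBF divided by the number of components.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf / n_components


if __name__ == "__main__":
    component_mtbf = 10 * HOURS_PER_YEAR  # hypothetical: 10 years, in hours
    for n in (10_000, 100_000, 1_000_000):
        print(n, "components ->", platform_mtbf(component_mtbf, n), "hours")
```

Under this assumption, multiplying the number of components by ten divides the platform MTBF by ten.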
Target audience, Content Level & Prerequisites
Target audience: any person (academic, student, engineer) interested in the topic, either because he or she is developing
HPC applications, or because he or she wants to get acquainted with this important topic.
Content level: Beginner 70%, Intermediate 20%, Advanced 10%.
Prerequisites: none beyond basic knowledge of parallel applications (which typical ICS attendees are expected to have).
General description of tutorial content
The first part of the tutorial will
present the failure types, and the probability laws that govern them,
as they can be extracted from publicly available histories and error
logs of recent high-performance computers, and extrapolated for future
exascale machines. From a qualitative point of view, failures can be
categorized along two axes: temporality and gravity of impact. On the
temporal axis, a failure can be permanent, infrequent, or intermittent;
on the gravity axis, it can cause a process to cease functioning or
corrupt its memory. A fail-stop failure is a permanent
cessation of functioning of one or multiple processes. When a
memory corruption hits the system infrequently, a transient soft error
occurs; if it hits the system intermittently, or if it hits the
program itself, Byzantine (arbitrary) behavior can be expected from
the failed processes.
The literature provides a classification of the impact of failures:
many problems cannot be solved in the presence of intermittent memory
corruptions, and some problems remain theoretically impossible to
solve in the presence of permanent (fail-stop) failures. However, on the
practical side, machines can be expected to experience
some permanent failures and infrequent memory corruptions, whose
probability of occurrence can be characterized by well-known probability
distributions. The rest of the tutorial will focus on the most
frequent failure types, and specifically on fail-stop behaviors.
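To make the role of the probability distributions concrete, the sketch below draws failure inter-arrival times from the three laws named in the abstract (Exponential, Weibull, Log-Normal), each parameterized to have the same mean. The 24-hour mean, the Weibull shape of 0.7, and the Log-Normal sigma of 1.0 are hypothetical values chosen for the example; they are not taken from any failure log.

```python
import math
import random

MEAN_HOURS = 24.0  # hypothetical mean time between failures


def exponential_sampler(mean):
    """Exponential law with the given mean (rate = 1 / mean)."""
    return lambda rng: rng.expovariate(1.0 / mean)


def weibull_sampler(mean, shape):
    """Weibull law; the scale is chosen so that the mean matches."""
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return lambda rng: rng.weibullvariate(scale, shape)


def lognormal_sampler(mean, sigma):
    """Log-Normal law; mu is chosen so that exp(mu + sigma^2 / 2) = mean."""
    mu = math.log(mean) - sigma * sigma / 2.0
    return lambda rng: rng.lognormvariate(mu, sigma)


def empirical_mean(sampler, n=100_000, seed=0):
    """Average of n sampled inter-arrival times."""
    rng = random.Random(seed)
    return sum(sampler(rng) for _ in range(n)) / n


def short_gap_fraction(sampler, threshold, n=100_000, seed=0):
    """Fraction of inter-arrival times shorter than the threshold."""
    rng = random.Random(seed)
    return sum(sampler(rng) < threshold for _ in range(n)) / n


if __name__ == "__main__":
    laws = {
        "Exponential": exponential_sampler(MEAN_HOURS),
        "Weibull (shape 0.7)": weibull_sampler(MEAN_HOURS, 0.7),
        "Log-Normal (sigma 1.0)": lognormal_sampler(MEAN_HOURS, 1.0),
    }
    for name, sampler in laws.items():
        print(name, empirical_mean(sampler), short_gap_fraction(sampler, 1.0))
```

With equal means, the laws still differ in how often failures arrive in quick succession, which is why the choice of distribution matters when evaluating a fault-tolerance protocol.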
Given a fault-tolerant middleware -- i.e., a programming paradigm that
allows correct processes of a parallel application to continue their
execution after some of them are subject to failures -- applications
can often use mathematical properties of the problem they are solving,
or specific techniques embedded in the application, to tolerate
failures. The tutorial starts by presenting how the simplest
algorithms (bags of tasks, master-worker) can be modified to tolerate
some failures, then how iterative refinement on domain decomposition
can be handled, and finally how Algorithm-Based Fault Tolerance (ABFT)
techniques are used in dense linear algebra. The tutorial will
briefly present these techniques, and compare them qualitatively,
before introducing how they are modeled.
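The checksum idea behind ABFT can be sketched on a small matrix product. The example follows the classical scheme in which a checksum row is appended to A and a checksum column to B, so that the product carries both row and column checksums; a single corrupted entry of the result can then be located and rebuilt. This is a minimal sketch on plain Python lists, with hypothetical function names, and not necessarily the variant presented in the tutorial.

```python
def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of its entries."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_and_fix(c_full, tol=1e-9):
    """Detect and repair one corrupted data entry of a full-checksum matrix.

    c_full has n data rows plus a checksum row, and m data columns plus a
    checksum column. Returns the (row, column) of the repaired entry, or
    None if all checksums agree. Other error patterns raise ValueError.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern not correctable by this sketch")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from the row checksum and the other entries of the row.
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
    return (i, j)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[0][1] = 99  # inject a single error in the data part
    print("repaired entry:", locate_and_fix(c_full))
    print("data part:", [row[:2] for row in c_full[:2]])
```

The same checksum relations can in principle be used to rebuild data lost to a fail-stop failure rather than corrupted in place, since the missing entries are then the only unknowns in the checksum equations.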
More general approaches to fault-tolerance have been proposed, based
on rollback-recovery or replication. In its third part, the tutorial
will present these techniques, which can be categorized into different
groups. Replication can be active or passive, and may or may not
involve the user code. It can be used to ensure the reliability of the computation
itself, or simply to ensure the storage of redundant
data. Rollback-recovery is a well-studied general-purpose approach,
and multiple protocols have been proposed. The tutorial will study
many aspects of these protocols, especially their coordination, their
composition, their behavior at rollback time, and how they can save
the checkpointing information. This part will conclude with models for
the main categories of protocols, to allow a fair comparison between
Comparison of techniques.
This comparison concludes the tutorial. It will
instantiate the proposed models, on different existing or envisioned
architectures, to evaluate their projected relative
performance. During this part, a summary of the qualitative benefits
and drawbacks of the different protocols will also provide an
additional tool for users when it is time to choose what approach they
will use to tolerate failures in their systems.
Both presenters work at the University of Tennessee. They will build on their complementary expertise
to give a well-balanced tutorial.
Thomas Hérault is an expert in fault-tolerance protocols, MPI, and
middleware for high-performance computing.
Yves Robert is an expert in high-performance computing, ABFT techniques, scheduling, probability
theory, and stochastic models.
This is the second time this tutorial is being proposed. The first edition was given at SC 2012 in Salt Lake City, with the same length
and format. Attendance was 60+ participants.
Both the contents of the tutorial and the presentation slides will be improved for this second edition at ICS 2013.
We are committed to making the important topic of fault-tolerance for HPC available to ICS attendees
in the best possible format.
An overview of fault-tolerant techniques for HPC
- Introduction & motivation (20 min)
  - Large-scale computing platforms
  - Failure everywhere and more frequent
  - Failure types
  - Failure probability distributions
- Application-specific fault-tolerance techniques (40 min)
  - Bags of tasks
  - ABFT for grid-based applications
  - Iterative algorithms and fixed-point convergence
- General-purpose fault-tolerance techniques (1h)
  - Fault-tolerant MPI
  - Coordinated checkpointing
  - Message logging
  - Un-coordinated checkpointing
- Probabilistic models and execution scenarios (1h)
  - Young approximation
  - Daly's formula
  - Checkpointing parallel jobs
  - Platform yield
- Conclusion (20 min)
  - Lessons learned
  - Bibliographic pointers for further reference
  - Perspectives: resilience at exascale
Durations are indicative. They are based on a total of 3h20, plus a short 10-minute break, to match
the tutorial length of 3h30. Questions from participants will be taken on the fly.