An overview of fault-tolerant techniques for HPC

Thomas Hérault & Yves Robert

ICS Tutorial -- Eugene, OR -- June 10, 2013


Abstract

Resilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance computing. It is organized along four main topics: (i) An overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal); (ii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications; (iii) General-purpose techniques, which include several checkpoint and rollback recovery protocols, possibly combined with replication; and (iv) A quantitative evaluation and comparison of relevant execution scenarios through models (from Young's approximation to Daly's formulas and recent work). The half-day tutorial is open to all ICS 2013 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. Only the last part of the tutorial, devoted to assessing the future of the methods, will involve more advanced analysis tools.

Preliminary version of tutorial slides

This tutorial was first given at SC'12; see the slides of the SC'12 tutorial.

Detailed description

Tutorial goals

This tutorial will present a comprehensive survey of the techniques proposed to deal with failures in high-performance systems. The main goal is to provide the attendees with a clear picture of this important topic: what are the techniques, how do they work, and how can they be evaluated? The tutorial is organized in four parts, described in detail below.

Relevance to ICS 2013 attendees

Reliability is one of the major concerns when envisioning future exascale platforms. The IESP (see http://www.exascale.org) projects an increase in node performance and node concurrency by one or two orders of magnitude, which translates into a mechanical decrease of the mean time to interruption of at least one order of magnitude. Because of this trend, platform providers, software implementors, and high-performance application users who target capability runs on such machines cannot disregard interruptions due to failures as rare, dramatic events, but must take them into account through fault tolerance. In this tutorial, we will present a comprehensive survey of the techniques proposed to deal with failures in high-performance systems. At the end of the tutorial, each attendee will know the available techniques, and will be able to select the one best suited to his or her application.
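The "mechanical decrease" above follows from a simple observation: with independent, exponentially distributed node failures, the platform mean time to interruption (MTTI) is the per-node MTBF divided by the node count. A short sketch with hypothetical numbers (the 10-year node MTBF is an assumption for illustration, not a figure from the tutorial):

```python
# Illustrative only: independent, exponentially distributed node failures
# give a platform MTTI equal to the node MTBF divided by the node count.
NODE_MTBF_YEARS = 10          # assumed per-node MTBF (hypothetical)
HOURS_PER_YEAR = 365 * 24

for nodes in (10_000, 100_000, 1_000_000):
    mtti_hours = NODE_MTBF_YEARS * HOURS_PER_YEAR / nodes
    print(f"{nodes:>9} nodes -> platform MTTI ~ {mtti_hours:.2f} h")
```

Each tenfold increase in node count cuts the platform MTTI by an order of magnitude, from hours down to minutes at exascale-like node counts.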

Target audience, Content Level & Prerequisites

Target audience: anyone (academic, student, engineer) interested in the topic, either because he or she develops HPC applications, or because he or she wishes to get acquainted with this important subject.

Content Level: Beginner 70%, Intermediate 20%, Advanced 10%.

Prerequisites: none beyond basic knowledge of parallel applications (that typical ICS attendees are expected to have).

General description of tutorial content

Failure types. The first part of the tutorial will present the failure types, and the probability laws that govern them, as they can be extracted from publicly available histories and error logs of recent high-performance computers, and extrapolated for future exascale machines. From a qualitative point of view, failures can be categorized along two axes: temporality and gravity of impact. On the temporal axis, a failure can be permanent, infrequent, or intermittent; on the gravity axis, it can cause a component to cease functioning, or corrupt its memory. A fail-stop failure is a permanent cessation of functioning of one or multiple processes. When a memory corruption hits the system infrequently, a transient soft error occurs; if it hits the system intermittently, or if it hits the program itself, Byzantine (arbitrary) behavior can be expected from the failed processes. The literature provides a classification of the impact of failures: many problems cannot be solved when confronted with intermittent memory corruptions, and some problems remain theoretically impossible to solve in the presence of permanent crashes. On the practical side, however, machines can be expected to face some permanent failures and infrequent memory corruptions, whose probability of occurrence can be characterized by well-known probability distributions. The rest of the tutorial will focus on the most frequent failure types, and specifically on fail-stop behaviors.
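To make the probability laws concrete, here is a minimal sketch that draws inter-failure times from the two distributions most often fitted to HPC failure logs. All parameters (the 10-year node MTBF, the Weibull shape of 0.7) are illustrative assumptions, not values from the tutorial:

```python
import math
import random

# Hypothetical node MTBF of 10 years, expressed in hours.
MTBF_HOURS = 10 * 365 * 24

random.seed(42)

def sample_failures(n, dist):
    """Draw n inter-failure times (in hours) from the given distribution."""
    if dist == "exponential":
        # Memoryless: constant failure rate 1 / MTBF.
        return [random.expovariate(1.0 / MTBF_HOURS) for _ in range(n)]
    if dist == "weibull":
        # Shape k < 1 gives a decreasing hazard rate ("infant mortality"),
        # a pattern often reported for real systems.
        k = 0.7
        # Scale chosen so the mean equals MTBF_HOURS:
        # mean = scale * Gamma(1 + 1/k).
        scale = MTBF_HOURS / math.gamma(1.0 + 1.0 / k)
        return [random.weibullvariate(scale, k) for _ in range(n)]
    raise ValueError(dist)

exp_mean = sum(sample_failures(100_000, "exponential")) / 100_000
wei_mean = sum(sample_failures(100_000, "weibull")) / 100_000
print(round(exp_mean), round(wei_mean))  # both means are close to MTBF_HOURS
```

Both distributions can share the same mean, yet the Weibull with shape below 1 produces many short inter-failure gaps mixed with a few very long ones, which is why the choice of law matters when dimensioning checkpoint intervals.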

Application-specific techniques. Given a fault-tolerant middleware (i.e., a programming paradigm that allows correct processes of a parallel application to continue their execution after some of them are subject to failures), applications can often use mathematical properties of the problem they are solving, or specific techniques embedded in the application, to tolerate failures. The tutorial starts by presenting how the simplest algorithms (bag-of-tasks, master-worker) can be modified to tolerate some failures, then how iterative refinement on domain decomposition can be handled, and finally how Algorithm-Based Fault Tolerance (ABFT) techniques are used in dense linear algebra. The tutorial will briefly present these techniques and compare them qualitatively, before introducing how they are modeled.
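A minimal sketch of the checksum idea behind ABFT for dense linear algebra (illustrative, not the tutorial's exact construction): augment A with an extra row of column sums before the product; the sums are preserved by matrix multiplication, so a corrupted entry of the result can be detected by re-summing.

```python
# Toy ABFT checksum sketch on a 2x2 example (pure Python, no libraries).

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def add_checksum_row(A):
    # Append one extra row holding the sum of each column of A.
    return A + [[sum(col) for col in zip(*A)]]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

C = matmul(add_checksum_row(A), B)   # (n+1) x p result

# Invariant: the last row of C equals the column sums of the true product.
data, check = C[:-1], C[-1]
col_sums = [sum(col) for col in zip(*data)]
print(col_sums == check)  # True: no fault occurred

# Simulate a soft error corrupting one entry, then detect its column:
data[0][0] += 100.0
col_sums = [sum(col) for col in zip(*data)]
bad_cols = [j for j, (s, c) in enumerate(zip(col_sums, check))
            if abs(s - c) > 1e-9]
print(bad_cols)  # [0]
```

Full ABFT schemes add checksum columns to B as well, which localizes a corrupted entry to a (row, column) intersection and allows it to be corrected rather than merely detected.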

General-purpose techniques. More general approaches to fault tolerance have been proposed, based on rollback-recovery or replication. In its third part, the tutorial will present these techniques, which can be categorized in different groups. Replication can be active or passive, and may or may not involve the user code; it can be used to ensure the reliability of the computation itself, or simply to ensure the storage of redundant data. Rollback-recovery is a well-studied general-purpose approach, and multiple protocols have been proposed. The tutorial will study many aspects of these protocols, especially their coordination, their composition, their behavior at rollback time, and how they store the checkpoint data. This part will conclude with models for the main categories of protocols, allowing a fair comparison between the techniques.
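The rollback-recovery trade-off can be illustrated with a toy simulation (all names and parameters are hypothetical, and the model is deliberately simplified): work advances in unit steps, a checkpoint of fixed cost is taken periodically, and a fail-stop failure rolls execution back to the last checkpoint.

```python
import random

random.seed(1)

def run(total_work, period, ckpt_cost, fail_prob):
    """Simulate periodic checkpointing with rollback on failure.

    Returns the total elapsed time to complete total_work unit steps.
    """
    time, done, last_ckpt = 0.0, 0, 0
    while done < total_work:
        if random.random() < fail_prob:   # a fail-stop failure strikes
            done = last_ckpt              # roll back to the last checkpoint
        else:
            done += 1
            if done % period == 0:
                time += ckpt_cost         # pay the checkpoint cost
                last_ckpt = done
        time += 1.0
    return time

elapsed = run(1_000, period=50, ckpt_cost=5.0, fail_prob=0.01)
print(elapsed)  # always more than the 1000 steps of useful work
```

Shrinking the period reduces the work lost per rollback but increases the checkpointing overhead; the quantitative models presented in the last part of the tutorial formalize exactly this balance.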

Comparison of techniques. This comparison concludes the tutorial. It will instantiate the proposed models on different existing or envisioned architectures, to evaluate the projected relative performance of the techniques. A summary of the qualitative benefits and drawbacks of the different protocols will also give users an additional tool when choosing which approach to use to tolerate failures in their systems.
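As a taste of the models instantiated in this part, here is a sketch of the commonly quoted first-order forms of Young's and Daly's optimal checkpoint periods (the numeric inputs are hypothetical; C is the checkpoint cost and M the platform MTBF, in the same time unit):

```python
import math

def young(C, M):
    # Young's first-order approximation of the optimal period.
    return math.sqrt(2 * C * M)

def daly(C, M):
    # Daly's higher-order refinement, valid for C < 2M.
    x = C / (2 * M)
    return math.sqrt(2 * C * M) * (1 + math.sqrt(x) / 3 + x / 9) - C

C, M = 600.0, 86_400.0   # e.g. a 10-minute checkpoint, a one-day MTBF
print(round(young(C, M)))  # -> 10182 (seconds between checkpoints)
print(round(daly(C, M)))   # slightly shorter than Young's period
```

Both formulas grow with the square root of the MTBF, which is why the projected order-of-magnitude drop in MTTI at exascale forces much more frequent checkpoints and motivates the comparison carried out in this part.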

Presenters

Both presenters work at the University of Tennessee. They will build on their complementary expertise to give a well-balanced tutorial. Thomas Hérault is an expert in fault-tolerance protocols, MPI, and middleware for high-performance computing. Yves Robert is an expert in high-performance computing, ABFT techniques, scheduling, probability theory, and stochastic models.

This is the second time this tutorial is being proposed. The first edition was given at SC 2012 in Salt Lake City, with the same length and format, and drew over 60 participants. Both the contents of the tutorial and the presentation slides will be improved for this second edition at ICS 2013. We are committed to making the important topic of fault tolerance for HPC accessible to ICS attendees in the best possible format.

Detailed outline

Durations are indicative. They are based on a total of 3h20, plus a short 10-minute break, to match the tutorial length of 3h30. Questions from participants will be taken on the fly.