Keystone Associate Team, 2016-2018

Scheduling algorithms for sparse linear algebra at extreme scale

Principal investigator (Inria): Anne Benoit

Principal investigator (Vanderbilt University): Padma Raghavan

Members:

Abstract: The Keystone project aims at investigating sparse matrix and graph problems on NUMA multicores and/or hybrid CPU-GPU platforms. The goal is to improve the performance of the algorithms, while accounting for failures and minimizing energy consumption. The long-term objective is to design robust sparse-linear kernels for computing at extreme scale. To optimize the performance of these kernels, we plan to take particular care of locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in applicative contexts. We believe that the complementary expertise of the two teams in the area of scheduling HPC applications at scale (ROMA: models and complexity; Vanderbilt: architecture and applications) is the key to the success of this associate team.

Joint publications:
Scientific achievements:

The long-term objective of this Keystone Associate Team is to design robust sparse-linear kernels for computing at extreme scale. Achieving this goal is fundamental for HPC applications, since it is estimated that supercomputers spend half their time solving linear systems, a vast majority of them sparse. However, it is a difficult goal that requires solving several challenges dealing with application-specific resilience techniques, silent error detection and mitigation, energy consumption, data locality, load balancing, solver robustness, etc. Most likely, next-generation kernels will not resemble current ones: they will have different preconditioners, embedded verification mechanisms, extended work-stealing capabilities, etc. This is a whole world of new paradigms to explore. While we will not solve all of these problems by ourselves, we are well qualified to address some of them, thanks to our complementary expertise covering most of the algorithmic, software and hardware aspects of this challenging goal.
During the three years of the Associate Team, we have made considerable progress on the understanding of co-scheduling challenges, resilience, and scheduling. We have organized the scientific achievements of Keystone into five sections: resilience techniques, co-scheduling algorithms, replication to handle errors, list scheduling vs. pack scheduling, and high-performance neuroscience applications.

1. Resilience techniques for scheduling algorithms. In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. We then explain how to combine checkpointing with fault prediction, and discuss how the optimal period is modified when this combination is used. Finally, we follow the very same approach for the combination of checkpointing with replication. This work, entitled Scheduling for fault-tolerance: an introduction, corresponds to Inria RR-8971 (November 2016) and has been accepted as a chapter of a forthcoming CDER book (NSF/IEEE-TCPP Curriculum Initiative: Topics in Parallel and Distributed Computing) on teaching material for parallel and distributed computing. The book is expected to be published by Springer in 2018 (see [J-1]).
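For checkpointing alone, the first-order analysis leads to the classical Young/Daly period. A minimal numerical sketch (the parameter values are illustrative, not taken from the report):

```python
import math

def youngdaly_period(checkpoint_cost, mtbf):
    """First-order optimal checkpointing period (Young/Daly):
    W_opt = sqrt(2 * C * mu), with C the checkpoint cost and
    mu the platform Mean Time Between Failures."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste(period, checkpoint_cost, mtbf):
    """First-order expected waste fraction: time spent checkpointing,
    plus the expected half-period of lost work after a failure."""
    return checkpoint_cost / period + period / (2.0 * mtbf)

# illustrative values: 1-minute checkpoint, 1-day platform MTBF
C, mu = 60.0, 24 * 3600.0
w_opt = youngdaly_period(C, mu)
print(w_opt)            # ~3220 s, i.e. checkpoint roughly every 54 minutes
print(waste(w_opt, C, mu))  # ~3.7% waste at the optimal period
```

Deviating from the optimal period in either direction only increases the waste, which is what the `waste` function lets one verify numerically.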

2. Co-scheduling algorithms for cache-partitioned systems. Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this work, we provide answers to (i) and (ii) for both Amdahl applications and perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed. Both a workshop paper (perfectly parallel applications and rational number of processors, [J-2]) and a journal paper (extending to Amdahl applications and integer number of processors, [J-3]) have been published in 2017 with these results.
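The trade-off behind questions (i) and (ii) can be illustrated with a toy model (not the model of the papers): Amdahl applications whose execution is further slowed down, linearly, when their share of the LLC shrinks, and an exhaustive search over integer processor counts and cache slots. The `sensitivity` slowdown term and all numerical parameters are illustrative assumptions:

```python
from itertools import product

def exec_time(work, seq_frac, procs, cache_frac, sensitivity):
    """Hypothetical model: Amdahl execution time, inflated linearly
    when the application receives a small fraction of the LLC."""
    amdahl = seq_frac + (1.0 - seq_frac) / procs
    slowdown = 1.0 + sensitivity * (1.0 - cache_frac)
    return work * amdahl * slowdown

def best_coschedule(apps, total_procs, cache_slots):
    """Exhaustive search over integer processor counts and cache slots
    for each application; returns the allocation that minimizes the
    makespan (maximum completion time). Only practical for small n."""
    n = len(apps)
    best = (float("inf"), None)
    for procs in product(range(1, total_procs + 1), repeat=n):
        if sum(procs) != total_procs:
            continue
        for slots in product(range(cache_slots + 1), repeat=n):
            if sum(slots) != cache_slots:
                continue
            mk = max(exec_time(w, s, p, sl / cache_slots, a)
                     for (w, s, a), p, sl in zip(apps, procs, slots))
            if mk < best[0]:
                best = (mk, (procs, slots))
    return best

# two illustrative Amdahl applications: (work, seq. fraction, cache sensitivity)
apps = [(100.0, 0.05, 0.8), (60.0, 0.20, 0.1)]
makespan, (procs, slots) = best_coschedule(apps, total_procs=8, cache_slots=4)
```

Unsurprisingly, the search gives more cache slots to the cache-sensitive application; the heuristics of the papers avoid this exponential enumeration.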
Following these encouraging theoretical and simulated results, we have gained access to a machine on which it is possible to experiment with cache partitioning. Indeed, the Cache Allocation Technology (CAT) is a recently released Intel technology, part of the Intel Resource Director Technology (RDT). Basically, CAT is an interface for the operating system to group applications into classes of service (COS). Each class of service describes the amount of resources that the applications assigned to that class can use; resources can be an amount of cache or of memory bandwidth. Building upon this technology, we are therefore able to partition the cache.
We run several applications concurrently on an Intel Xeon E5-2650L v4 with 14 processing units and a shared 35 MB LLC. Slots of 5% of the cache can be allocated to applications. We consider six classical HPC applications from the NAS Parallel Benchmark suite (D.H. Bailey et al, The NAS Parallel Benchmarks: Summary and Preliminary Results, Supercomputing '91): CG (Conjugate Gradient), BT (Block Tridiagonal), LU (Lower-Upper symmetric Gauss-Seidel), SP (Scalar Pentadiagonal), MG (MultiGrid) and FT (Fast Fourier Transform). In order to conduct real experiments, we have focused on iterative applications running concurrently, and we have answered the following questions: (i) how to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize the platform efficiency? Here, platform efficiency is defined either as maximizing performance globally, or as guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms. These experimental results have been published at Cluster'2018 [J-5], and presented both in Belfast for the conference and at the CCDSC'2018 workshop.
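Under CAT, each class of service is configured by a capacity bitmask over the cache ways, and Intel requires each mask to be a contiguous run of bits. A minimal sketch of turning per-application slot counts into disjoint COS masks (the 20 slots of 5% match the 35 MB LLC above; the helper itself is hypothetical, and a real deployment would program the masks via the pqos utility or MSR writes):

```python
def cos_masks(slot_counts, total_slots=20):
    """Build contiguous capacity bitmasks, one per class of service
    (COS), from the number of 5% cache slots given to each class.
    Intel CAT requires each mask to be a contiguous run of set bits."""
    assert sum(slot_counts) <= total_slots, "cannot overcommit the LLC"
    masks, offset = [], 0
    for n in slot_counts:
        masks.append(((1 << n) - 1) << offset)  # n contiguous bits
        offset += n                             # next class starts higher
    return masks

# e.g. three co-scheduled applications getting 50%, 30% and 20% of the LLC
masks = cos_masks([10, 6, 4])
print([format(m, "020b") for m in masks])
```

Because the masks are built over disjoint bit ranges, the resulting classes of service never share cache ways, which is exactly the exclusive-partitioning setting studied in the experiments.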

3. Using replication to handle fail-stop and silent errors on large-scale platforms. With Hongyang Sun joining Keystone in 2017 as a researcher at Vanderbilt University, we have strengthened our collaboration on resilience techniques. We have worked on providing a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors, for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, either based on algorithms (ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage, but makes it possible to tolerate more errors and to correct some silent errors; hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes: each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times; the platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. If not, one or more silent errors have been detected, and the application rolls back to the last checkpoint; the same happens when a fail-stop error strikes.
We have provided a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We have also reported a set of extensive simulation results that nicely corroborate the analytical model. Preliminary results dealing only with silent errors were accepted and presented at the FTXS workshop, held in conjunction with HPDC'17, Washington DC, USA, in June 2017 [J-6]. An extended version considering both fail-stop and silent errors has been accepted to JPDC in 2018 [J-7].
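The triplication mechanism described above can be sketched as a small Monte-Carlo simulation. This is a toy model, not the analytical study of the papers: it assumes that corrupted results never coincide (so a majority of correct replicas is exactly the "two out of three coincide" test), and it ignores checkpoint cost and fail-stop errors:

```python
import random

def run_with_triplication(total_work, period, silent_error_prob, rng):
    """Monte-Carlo sketch of group triplication with checkpointing:
    three replicas execute each period; a checkpoint is taken only when
    at least two results coincide (majority vote), otherwise the silent
    error is detected and the period is re-executed from the last
    checkpoint. Assumes two corrupted results never match each other."""
    done, time = 0.0, 0.0
    while done < total_work:
        # each replica's result is correct unless a silent error struck it
        correct = [rng.random() >= silent_error_prob for _ in range(3)]
        time += period
        if sum(correct) >= 2:   # two correct replicas coincide: verified
            done += period      # commit the period with a checkpoint
        # else: detected silent error, roll back (no progress committed)
    return time

rng = random.Random(42)
t = run_with_triplication(total_work=1000.0, period=10.0,
                          silent_error_prob=0.05, rng=rng)
```

Varying `silent_error_prob` and `period` in such a simulation reproduces the qualitative trade-off of the analytical study: longer periods amortize checkpoints but lose more work per detected error.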

4. Scheduling parallel tasks under multiple resources: list scheduling vs. pack scheduling. Scheduling in High-Performance Computing (HPC) has been traditionally centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications starts to drive novel architectures and new computing frameworks that support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands that scheduling solutions also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications.
During Rédouane Elghazi's visit to Vanderbilt in the summer of 2017, we have studied the problem of scheduling HPC applications while exploring the availability of multiple types of resources that could impact their performance. The goal is to minimize the overall execution time, or makespan, for a set of moldable tasks under multiple-resource constraints. Two scheduling paradigms, namely, list scheduling and pack scheduling, are compared through both theoretical analyses and experimental evaluations. Theoretically, we prove, for several algorithms falling in the two scheduling paradigms, tight approximation ratios that increase linearly with the number of resource types. As the complexity of direct solutions grows exponentially with the number of resource types, we also design a strategy to indirectly solve the problem via a transformation to a single-resource-type problem, which can significantly reduce the algorithms' running times without compromising their approximation ratios. Experiments conducted on Intel Knights Landing with two resource types (processor cores and high-bandwidth memory) and simulations designed on more resource types confirm the benefit of the transformation strategy, and show that pack-based scheduling, despite having a worse theoretical bound, offers a practically promising and easy-to-implement solution, especially when more resource types need to be managed. This work has been published and presented at IPDPS'2018, Vancouver, BC, Canada, in May 2018 [J-8].
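The two paradigms can be contrasted on a toy instance. The sketch below is a simplification of the setting of the paper: tasks are rigid (a fixed demand per resource type) rather than moldable, and the first-fit and longest-first choices are illustrative rather than the algorithms that were analyzed:

```python
import heapq

def list_schedule(tasks, capacity):
    """Greedy list scheduling: whenever resources are released, start
    every queued task whose demand vector fits in what is free.
    Each task is (duration, demand_vector); demands must fit capacity."""
    free, running, t, queue = list(capacity), [], 0.0, list(tasks)
    while queue or running:
        started = True
        while started:                       # start all tasks that fit now
            started = False
            for task in list(queue):
                dur, dem = task
                if all(f >= d for f, d in zip(free, dem)):
                    free = [f - d for f, d in zip(free, dem)]
                    heapq.heappush(running, (t + dur, dem))
                    queue.remove(task)
                    started = True
        finish, dem = heapq.heappop(running)  # advance to next completion
        t = finish
        free = [f + d for f, d in zip(free, dem)]
        while running and running[0][0] == t:  # release simultaneous finishes
            _, dem = heapq.heappop(running)
            free = [f + d for f, d in zip(free, dem)]
    return t

def pack_schedule(tasks, capacity):
    """First-fit pack scheduling: tasks are grouped into packs whose
    demands fit simultaneously; packs execute one after the other,
    each taking the duration of its longest task."""
    packs = []                                # each pack: [used_vector, span]
    for dur, dem in sorted(tasks, reverse=True):   # longest task first
        for pack in packs:
            if all(u + d <= c for u, d, c in zip(pack[0], dem, capacity)):
                pack[0] = [u + d for u, d in zip(pack[0], dem)]
                pack[1] = max(pack[1], dur)
                break
        else:
            packs.append([list(dem), dur])
    return sum(p[1] for p in packs)

tasks = [(4, [2]), (3, [2]), (2, [2]), (1, [2])]   # one resource type, cap 4
print(list_schedule(tasks, [4]))   # 5.0
print(pack_schedule(tasks, [4]))   # 6
```

On this instance list scheduling wins because it backfills across pack boundaries, which matches the theory; the experimental finding of the paper is that pack scheduling nevertheless becomes attractive in practice as the number of resource types grows.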

5. High-performance neuroscience applications. A recent collaboration has been initiated in 2018, in which several members of Keystone are studying how neuroscience application developers can schedule their applications on high-performance platforms at minimum cost. This is on-going work, and we plan to submit a paper by the end of 2018.

Record of activities:

2016:

2017:

2018:

E-Mail: Anne.Benoit at
Last modified: Fri Oct 5 14:01:00 CEST 2018