Inria ROMA project-team: Anne Benoit, Yves Robert, Aurélien Cavelan, Loic Pottier
Vanderbilt University (previously Computer Science and
Engineering, Pennsylvania State University): Padma Raghavan,
Guillaume Aupy, Hongyang Sun (and previously Ben Heidorm, Humayun Kabir)
The Keystone project aims at investigating sparse matrix and graph problems on NUMA multicores and/or hybrid CPU-GPU platforms. The goal is to improve the performance of the algorithms while accounting for failures and minimizing energy consumption.
The long-term objective is to design robust sparse-linear kernels for computing at extreme scale. In order to optimize the performance of these kernels, we plan to pay particular attention to locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in these applicative contexts.
We believe that the complementary expertise of the two teams in the
area of scheduling HPC applications at scale (ROMA -- models and
complexity; and SSCL -- architecture and applications) is the key to the success of this associate team. We have already successfully collaborated in the past and expect the collaboration to reach another level thanks to Keystone.
Scientific progress in 2016:
This year, we have worked on two axes to advance towards the goals of Keystone. On the one hand, we have
come up with a model of cache partitioning in order to efficiently co-schedule applications, and we have
studied this problem both from the theoretical and from the practical side. On the other hand, we have
provided an introduction to the design of scheduling algorithms to cope with faults on
large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems:
Our major contribution during the first year is the study of cache-partitioning techniques for co-scheduling applications.
Cache-partitioned architectures allow subsections of the
shared last-level cache (LLC) to be exclusively reserved for some
applications. This technique dramatically limits interactions between applications
that are concurrently executing on a multi-core machine.
Consider $n$ applications that execute concurrently, with the objective to minimize the makespan,
defined as the maximum completion time of the $n$ applications.
Key scheduling questions are: (i) which proportion of cache and (ii) how many
processors should be given to each application?
Here, we assign rational numbers of processors to each application,
since they can be shared across applications through multi-threading.
We have provided answers to (i) and (ii) for perfectly parallel applications.
Even though the problem is shown to be NP-complete, we give key elements to determine
the subset of applications that should share the LLC
(while the remaining ones only use their smaller private cache). Building upon these results,
we design efficient heuristics for general applications.
Extensive simulations demonstrate the usefulness of co-scheduling
when our efficient cache partitioning strategies are deployed.
This collaborative work has been submitted to an international conference, and an extended version is in
preparation for submission to a journal special-issue. It corresponds to Inria RR-8965 (November 2016).
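As a toy illustration of the subset-selection problem mentioned above, the brute-force sketch below (in Python) enumerates which subset of perfectly parallel applications shares the LLC, the others keeping only their private cache, and picks the subset with the smallest makespan; the timing model and all constants are placeholders invented for the example, not the model analyzed in the report.

    from itertools import combinations

    # Toy brute-force over the subset of applications sharing the LLC; the
    # others run out of their private cache only. The timing model below is
    # a hypothetical placeholder, not the one studied in the report.
    def makespan(apps, llc_subset):
        share = 1.0 / len(llc_subset) if llc_subset else 0.0
        return max(
            (w / p) * (1.0 + 0.5 * (1.0 - share) if i in llc_subset else 1.5)
            for i, (w, p) in enumerate(apps))

    def best_subset(apps):
        indices = range(len(apps))
        candidates = [set(c) for r in range(len(apps) + 1)
                      for c in combinations(indices, r)]
        return min(candidates, key=lambda s: makespan(apps, s))

    apps = [(100.0, 4.5), (80.0, 3.5), (60.0, 2.0)]  # (work, processor share)
    print(best_subset(apps))

The exponential enumeration is only meant to make the decision concrete; the heuristics designed in the report avoid it.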
Resilience techniques for scheduling algorithms:
In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms to cope with faults on large-scale parallel platforms.
We study checkpointing and show how to derive the optimal checkpointing period.
Then we explain how to combine checkpointing with fault prediction,
and discuss how the optimal period is modified when this combination is used.
Finally we follow the very same approach for the combination of checkpointing with
replication. This work corresponds to Inria RR-8971 (November 2016) and has been submitted as a chapter of a forthcoming CDER book (IEEE TCPP and NSF)
on teaching material for parallel and distributed computing.
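As a concrete illustration of the first result covered in that report, the classical first-order approximation of the optimal checkpointing period (the Young/Daly formula) can be computed as below; the MTBF and checkpoint cost used here are made-up values for the example.

    from math import sqrt

    # Young/Daly first-order approximation of the optimal checkpointing
    # period: T_opt is roughly sqrt(2 * mu * C), where mu is the platform
    # MTBF and C the checkpoint cost. The values below are illustrative.
    def daly_period(mtbf, checkpoint_cost):
        return sqrt(2.0 * mtbf * checkpoint_cost)

    mu = 24 * 3600.0  # assumed platform MTBF: one day (in seconds)
    C = 600.0         # assumed checkpoint cost: ten minutes
    print(daly_period(mu, C))  # about 10,182 seconds between checkpoints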
Scientific progress in 2017:
During the second year of the associate team, we have worked on two
major axes to advance towards the goals of Keystone. Firstly, we have
further refined our model of cache partitioning in order to
efficiently co-schedule applications, and we have pursued the study of
this problem from the theoretical side. Furthermore, we have gained
access to machines on which we can experiment with cache partitioning,
and we have initiated a set of real-life experiments on linear algebra
kernels. Secondly, we have pursued some collaboration on resilience
techniques for scheduling algorithms, aiming at coping with faults on
large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems:
We have pursued the study of cache-partitioning techniques for
co-scheduling applications. Cache-partitioned architectures allow
subsections of the shared last-level cache (LLC) to be exclusively
reserved for some applications. This technique dramatically limits
interactions between applications that are concurrently executing on a
multi-core machine. Consider n applications that execute concurrently,
with the objective to minimize the makespan, defined as the maximum
completion time of the n applications. Key scheduling questions are:
(i) which proportion of cache and (ii) how many processors should be
given to each application? In this work, we provide answers to (i) and
(ii) for both Amdahl applications and perfectly parallel
applications. Even though the problem is shown to be NP-complete, we
give key elements to determine the subset of applications that should
share the LLC (while the remaining ones only use their smaller private
cache). Building upon these results, we design efficient heuristics
for Amdahl applications. Extensive simulations demonstrate the
usefulness of co-scheduling when our efficient cache partitioning
strategies are deployed. Both a workshop paper (perfectly parallel
applications and rational number of processors) and a journal paper
(extending to Amdahl applications and integer number of processors)
have been published in 2017 with these results.
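To illustrate the Amdahl variant, the sketch below evaluates the makespan of one candidate assignment of integer processor counts and cache fractions; Amdahl's law gives the parallel execution time, while the cache penalty term and all numerical values are assumptions made for the example.

    # Illustrative evaluation of one candidate co-schedule for Amdahl
    # applications with integer processor counts. The cache penalty is a
    # hypothetical placeholder, not the model of the published papers.
    def amdahl_time(work, seq_fraction, procs, cache_fraction):
        parallel_time = work * (seq_fraction + (1.0 - seq_fraction) / procs)
        return parallel_time * (1.0 + 0.3 * (1.0 - cache_fraction))  # assumed

    def makespan(apps, alloc):
        # apps: list of (work, sequential fraction); alloc: (cores, LLC fraction)
        return max(amdahl_time(w, s, p, x)
                   for (w, s), (p, x) in zip(apps, alloc))

    apps = [(100.0, 0.05), (80.0, 0.10), (60.0, 0.02)]
    alloc = [(6, 0.5), (5, 0.3), (3, 0.2)]  # 14 cores and the LLC shared out
    print(makespan(apps, alloc))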
Following these encouraging theoretical and simulated results, we have gained access to
a machine on which it is possible to experiment with cache
partitioning. Indeed, the Cache Allocation Technology (CAT) is a
recently released Intel technology, part of the Intel Resource
Director Technology (RDT). Basically, CAT is an interface for the
operating system to group applications into classes of service
(COS). Each class of service describes the amount of resources the
applications assigned to that class can use. Resources can be amount
of cache or memory bandwidth. We are therefore able to partition the
cache, building upon this technology.
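For instance, one plausible way to drive the partitioning is through Intel's pqos command-line tool (from the intel-cmt-cat package), as sketched below; the bitmasks, core lists and exact invocation are assumptions made for illustration and may need adjusting on a given machine.

    import subprocess

    # Illustrative use of Intel CAT through the pqos tool: the bitmasks
    # split a 20-way LLC into a 12-way and an 8-way partition, and each
    # partition is bound to a group of cores. Values are examples only.
    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Define two classes of service (COS) over disjoint LLC ways.
    run('pqos -e "llc:1=0x00fff;llc:2=0xff000"')

    # Cores 0-6 (first application) use COS 1, cores 7-13 use COS 2.
    run('pqos -a "llc:1=0-6;llc:2=7-13"')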
We run several applications concurrently on an Intel Xeon E5-2650L v4
with 14 processing units and a shared 35 MB LLC. Slots of 5% of the
cache can be allocated to applications. We consider six classical HPC
applications from the NAS Parallel Benchmark suite: CG (Conjugate
Gradient), BT (Block Tridiagonal), LU (Lower-Upper symmetric
Gauss-Seidel), SP (Scalar Pentadiagonal), MG (MultiGrid) and FT (Fast
Fourier Transform). In the first results, we have clearly identified
cases where it was very useful to use the cache partitioning feature
and to judiciously decide which fraction of cache to assign to each
application. We are currently refining the model to better fit the
experimental observations. We plan to submit a paper reporting these
experimental results early next year.
Using replication to handle fail-stop and silent errors on large-scale platforms:
With Hongyang Sun joining Keystone in 2017 as a researcher at
Vanderbilt University, we have strengthened our collaboration on
resilience techniques. We have worked this year on providing a model
and an analytical study of replication as a technique to detect and
correct silent errors, as well as to cope with both silent and
fail-stop errors on large-scale platforms. Fail-stop errors are
immediately detected, unlike silent errors for which a detection
mechanism is required. To detect silent errors, many
application-specific techniques are available, based either on
algorithm-based fault tolerance (ABFT), invariant preservation, or data analytics, but
replication remains the most transparent and least intrusive
technique. We explore the right level (duplication, triplication or
more) of replication for two frameworks: (i) when the platform is
subject only to silent errors, and (ii) when the platform is subject
to both silent and fail-stop errors. A higher level of replication is
more expensive in terms of resource usage but makes it possible to tolerate
more errors and to correct some silent errors, hence there is a
trade-off to be found. Replication is combined with checkpointing and
comes with two flavors: process replication and group
replication. Process replication applies to message-passing
applications with communicating processes. Each process is
replicated, and the platform is composed of process pairs, or
triplets. Group replication applies to black-box applications, whose
parallel execution is replicated several times. The platform is
partitioned into two halves (or three thirds). In both scenarios,
results are compared before each checkpoint, which is taken only when
both results (duplication) or two out of three results (triplication)
coincide. Otherwise, one or more silent errors have been detected, and
the application rolls back to the last checkpoint; it also rolls back
whenever a fail-stop error strikes.
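A minimal sketch of this comparison step, for duplication and triplication, could look as follows; the digest values and the checkpoint/rollback callbacks are placeholders for whatever mechanism the application actually uses.

    # Illustrative comparison of replica results before a checkpoint.
    # With duplication a mismatch only detects a silent error; with
    # triplication a majority vote can also correct it.
    def compare_and_checkpoint(digests, take_checkpoint, rollback):
        if len(digests) == 2:
            if digests[0] == digests[1]:
                take_checkpoint(digests[0])   # results agree: safe to checkpoint
            else:
                rollback()                    # silent error detected, not correctable
        elif len(digests) == 3:
            for d in digests:
                if digests.count(d) >= 2:     # two out of three coincide
                    take_checkpoint(d)        # the majority value is checkpointed
                    return
            rollback()                        # no majority: roll back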
We have provided a detailed analytical study for all of these
scenarios, with formulas to decide, for each scenario, the optimal
parameters as a function of the error rate, checkpoint cost, and
platform size. We have also reported a set of extensive simulation
results that nicely corroborates the analytical model. Preliminary
results dealing only with silent errors have been accepted and
presented at the FTXS workshop, in conjunction with HPDC'17,
Washington DC, USA, in June 2017. An extended version considering both
fail-stop and silent errors has been submitted to JPDC.
Work program for 2018:
The long-term objective of this collaboration is to design robust
sparse-linear kernels for computing at extreme scale. Achieving this
goal is fundamental for HPC applications, since it is estimated that
supercomputers spend half their time solving linear systems, a vast
majority of them being sparse. However, it is a difficult goal that
requires solving several challenges dealing with
application-specific resilience techniques, silent error detection and
mitigation, energy consumption, data locality, load-balancing, solver
robustness, etc. Most likely, next-generation kernels will not
resemble current ones: they will have different preconditioners,
embedded verification mechanisms, extended work-stealing
capabilities, etc. This is a whole world of new paradigms to
explore. While we will not solve all problems by ourselves, we are
well-qualified to address some of them, due to our complementary
expertises that cover most of the algorithmic, software and hardware
aspects of this challenging goal.
During the two first years of the associate team, we have made
considerable progress on the comprehension of co-scheduling
challenges, as well as on resilience techniques. We have not yet been
able to develop locality-aware algorithms, and this is definitely in
our plans for next year. In particular, we plan to focus on developing
models for NUMA multicores and task graph models for sparse linear
algebra kernels that represent features at the appropriate detail to
enable performance improvements through efficient algorithms. Through
our experiments, we have seen that applications do not necessarily
follow our models, hence it seems worthwhile to account for a certain level
of non-clairvoyance about the actual applications that will be
running. We would therefore need to develop dynamic algorithms such as
work-stealing algorithms to provide some robustness for this
non-clairvoyance, and this is part of our work plan for next year. We
seek theoretical models and complexity results that can improve our
understanding of work-stealing to provide scheduling frameworks that
could be broadly used for such applications.
Another direction for next year is to study the impact of
co-scheduling, cache partitioning and resilience on energy consumption
to tackle the last topic targeted by Keystone. Finally, we will aim at
validating the results on real computational kernels. The experiments
started on cache partitioning are very promising towards this goal.
Record of activities:
April 5-7: Visit of Guillaume Aupy in Lyon to discuss with Anne Benoit, Yves Robert and Loic Pottier and initiate the work on co-scheduling algorithms for cache-partitioned systems. This visit was financed by Vanderbilt University.
May 16-17: Visit of Yves Robert to Vanderbilt University, just before the workshop of May 18-20.
May 18-20: Guillaume Aupy, Padma Raghavan and Yves Robert jointly organized a workshop, the 11th Scheduling for Large Scale Systems Workshop, held at Vanderbilt University in Nashville, Tennessee. This workshop, by invitation only, was structured as a set of thematic half-day sessions, mainly focused on scheduling and algorithms for large-scale systems. It was organized in the context of the Keystone Associate Team, and was a great opportunity for members of the team to discuss current work within the Associate Team.
June 12 - July 1: Visit of Guillaume Aupy in Lyon (financed by Keystone), during which we were able to better formalize the co-scheduling problem and make great progress on it.
August 15-20: ICPP conference in Philadelphia, USA, which Guillaume Aupy, Anne Benoit, Loic Pottier and Yves Robert all attended, spending some time working on Keystone together. Travel was financed from other funding sources.
September 1-2: Visit of Yves Robert to Vanderbilt University to work with Guillaume Aupy and Padma Raghavan.
October 3-6: Anne Benoit, Padma Raghavan and Yves Robert attended the Workshop on Clusters, Clouds, and Data for Scientific Computing at La Maison des Contes, Dareizé, France. Travel was financed from other funding sources.
November 14-20: Guillaume Aupy, Anne Benoit, Padma Raghavan and
Yves Robert all attended SC'16 in Salt Lake City and made progress on
the Keystone project. Travel was financed from other funding sources.
January 16-18: Guillaume Aupy, Anne Benoit, Padma Raghavan
and Yves Robert attended a meeting for the Supercomputing
conference SC'17 in Charleston, USA, and they took this opportunity
to discuss current and future plans for Keystone for 2017.
May 25-27: Most of the Keystone participants gathered at the
Scheduling at scale workshop organized in Knoxville, TN (Y. Robert
was co-organizing the workshop). Three presentations were given by Keystone members.
May 29 - June 2: Most of the Keystone participants gathered at
the IPDPS'17 conference, Orlando, USA, where the first cache
partitioning paper was presented at the APDCM workshop.
June 5-7: Guillaume Aupy, Anne Benoit and Padma Raghavan attended
the SC'17 Program Committee meeting in Denver, USA, where they
pursued discussions on cache partitioning and resilience.
June 26-30: Aurélien Cavelan and Hongyang Sun both attended the
FTXS'17 workshop in conjunction with HPDC'17 in Washington DC, USA,
and presented the Keystone paper on resilience. They gathered a few
days after the workshop to work on the extension, which was
subsequently submitted to JPDC.
July 3-7: Guillaume Aupy visited the Inria team in Lyon to
actively discuss the cache-partitioning experimental work, refine
the models, and establish a detailed roadmap, after we obtained the
first preliminary results.
November 14-20: Guillaume Aupy, Anne Benoit, Padma Raghavan and
Yves Robert will all attend SC'17 in Denver, CO, and look forward to
other opportunities to work on our Keystone project.