Keystone Associate Team, 2016-2018

Scheduling algorithms for sparse linear algebra at extreme scale

Principal investigator (Inria): Anne Benoit

Principal investigator (Vanderbilt University): Padma Raghavan

Members:

Abstract: The Keystone project aims at investigating sparse matrix and graph problems on NUMA multicores and/or hybrid CPU-GPU platforms. The goal is to improve the performance of the algorithms while accounting for failures and minimizing energy consumption. The long-term objective is to design robust sparse linear algebra kernels for computing at extreme scale. In order to optimize the performance of these kernels, we plan to pay particular attention to locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in applicative contexts. We believe that the complementary expertise of the two teams in the area of scheduling HPC applications at scale (ROMA -- models and complexity; SSCL -- architecture and applications) is the key to the success of this associate team. We have already collaborated successfully in the past and expect the collaboration to reach another level thanks to Keystone.

Scientific progress in 2016: This year, we have worked on two axes to advance towards the goals of Keystone. On the one hand, we have proposed a model of cache partitioning in order to efficiently co-schedule applications, and we have studied this problem from both the theoretical and the practical side. On the other hand, we have provided an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems: Our major contribution during the first year is the study of cache-partitioning techniques for co-scheduling applications. Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider $n$ applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the $n$ applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign a rational number of processors to each application, since processors can be shared across applications through multi-threading. We have provided answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache-partitioning strategies are deployed. This collaborative work has been submitted to an international conference, and an extended version is in preparation for submission to a journal special issue. It corresponds to Inria RR-8965 (November 2016).
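To make the cache/processor trade-off concrete, here is a minimal Python sketch of the kind of exhaustive search one can run in the perfectly parallel case. The slowdown model (miss_factor), the number of processors and all numerical values are illustrative assumptions, not the model or the heuristics of the report; the only ingredient taken from the setting above is that, for perfectly parallel applications with rational processor shares, completion times can be equalized, so the makespan reduces to the total cache-adjusted work divided by the number of processors.

    from itertools import product

    P = 10                    # number of processors (illustrative)
    SLOT = 0.05               # discretize the cache into 5% slots for the search
    N_SLOTS = int(1 / SLOT)   # 20 slots in total

    def miss_factor(x, alpha):
        # Hypothetical slowdown: a smaller cache fraction x means more misses.
        return 1.0 + alpha / (x + SLOT)

    def makespan(works, alphas, slots):
        # Perfectly parallel applications with rational processor shares:
        # completion times can be equalized, so the makespan is the total
        # cache-adjusted work divided by the number of processors.
        adjusted = [w * miss_factor(s * SLOT, a)
                    for w, a, s in zip(works, alphas, slots)]
        return sum(adjusted) / P

    def best_partition(works, alphas):
        # Exhaustive search over cache partitions (fine for a handful of apps).
        best = (float("inf"), None)
        for slots in product(range(N_SLOTS + 1), repeat=len(works)):
            if sum(slots) <= N_SLOTS:
                best = min(best, (makespan(works, alphas, slots), slots))
        return best

    # Three applications with different (made-up) cache sensitivities.
    print(best_partition([100.0, 80.0, 60.0], [0.5, 0.1, 0.3]))

An exhaustive search like this one only scales to a handful of applications, which is one motivation for the heuristics designed in the report.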
Resilience techniques for scheduling algorithms: In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms to cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. Then we explain how to combine checkpointing with fault prediction, and discuss how the optimal period is modified when this combination is used. Finally, we follow the same approach for the combination of checkpointing with replication. This work corresponds to Inria RR-8971 (November 2016) and has been submitted as a chapter of a forthcoming CDER book (IEEE TCPP and NSF) on teaching material for parallel and distributed computing.
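As an illustration of the first part of that report, here is a minimal sketch of the classical first-order analysis of the checkpointing period (the Young/Daly approximation): with a platform MTBF mu and a checkpoint cost C, the waste-minimizing period is approximately sqrt(2*mu*C). The numerical values below are made up for the example and are not taken from the report.

    import math

    def optimal_period(mu, C):
        # First-order (Young/Daly) approximation of the optimal period.
        return math.sqrt(2.0 * mu * C)

    def waste(T, mu, C, R=0.0):
        # First-order expected waste: checkpoint overhead + lost work per failure.
        return C / T + (R + T / 2.0) / mu

    # Made-up example: 100,000 nodes with an individual MTBF of 10 years.
    mu = (10 * 365 * 86400) / 100_000   # platform MTBF shrinks with the node count
    C = 60                              # one-minute checkpoint
    T = optimal_period(mu, C)
    print(f"platform MTBF {mu / 60:.0f} min, period {T / 60:.1f} min, "
          f"waste {waste(T, mu, C):.1%}")

The waste function makes the trade-off explicit: a shorter period pays the checkpoint overhead more often, while a longer period loses more work on average when a failure strikes.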

Scientific progress in 2017: During the second year of the associate team, we have worked on two major axes to advance towards the goals of Keystone. Firstly, we have further refined our model of cache partitioning in order to efficiently co-schedule applications, and we have pursued the study of this problem from the theoretical side. Furthermore, we have gained access to machines on which we can experiment with cache partitioning, and we have initiated a set of real-life experiments on linear algebra kernels. Secondly, we have pursued our collaboration on resilience techniques for scheduling algorithms, aiming at coping with faults on large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems: We have pursued the study of cache-partitioning techniques for co-scheduling applications. Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multi-core machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the n applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? In this work, we provide answers to (i) and (ii) for both Amdahl applications and perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for Amdahl applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache-partitioning strategies are deployed. Both a workshop paper (perfectly parallel applications and rational numbers of processors) and a journal paper (extending to Amdahl applications and integer numbers of processors) have been published in 2017 with these results.

Following these encouraging theoretical and simulated results, we have gained access to a machine on which it is possible to experiment with cache partitioning. Indeed, Cache Allocation Technology (CAT) is a recently released Intel technology, part of the Intel Resource Director Technology (RDT). Basically, CAT is an interface for the operating system to group applications into classes of service (COS). Each class of service describes the amount of resources that the applications assigned to that class can use; resources can be an amount of cache or of memory bandwidth. We are therefore able to partition the cache by building upon this technology. We run several applications concurrently on an Intel Xeon E5-2650L v4 with 14 processing units and a shared 35 MB LLC; slots of 5% of the cache can be allocated to applications. We consider six classical HPC applications from the NAS Parallel Benchmark suite: CG (Conjugate Gradient), BT (Block Tridiagonal), LU (Lower-Upper symmetric Gauss-Seidel), SP (Scalar Pentadiagonal), MG (MultiGrid) and FT (Fast Fourier Transform). In these first results, we have clearly identified cases where cache partitioning is very useful, provided that the fraction of cache assigned to each application is chosen judiciously. We are currently refining the model to better fit the experimental observations, and we plan to submit a paper reporting these experimental results early next year.
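To illustrate how such a partition could be enforced in practice, here is a small Python sketch that turns the cache fractions chosen by a co-scheduling heuristic into contiguous CAT bit masks and prints the corresponding pqos commands from Intel's intel-cmt-cat package. The number of capacity bits, the core assignments and the exact pqos syntax are assumptions to be checked against the target machine and the installed tool version; this sketch is not the experimental setup used above.

    # Assumption: the LLC exposes 20 capacity bits to CAT, so one bit is one
    # 5% slot, and each class of service (COS) must use a contiguous mask.
    N_WAYS = 20

    def contiguous_masks(slots_per_app):
        # Build contiguous, non-overlapping bit masks, one per application.
        masks, low = [], 0
        for slots in slots_per_app:
            if low + slots > N_WAYS:
                raise ValueError("requested more than 100% of the LLC")
            masks.append(((1 << slots) - 1) << low)
            low += slots
        return masks

    def pqos_commands(slots_per_app, cores_per_app):
        # Emit the (assumed) pqos invocations for each class of service.
        cmds = []
        for cos, (mask, cores) in enumerate(
                zip(contiguous_masks(slots_per_app), cores_per_app), start=1):
            cmds.append(f'pqos -e "llc:{cos}=0x{mask:05x}"')   # define the mask
            cmds.append(f'pqos -a "llc:{cos}={cores}"')        # pin cores to it
        return cmds

    # Example: 40% of the LLC (8 slots) and cores 0-6 for one application,
    # 20% (4 slots) and cores 7-13 for another.
    for cmd in pqos_commands([8, 4], ["0-6", "7-13"]):
        print(cmd)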
Using replication to handle fail-stop and silent errors on large-scale platforms: With Hongyang Sun joining Keystone in 2017 as a researcher at Vanderbilt University, we have strengthened our collaboration on resilience techniques. This year, we have provided a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, based on algorithms (ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication or more) for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and to correct some silent errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes: each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times; the platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint; the same happens when a fail-stop error strikes. We have provided a detailed analytical study of all these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We have also reported a set of extensive simulation results that nicely corroborate the analytical model. Preliminary results dealing only with silent errors have been accepted and presented at the FTXS workshop, held in conjunction with HPDC'17 in Washington DC, USA, in June 2017. An extended version considering both fail-stop and silent errors has been submitted to JPDC.
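The protocol described above (compare replicas before each checkpoint, roll back on mismatch or on a fail-stop error) can be illustrated with a small Monte Carlo sketch. The sketch below considers only silent errors, assumes exponentially distributed arrivals, and deliberately ignores the throughput lost by dedicating half (or a third) of the platform to replicas, which is central to the trade-off analyzed in the paper; all parameter values are made up for the example and this is not the analytical model of the paper.

    import math
    import random

    def run(total_work, period, C, R, lam, replicas=2, seed=0):
        # Simulated wall-clock time to complete `total_work` seconds of work.
        rng = random.Random(seed)
        done, clock = 0.0, 0.0
        while done < total_work:
            segment = min(period, total_work - done)
            # P(a silent error hits one replica during the segment).
            p_err = 1.0 - math.exp(-lam * segment)
            hit = sum(rng.random() < p_err for _ in range(replicas))
            clock += segment
            if hit == 0 or replicas - hit >= 2:
                # All replicas agree (duplication), or a majority of correct
                # replicas out-votes the corrupted one (triplication).
                clock += C
                done += segment
            else:
                # Detected mismatch without a correct majority: roll back.
                clock += R
        return clock

    # Compare duplication and triplication for a given (made-up) error rate.
    for r in (2, 3):
        t = run(total_work=100_000, period=2_000, C=60, R=30,
                lam=1e-4, replicas=r)
        print(f"{r} replicas: {t / 3600:.2f} hours")

Even such a crude simulation shows the qualitative effect: triplication wastes fewer rollbacks, but on a real platform it also leaves fewer resources for useful work, which is precisely the trade-off the analytical study quantifies.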

Work program for 2018: The long-term objective of this collaboration is to design robust sparse linear algebra kernels for computing at extreme scale. Achieving this goal is fundamental for HPC applications, since it is estimated that supercomputers spend half their time solving linear systems, the vast majority of them being sparse. However, it is a difficult goal that requires solving several challenges dealing with application-specific resilience techniques, silent error detection and mitigation, energy consumption, data locality, load balancing, solver robustness, etc. Most likely, next-generation kernels will not resemble current ones: they will have different preconditioners, embedded verification mechanisms, extended work-stealing capabilities, etc. This is a whole world of new paradigms to explore. While we will not solve all problems by ourselves, we are well qualified to address some of them, thanks to our complementary expertise covering most of the algorithmic, software and hardware aspects of this challenging goal.
During the first two years of the associate team, we have made considerable progress on the understanding of co-scheduling challenges, as well as on resilience techniques. We have not yet been able to develop locality-aware algorithms, and this is definitely in our plans for next year. In particular, we plan to focus on developing models for NUMA multicores and task-graph models for sparse linear algebra kernels that represent features at the appropriate level of detail to enable performance improvements through efficient algorithms. Through our experiments, we have seen that applications do not necessarily follow our models, and it therefore seems interesting to account for a certain level of non-clairvoyance about the actual applications that will be running. We would then need dynamic algorithms, such as work-stealing algorithms, to provide some robustness against this non-clairvoyance, and this is part of our work plan for next year. We seek theoretical models and complexity results that can improve our understanding of work-stealing, so as to provide scheduling frameworks that could be broadly used for such applications.
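For reference, here is a minimal sketch of the classic work-stealing discipline we plan to build on: each worker owns a deque of tasks, executes from its own end (good for locality) and, when idle, steals from the opposite end of a randomly chosen victim. This is a sequential toy simulation with made-up numbers, not a runtime design or the scheduling framework we intend to study.

    import random
    from collections import deque

    def work_stealing_sim(task_lists, steps=10_000, seed=0):
        rng = random.Random(seed)
        deques = [deque(tasks) for tasks in task_lists]
        executed = [0] * len(deques)
        for _ in range(steps):
            if not any(deques):
                break
            w = rng.randrange(len(deques))
            if deques[w]:
                deques[w].pop()              # owner works from its own end (LIFO)
                executed[w] += 1
            else:
                victim = rng.randrange(len(deques))
                if victim != w and deques[victim]:
                    # Thief steals an old task from the other end (FIFO).
                    deques[w].appendleft(deques[victim].popleft())
        return executed

    # Example: a deliberately unbalanced initial distribution of 1000 unit tasks.
    print(work_stealing_sim([range(700), range(200), range(100), range(0)]))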
Another direction for next year is to study the impact of co-scheduling, cache partitioning and resilience on energy consumption, so as to tackle the last topic targeted by Keystone. Finally, we will aim at validating the results on real computational kernels; the experiments initiated on cache partitioning are a very promising step towards this validation.

Record of activities:

2016:

2017:
