Keystone Associate Team, 2016-2018

Scheduling algorithms for sparse linear algebra at extreme scale

Principal investigator (Inria): Anne Benoit

Principal investigator (Vanderbilt University): Padma Raghavan

Members:

Abstract: The Keystone project aims at investigating sparse matrix and graph problems on NUMA multicores and/or CPU-GPU hybrid models. The goal is to improve the performance of the algorithms, while accounting for failures and trying to minimize the energy consumption. The long-term objective is to design robust sparse linear kernels for computing at extreme scale. In order to optimize the performance of these kernels, we plan to pay particular attention to locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in applicative contexts. We believe that the complementary expertise of the two teams in the area of scheduling HPC applications at scale (ROMA -- models and complexity; SSCL -- architecture and applications) is the key to the success of this associate team. We have already collaborated successfully in the past and expect the collaboration to reach another level thanks to Keystone.

Scientific progress in 2016: This year, we have worked on two axes to advance towards the goals of Keystone. On the one hand, we have developed a model of cache partitioning in order to efficiently co-schedule applications, and we have studied this problem from both a theoretical and a practical perspective. On the other hand, we have provided an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems: Our major contribution during the first year is the study of cache-partitioning techniques for co-scheduling applications. Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that execute concurrently on a multi-core machine. Consider $n$ applications that execute concurrently, with the objective of minimizing the makespan, defined as the maximum completion time of the $n$ applications. Key scheduling questions are: (i) which proportion of cache and (ii) how many processors should be given to each application? Here, we assign rational numbers of processors to each application, since processors can be shared across applications through multi-threading. We have provided answers to (i) and (ii) for perfectly parallel applications. Even though the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones only use their smaller private caches). Building upon these results, we design efficient heuristics for general applications. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache-partitioning strategies are deployed. This collaborative work has been submitted to an international conference, and an extended version is in preparation for submission to a journal special issue. It corresponds to Inria RR-8965 (November 2016).
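As a toy illustration of questions (i) and (ii), the sketch below discretizes the cache and processor shares for two perfectly parallel applications and searches for the split that minimizes the makespan. The power-law miss-penalty model and all names (`exec_time`, `best_partition`, the exponent `alpha`) are illustrative assumptions, not the model or the heuristics of RR-8965.

```python
def exec_time(work, procs, cache_frac, alpha=0.5):
    """Illustrative time model for a perfectly parallel application:
    runtime scales as work/procs, inflated when the LLC share shrinks.
    The power-law penalty (1/cache_frac)**alpha is a modeling assumption."""
    return (work / procs) * (1.0 / cache_frac) ** alpha

def best_partition(works, total_procs, steps=20):
    """Exhaustively try discretized cache/processor splits for two
    applications; return the minimum makespan and the (cache, proc)
    fractions given to application 0. Rational processor shares model
    processors shared across applications through multi-threading."""
    best_makespan, best_split = float("inf"), None
    grid = [i / steps for i in range(1, steps)]  # open interval (0, 1)
    for c in grid:            # cache fraction for application 0
        for p in grid:        # processor fraction for application 0
            t0 = exec_time(works[0], p * total_procs, c)
            t1 = exec_time(works[1], (1 - p) * total_procs, 1 - c)
            makespan = max(t0, t1)
            if makespan < best_makespan:
                best_makespan, best_split = makespan, (c, p)
    return best_makespan, best_split
```

For two identical applications the search unsurprisingly returns the symmetric split (half the cache, half the processors); the interest of the real problem lies in heterogeneous applications and more than two of them.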
Resilience techniques for scheduling algorithms: In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. Then we explain how to combine checkpointing with fault prediction, and discuss how the optimal period is modified when this combination is used. Finally, we follow the very same approach for the combination of checkpointing with replication. This work corresponds to Inria RR-8971 (November 2016) and has been submitted as a chapter of a forthcoming CDER book (IEEE TCPP and NSF) on teaching material for parallel and distributed computing.
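For the checkpointing part, the classical first-order result is the Young/Daly period $W = \sqrt{2 C \mu}$, where $C$ is the checkpoint cost and $\mu$ the platform MTBF. The sketch below computes this period and the associated first-order waste; the function names are illustrative, and the formulas are the standard first-order approximations (valid when $C \ll \mu$), not a summary of the full derivation in RR-8971.

```python
import math

def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal checkpointing period W = sqrt(2 * C * mu)
    (Young/Daly approximation), valid when C << mtbf."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste(period, checkpoint_cost, mtbf):
    """First-order expected waste per unit of work: checkpoint overhead
    C/W, plus expected re-execution W/(2*mtbf) after a failure (on
    average, half a period of work is lost)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

For example, with a 10-minute checkpoint (C = 600 s) and a one-day MTBF (mu = 86400 s), the optimal period is about 10182 s, i.e. roughly 2.8 hours; the waste function is minimized there, growing if one checkpoints either more or less often.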

Work program for 2017: The long-term objective of this collaboration is to design robust sparse linear kernels for computing at extreme scale. Achieving this goal is fundamental for HPC applications, since it is estimated that supercomputers spend half their time solving linear systems, the vast majority of which are sparse. However, it is a difficult goal that requires solving several challenges dealing with application-specific resilience techniques, silent error detection and mitigation, energy consumption, data locality, load balancing, solver robustness, etc. Most likely, next-generation kernels will not resemble current ones: they will have different preconditioners, embedded verification mechanisms, extended work-stealing capabilities, etc. This is a whole world of new paradigms to explore. While we will not solve all of these problems by ourselves, we are well qualified to address some of them, thanks to our complementary expertise, which covers most of the algorithmic, software and hardware aspects of this challenging goal.
For the second year of the associate team, we would like to develop locality-aware algorithms. In particular, we plan to focus on developing models for NUMA multicores, as well as task graph models for sparse linear algebra kernels, that represent features at the appropriate level of detail to enable performance improvements through efficient algorithms. Because applications might perform differently when actually running (because of congestion, processor wear, etc.), we will have to cope with a certain level of non-clairvoyance about the applications that will actually be running. We will need to develop dynamic algorithms, such as work-stealing algorithms, to provide some robustness against this non-clairvoyance. We also plan to extend the work already done on co-scheduling.
Our approach is based on the following rationale. First, many of the basic kernels required by such sparse applications, such as matrix-vector multiplication or breadth-first search, can be formulated as large ``packs'' of fine-grained independent tasks whose performance depends on reducing the cost of data movement between tasks. On NUMA architectures, we therefore seek models of ``data affinity aware'' scheduling, so that access latencies are reduced by scheduling for increased reuse in shared caches through improved temporal locality. Second, we have found that dynamic schedules are often needed to overcome issues of core variability, and that NUMA-aware work-stealing leads to significant performance improvements in this setting. We seek theoretical models and complexity results that can improve our understanding of this observation, in order to provide scheduling frameworks that could be broadly used for such applications. Third, we have found that when workloads of such scientific applications are mapped to multicores, there is a differential impact on applications because of interference in the memory system, as well as a potential to increase throughput by co-scheduling specific applications together. Consequently, we seek better models of workload scheduling to decrease workload makespan, with corresponding gains in energy efficiency. We detail below how we plan to initiate the work.
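To make the work-stealing idea concrete, here is a minimal sequential simulation of NUMA-aware stealing: each worker owns a deque, pops its own tasks from one end, and, when idle, steals the oldest task from another worker, preferring victims on its own NUMA node. The victim-selection policy and all names are illustrative assumptions for exposition, not the scheduler studied in the project.

```python
import random
from collections import deque

class Worker:
    def __init__(self, wid, node):
        self.wid, self.node = wid, node
        self.tasks = deque()        # owner pushes/pops at the right end

def steal(thief, workers):
    """Locality-aware victim selection (an illustrative policy): try
    victims on the thief's NUMA node first, then remote ones; thieves
    take from the left end, i.e. the victim's oldest tasks."""
    local = [w for w in workers
             if w is not thief and w.node == thief.node and w.tasks]
    remote = [w for w in workers
              if w is not thief and w.node != thief.node and w.tasks]
    for pool in (local, remote):
        if pool:
            return random.choice(pool).tasks.popleft()
    return None

def run(workers):
    """Sequential round-robin simulation: each worker executes one of
    its own tasks per round, or steals one when its deque is empty.
    Returns the number of tasks executed by each worker."""
    done = [0] * len(workers)
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.tasks.pop() if w.tasks else steal(w, workers)
            if task is not None:
                done[w.wid] += 1
    return done
```

Starting from a maximally imbalanced load (all tasks on one worker), the idle workers drain the loaded one and the work spreads evenly; a real runtime adds concurrency control and task spawning, which this sketch deliberately omits.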

Record of activities:

  E-Mail: Anne.Benoit at

  Last modified: Thu Dec 22 11:42:39 IST 2016