Inria ROMA project-team: Anne Benoit, Yves Robert, Aurélien Cavelan, Loic Pottier
Vanderbilt University (previously Computer Science and
Engineering, Pennsylvania State University): Padma Raghavan,
Guillaume Aupy (and previously Ben Heidorm, Humayun Kabir)
The Keystone project investigates sparse matrix and graph problems on NUMA multicores and CPU-GPU hybrid platforms. The goal is to improve the performance of the algorithms while accounting for failures and minimizing energy consumption.
The long-term objective is to design robust sparse-linear kernels for computing at extreme scale. To optimize the performance of these kernels, we plan to pay particular attention to locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in these applicative contexts.
We believe that the complementary expertise of the two teams in the
area of scheduling HPC applications at scale (ROMA -- models and
complexity; and SSCL -- architecture and applications) is the key to the success of this associate team. We have already successfully collaborated in the past and expect the collaboration to reach another level thanks to Keystone.
Scientific progress in 2016:
This year, we have worked on two axes to advance towards the goals of Keystone. On the one hand, we have come up with a model of cache partitioning in order to efficiently co-schedule applications, and we have studied this problem from both the theoretical and the practical sides. On the other hand, we have provided an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms.

Co-scheduling algorithms for cache-partitioned systems:
Our major contribution during the first year is the study of cache-partitioning techniques for co-scheduling applications.
Cache-partitioned architectures allow subsections of the
shared last-level cache (LLC) to be exclusively reserved for some
applications. This technique dramatically limits interactions between applications
that are concurrently executing on a multi-core machine.
Consider $n$ applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion time of the $n$ applications. Key scheduling questions are: (i) which proportion of the cache and (ii) how many processors should be given to each application?
Here, we assign rational numbers of processors to each application, since processors can be shared across applications through multi-threading.
We have provided answers to (i) and (ii) for perfectly parallel applications. Although the problem is shown to be NP-complete, we give key elements to determine the subset of applications that should share the LLC (while the remaining ones use only their smaller private caches). Building upon these results, we design efficient heuristics for general applications.
Extensive simulations demonstrate the usefulness of co-scheduling
when our efficient cache partitioning strategies are deployed.
This collaborative work has been submitted to an international conference, and an extended version is in preparation for submission to a journal special issue. It corresponds to Inria RR-8965 (November 2016).
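To give the flavor of the problem, here is a minimal sketch in Python. It is illustrative only and not the formulation of RR-8965: we assume perfectly parallel applications whose execution time is $w_i \cdot m(x_i) / p_i$, where $w_i$ is the work of application $i$, $x_i$ its LLC fraction, $p_i$ its (rational) processor share, and $m$ a made-up cache-miss penalty function. For perfectly parallel applications, the makespan is minimized when all applications finish at the same time, so only the cache partition remains to be searched.

    # Illustrative sketch only (hypothetical model, not the RR-8965 one):
    # co-schedule two perfectly parallel applications on p processors
    # sharing one last-level cache (LLC).

    def miss_penalty(x, alpha=0.5):
        # Hypothetical penalty: a larger cache fraction x lowers it.
        return 1.0 + alpha / (x + 0.05)

    def makespan(w1, w2, x, p):
        # Sharing the p processors so that both applications finish
        # together yields a common completion time of
        # (effective work of app 1 + effective work of app 2) / p.
        return (w1 * miss_penalty(x) + w2 * miss_penalty(1.0 - x)) / p

    def best_cache_split(w1, w2, p, steps=1000):
        # One-dimensional sweep over the LLC fraction x given to app 1.
        return min(((makespan(w1, w2, k / steps, p), k / steps)
                    for k in range(1, steps)), key=lambda t: t[0])

    # Example: a large and a small application on 8 processors.
    print(best_cache_split(100.0, 50.0, p=8))

With $n$ applications, the combinatorial part is to select which subset of applications shares the LLC, which is where the NP-completeness arises.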
Resilience techniques for scheduling algorithms:
In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms to cope with faults on large-scale parallel platforms.
We study checkpointing and show how to derive the optimal checkpointing period.
Then we explain how to combine checkpointing with fault prediction,
and discuss how the optimal period is modified when this combination is used.
Finally we follow the very same approach for the combination of checkpointing with
replication. This work corresponds to Inria RR-8971 (November 2016) and has been submitted as a chapter of a forthcoming CDER book (IEEE TCPP and NSF)
on teaching material for parallel and distributed computing.
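For the reader's convenience, we recall the classical first-order derivation of the optimal period (due to Young, and refined by Daly), under the assumption that the checkpoint cost $C$ is much smaller than the platform MTBF $\mu$. Checkpointing every $T$ units of time wastes a fraction $C/T$ of the time in checkpoints, while a failure strikes a given period with probability roughly $T/\mu$ and destroys half a period of work on average, hence

\[
\mathrm{Waste}(T) \;\approx\; \frac{C}{T} + \frac{T}{2\mu},
\qquad
\frac{d\,\mathrm{Waste}}{dT} = -\frac{C}{T^2} + \frac{1}{2\mu} = 0
\;\Longrightarrow\;
T_{\mathrm{opt}} = \sqrt{2\mu C}.
\]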
Work program for 2017:
The long-term objective of this collaboration is to design robust sparse-linear kernels
for computing at extreme scale. Achieving this goal is fundamental for HPC applications,
since it is estimated that supercomputers spend half their time solving linear systems, a
vast majority of which are sparse. However, it is a difficult goal that requires addressing several challenges: application-specific resilience techniques, silent error detection and mitigation, energy consumption, data locality, load balancing, solver robustness, etc. Most likely, next-generation kernels will not resemble current ones: they will have different preconditioners, embedded verification mechanisms, extended work-stealing capabilities, etc. This is a whole world of new paradigms to explore.
While we will not solve all of these problems by ourselves, we are well qualified to address some of them, thanks to our complementary expertise covering most of the algorithmic, software, and hardware aspects of this challenging goal.
For the second year of the associate team, we would like to develop locality-aware
algorithms. In particular, we plan to focus on developing models for NUMA
multicores and task graph models for sparse linear algebra kernels that represent
features at the appropriate level of detail to enable performance improvements through locality-aware scheduling. Because applications might perform differently when actually running (because of congestion, wear of the processors, etc.), we will have to cope with a certain level of non-clairvoyance about the applications that will be running. We will need to develop dynamic algorithms, such as work-stealing algorithms, to provide some robustness against this non-clairvoyance. We plan to extend the work already done on co-scheduling and to develop such dynamic algorithms.
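As a toy illustration of the kind of dynamic behavior we target, here is a minimal sequential simulation of work stealing (illustrative code, not one of the planned algorithms): task durations are revealed only when tasks execute, mimicking non-clairvoyance, and idle workers steal from the back of a random victim's deque.

    import random
    from collections import deque

    def work_stealing_makespan(task_lists, seed=0):
        # task_lists[w] holds the (hidden) durations of worker w's tasks.
        rng = random.Random(seed)
        queues = [deque(tasks) for tasks in task_lists]
        busy_until = [0.0] * len(queues)  # time at which worker w is free
        time = 0.0
        while any(queues):
            for w, q in enumerate(queues):
                if busy_until[w] <= time:
                    if q:            # run own task (front of own deque)
                        busy_until[w] = time + q.popleft()
                    else:            # steal from the back of a random victim
                        victims = [v for v, vq in enumerate(queues)
                                   if vq and v != w]
                        if victims:
                            busy_until[w] = time + queues[rng.choice(victims)].pop()
            # advance to the next completion time
            time = min((t for t in busy_until if t > time), default=time + 1.0)
        return max(busy_until)

    # Example: an unbalanced initial distribution over four workers.
    print(work_stealing_makespan([[3.0, 1.0, 2.0], [5.0], [], []]))  # -> 5.0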
Our approach is based on the following rationale.
First, many of the basic kernels required by such sparse applications, such as matrix-vector multiplication or breadth-first search, can be formulated as large ``packs'' of fine-grained independent tasks whose performance depends on reducing the cost of data movement between tasks. On NUMA architectures, we therefore seek models of ``data affinity aware'' scheduling, so that access latencies are reduced by increasing reuse in shared caches through improved temporal locality, as illustrated by the sketch below.
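As a concrete (and deliberately naive) illustration of data-affinity-aware grouping, the sketch below bins CSR row-block tasks of a sparse matrix-vector product by the average column they touch, so that tasks reusing the same portion of the input vector can be mapped to the same socket or cache partition. The function name and the binning rule are hypothetical, not part of the planned models.

    import numpy as np

    def affinity_groups(indptr, indices, block, n_groups, n_cols):
        # Tasks are CSR row blocks [start, end); each is binned by the
        # mean column index it touches, a crude proxy for data affinity.
        groups = [[] for _ in range(n_groups)]
        n_rows = len(indptr) - 1
        for start in range(0, n_rows, block):
            end = min(start + block, n_rows)
            cols = indices[indptr[start]:indptr[end]]
            g = int(cols.mean() * n_groups / n_cols) if len(cols) else 0
            groups[min(g, n_groups - 1)].append((start, end))
        return groups

    # Example: 4x8 CSR pattern where rows 0-1 touch columns 0-1 and
    # rows 2-3 touch columns 6-7; the two row pairs land in different groups.
    indptr = np.array([0, 2, 4, 6, 8])
    indices = np.array([0, 1, 0, 1, 6, 7, 6, 7])
    print(affinity_groups(indptr, indices, block=1, n_groups=2, n_cols=8))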
Second, we have found that dynamic schedules are often needed to overcome issues of core performance variability, and that NUMA-aware work-stealing then leads to significant performance gains. We seek theoretical models and complexity results that can improve our understanding of this observation, so as to provide scheduling frameworks that could be broadly used for such applications.
Third, we have found that when workloads of such scientific applications are mapped to multicores, the impact on each application differs, because of interference in the memory system or because of the potential to increase throughput by co-scheduling specific applications together. Consequently, we seek better models of workload scheduling that decrease the workload makespan, with corresponding gains in energy efficiency.
We detail below how we plan to initiate the work.
Record of activities:
April 5-7: Visit of Guillaume Aupy to Lyon to discuss with Anne Benoit, Yves Robert and Loic Pottier, and to initiate the work on co-scheduling algorithms for cache-partitioned systems. This visit was financed by Vanderbilt University.
May 16-17: Visit of Yves Robert to Vanderbilt University, just before the workshop of May 18-20.
May 18-20: Guillaume Aupy, Padma Raghavan and Yves Robert jointly organized the 11th Scheduling for Large Scale Systems Workshop, held at Vanderbilt University in Nashville, Tennessee. This invitation-only workshop was structured as a set of thematic half-day sessions, mainly focused on scheduling and algorithms for large-scale systems. It was organized in the context of the Keystone Associate Team and was a great opportunity for team members to discuss ongoing work.
June 12 - July 1: Visit of Guillaume Aupy to Lyon (financed by Keystone), during which we were able to better formalize the co-scheduling problem and make significant progress on it.
August 15-20: ICPP conference in Philadelphia, USA, which Guillaume Aupy, Anne Benoit, Loic Pottier and Yves Robert all attended, spending some time working on Keystone together. Travel was financed from other sources.
September 1-2: Visit of Yves Robert to Vanderbilt University to work with Guillaume Aupy and Padma Raghavan.
October 3-6: Anne Benoit, Padma Raghavan and Yves Robert attended the Workshop on Clusters, Clouds, and Data for Scientific Computing at La Maison des Contes, Dareizé, France. Travel was financed from other sources.
November 14-20: Guillaume Aupy, Anne Benoit, Padma Raghavan and Yves Robert all attended SC'16 in Salt Lake City and made progress on the Keystone project. Travel was financed from other sources.