Inria ROMA project-team: Anne Benoit, Yves Robert, Aurélien Cavelan, Loic Pottier
Vanderbilt University (previously Computer Science and
Engineering, Pennsylvania State University): Padma Raghavan,
Guillaume Aupy, Hongyang Sun (and previously Ben Heidorm, Humayun Kabir)
The Keystone project aims at investigating sparse matrix and graph problems on NUMA multicores and/or hybrid CPU-GPU platforms. The goal is to improve the performance of the algorithms while accounting for failures and minimizing energy consumption.
The long-term objective is to design robust sparse-linear kernels for computing at extreme scale. In order to optimize the performance of these kernels, we plan to pay particular attention to locality and data reuse. Finally, several real-life applications rely on these kernels, and the Keystone project will assess the performance and robustness of the scheduling algorithms in these applicative contexts.
We believe that the complementary expertise of the two teams in the
area of scheduling HPC applications at scale (ROMA -- models and
complexity; and SSCL -- architecture and applications) is the key to the success of this associate team. We have already successfully collaborated in the past and expect the collaboration to reach another level thanks to Keystone.
Scientific progress in 2016:
This year, we have worked on two axes to advance towards the goals of Keystone. On the one hand, we have
come up with a model of cache partitioning in order to efficiently co-schedule applications, and we have
studied this problem both from the theoretical and from the practical side. On the other hand, we have
provided an introduction to the design of scheduling algorithms to cope with faults on
large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems:
Our major contribution during the first year is the study of cache-partitioning techniques for co-scheduling applications.
Cache-partitioned architectures allow subsections of the
shared last-level cache (LLC) to be exclusively reserved for some
applications. This technique dramatically limits interactions between applications
that are concurrently executing on a multi-core machine.
Consider $n$ applications that execute concurrently, with the objective to minimize the makespan,
defined as the maximum completion time of the $n$ applications.
Key scheduling questions are: (i) which proportion of cache and (ii) how many
processors should be given to each application?
Here, we assign rational numbers of processors to each application,
since they can be shared across applications through multi-threading.
We have provided answers to (i) and (ii) for perfectly parallel applications.
Even though the problem is shown to be NP-complete, we give key elements to determine
the subset of applications that should share the LLC
(while the remaining ones only use their smaller private cache). Building upon these results,
we design efficient heuristics for general applications.
Extensive simulations demonstrate the usefulness of co-scheduling
when our efficient cache partitioning strategies are deployed.
This collaborative work has been submitted to an international conference, and an extended version is in
preparation for submission to a journal special-issue. It corresponds to Inria RR-8965 (November 2016).
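As a toy illustration of the subset-selection problem mentioned above, the brute-force sketch below (in Python) enumerates which subset of perfectly parallel applications shares the LLC, the others keeping only their private cache, and picks the subset with the smallest makespan; the timing model and all constants are placeholders invented for the example, not the model analyzed in the report.

    from itertools import combinations

    # Toy brute-force over the subset of applications sharing the LLC; the
    # others run out of their private cache only. The timing model below is
    # a hypothetical placeholder, not the one studied in the report.
    def makespan(apps, llc_subset):
        share = 1.0 / len(llc_subset) if llc_subset else 0.0
        return max(
            (w / p) * (1.0 + 0.5 * (1.0 - share) if i in llc_subset else 1.5)
            for i, (w, p) in enumerate(apps))

    def best_subset(apps):
        indices = range(len(apps))
        candidates = [set(c) for r in range(len(apps) + 1)
                      for c in combinations(indices, r)]
        return min(candidates, key=lambda s: makespan(apps, s))

    apps = [(100.0, 4.5), (80.0, 3.5), (60.0, 2.0)]  # (work, processor share)
    print(best_subset(apps))

The exponential enumeration is only meant to make the decision concrete; the heuristics designed in the report avoid it.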
Resilience techniques for scheduling algorithms:
In the framework of Keystone, we have written a report that provides an introduction to the design of scheduling algorithms to cope with faults on large-scale parallel platforms.
We study checkpointing and show how to derive the optimal checkpointing period.
Then we explain how to combine checkpointing with fault prediction,
and discuss how the optimal period is modified when this combination is used.
Finally we follow the very same approach for the combination of checkpointing with
replication. This work corresponds to Inria RR-8971 (November 2016) and has been submitted as a chapter of a forthcoming CDER book (IEEE TCPP and NSF)
on teaching material for parallel and distributed computing.
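As a concrete illustration of the first result covered in that report, the classical first-order approximation of the optimal checkpointing period (the Young/Daly formula) can be computed as below; the MTBF and checkpoint cost used here are made-up values for the example.

    from math import sqrt

    # Young/Daly first-order approximation of the optimal checkpointing
    # period: T_opt is roughly sqrt(2 * mu * C), where mu is the platform
    # MTBF and C the checkpoint cost. The values below are illustrative.
    def daly_period(mtbf, checkpoint_cost):
        return sqrt(2.0 * mtbf * checkpoint_cost)

    mu = 24 * 3600.0  # assumed platform MTBF: one day (in seconds)
    C = 600.0         # assumed checkpoint cost: ten minutes
    print(daly_period(mu, C))  # about 10,182 seconds between checkpoints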
Scientific progress in 2017:
During the second year of the associate team, we have worked on two
major axes to advance towards the goals of Keystone. Firstly, we have
further refined our model of cache partitioning in order to
efficiently co-schedule applications, and we have pursued the study of
this problem from the theoretical side. Furthermore, we have gained
access to machines on which we can experiment with cache partitioning,
and we have initiated a set of real-life experiments on linear algebra
kernels. Secondly, we have pursued some collaboration on resilience
techniques for scheduling algorithms, aiming at coping with faults on
large-scale parallel platforms.
Co-scheduling algorithms for cache-partitioned systems:
We have pursued the study of cache-partitioning techniques for
co-scheduling applications. Cache-partitioned architectures allow
subsections of the shared last-level cache (LLC) to be exclusively
reserved for some applications. This technique dramatically limits
interactions between applications that are concurrently executing on a
multi-core machine. Consider n applications that execute concurrently,
with the objective to minimize the makespan, defined as the maximum
completion time of the n applications. Key scheduling questions are:
(i) which proportion of cache and (ii) how many processors should be
given to each application? In this work, we provide answers to (i) and
(ii) for both Amdahl applications and perfectly parallel
applications. Even though the problem is shown to be NP-complete, we
give key elements to determine the subset of applications that should
share the LLC (while the remaining ones only use their smaller private
cache). Building upon these results, we design efficient heuristics
for Amdahl applications. Extensive simulations demonstrate the
usefulness of co-scheduling when our efficient cache partitioning
strategies are deployed. Both a workshop paper (perfectly parallel
applications and rational number of processors) and a journal paper
(extending to Amdahl applications and integer number of processors)
have been published in 2017 with these results.
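To illustrate the Amdahl variant, the sketch below evaluates the makespan of one candidate assignment of integer processor counts and cache fractions; Amdahl's law gives the parallel execution time, while the cache penalty term and all numerical values are assumptions made for the example.

    # Illustrative evaluation of one candidate co-schedule for Amdahl
    # applications with integer processor counts. The cache penalty is a
    # hypothetical placeholder, not the model of the published papers.
    def amdahl_time(work, seq_fraction, procs, cache_fraction):
        parallel_time = work * (seq_fraction + (1.0 - seq_fraction) / procs)
        return parallel_time * (1.0 + 0.3 * (1.0 - cache_fraction))  # assumed

    def makespan(apps, alloc):
        # apps: list of (work, sequential fraction); alloc: (cores, LLC fraction)
        return max(amdahl_time(w, s, p, x)
                   for (w, s), (p, x) in zip(apps, alloc))

    apps = [(100.0, 0.05), (80.0, 0.10), (60.0, 0.02)]
    alloc = [(6, 0.5), (5, 0.3), (3, 0.2)]  # 14 cores and the LLC shared out
    print(makespan(apps, alloc))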
Following these encouraging theoretical and simulated results, we have gained access to
a machine on which it is possible to experiment with cache
partitioning. Indeed, the Cache Allocation Technology (CAT) is a
recently released Intel technology, part of the Intel Resource
Director Technology (RDT). Basically, CAT is an interface for the
operating system to group applications into classes of service
(COS). Each class of service describes the amount of resources the
applications assigned to that class can use. Resources can be amount
of cache or memory bandwidth. We are therefore able to partition the
cache, building upon this technology.
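For instance, one plausible way to drive the partitioning is through Intel's pqos command-line tool (from the intel-cmt-cat package), as sketched below; the bitmasks, core lists and exact invocation are assumptions made for illustration and may need adjusting on a given machine.

    import subprocess

    # Illustrative use of Intel CAT through the pqos tool: the bitmasks
    # split a 20-way LLC into a 12-way and an 8-way partition, and each
    # partition is bound to a group of cores. Values are examples only.
    def run(cmd):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Define two classes of service (COS) over disjoint LLC ways.
    run('pqos -e "llc:1=0x00fff;llc:2=0xff000"')

    # Cores 0-6 (first application) use COS 1, cores 7-13 use COS 2.
    run('pqos -a "llc:1=0-6;llc:2=7-13"')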
We run several applications concurrently on an Intel Xeon E5-2650L v4
with 14 processing units and a shared 35 MB LLC. Slots of 5% of the
cache can be allocated to applications. We consider six classical HPC
applications from the NAS Parallel Benchmark suite: CG (Conjugate
Gradient), BT (Block Tridiagonal), LU (Lower-Upper symmetric
Gauss-Seidel), SP (Scalar Pentadiagonal), MG (MultiGrid) and FT (Fast
Fourier Transform). In the first results, we have clearly identified
cases where it was very useful to use the cache partitioning feature
and to judiciously decide which fraction of cache to assign to each
application. We are currently refining the model to better fit the
experimental observations. We plan to submit a paper reporting these
experimental results early next year.
Using replication to handle fail-stop and silent errors on large-scale platforms:
With Hongyang Sun joining Keystone in 2017 as a researcher at
Vanderbilt University, we have strengthened our collaboration on
resilience techniques. We have worked this year on providing a model
and an analytical study of replication as a technique to detect and
correct silent errors, as well as to cope with both silent and
fail-stop errors on large-scale platforms. Fail-stop errors are
immediately detected, unlike silent errors for which a detection
mechanism is required. To detect silent errors, many
application-specific techniques are available, based either on
algorithm-based fault tolerance (ABFT), invariant preservation, or data analytics, but
replication remains the most transparent and least intrusive
technique. We explore the right level (duplication, triplication or
more) of replication for two frameworks: (i) when the platform is
subject only to silent errors, and (ii) when the platform is subject
to both silent and fail-stop errors. A higher level of replication is
more expensive in terms of resource usage but makes it possible to tolerate
more errors and to correct some silent errors, hence there is a
trade-off to be found. Replication is combined with checkpointing and
comes with two flavors: process replication and group
replication. Process replication applies to message-passing
applications with communicating processes. Each process is
replicated, and the platform is composed of process pairs, or
triplets. Group replication applies to black-box applications, whose
parallel execution is replicated several times. The platform is
partitioned into two halves (or three thirds). In both scenarios,
results are compared before each checkpoint, which is taken only when
both results (duplication) or two out of three results (triplication)
coincide. Otherwise, one or more silent errors have been detected, and
the application rolls back to the last checkpoint; it also rolls back
whenever a fail-stop error strikes.
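A minimal sketch of this comparison step, for duplication and triplication, could look as follows; the digest values and the checkpoint/rollback callbacks are placeholders for whatever mechanism the application actually uses.

    # Illustrative comparison of replica results before a checkpoint.
    # With duplication a mismatch only detects a silent error; with
    # triplication a majority vote can also correct it.
    def compare_and_checkpoint(digests, take_checkpoint, rollback):
        if len(digests) == 2:
            if digests[0] == digests[1]:
                take_checkpoint(digests[0])   # results agree: safe to checkpoint
            else:
                rollback()                    # silent error detected, not correctable
        elif len(digests) == 3:
            for d in digests:
                if digests.count(d) >= 2:     # two out of three coincide
                    take_checkpoint(d)        # the majority value is checkpointed
                    return
            rollback()                        # no majority: roll back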
We have provided a detailed analytical study for all of these
scenarios, with formulas to decide, for each scenario, the optimal
parameters as a function of the error rate, checkpoint cost, and
platform size. We have also reported a set of extensive simulation
results that nicely corroborates the analytical model. Preliminary
results dealing only with silent errors have been accepted and
presented at the FTXS workshop, in conjunction with HPDC'17,
Washington DC, USA, in June 2017. An extended version considering both
fail-stop and silent errors has been submitted to JPDC.
Work program for 2018:
The long-term objective of this collaboration is to design robust
sparse-linear kernels for computing at extreme scale. Achieving this
goal is fundamental for HPC applications, since it is estimated that
supercomputers spend half their time solving linear systems, a vast
majority of them being sparse. However, it is a difficult goal that
requires solving several challenges dealing with
application-specific resilience techniques, silent error detection and
mitigation, energy consumption, data locality, load-balancing, solver
robustness, etc. Most likely, next-generation kernels will not
resemble current ones: they will have different preconditioners,
embedded verification mechanisms, extended work-stealing
capabilities, etc. This is a whole world of new paradigms to
explore. While we will not solve all problems by ourselves, we are
well-qualified to address some of them, due to our complementary
expertises that cover most of the algorithmic, software and hardware
aspects of this challenging goal.
During the two first years of the associate team, we have made
considerable progress on the comprehension of co-scheduling
challenges, as well as on resilience techniques. We have not yet been
able to develop locality-aware algorithms, and this is definitely in
our plans for next year. In particular, we plan to focus on developing
models for NUMA multicores and task graph models for sparse linear
algebra kernels that represent features at the appropriate detail to
enable performance improvements through efficient algorithms. Through
our experiments, we have seen that applications do not necessarily
follow our models, hence it seems worthwhile to account for a certain level
of non-clairvoyance about the actual applications that will be
running. We would therefore need to develop dynamic algorithms such as
work-stealing algorithms to provide some robustness for this
non-clairvoyance, and this is part of our work plan for next year. We
seek theoretical models and complexity results that can improve our
understanding of work-stealing to provide scheduling frameworks that
could be broadly used for such applications.
Another direction for next year is to study the impact of
co-scheduling, cache partitioning and resilience on energy consumption
to tackle the last topic targeted by Keystone. Finally, we will aim at
validating the results on real computational kernels. The experiments
started on cache partitioning are very promising towards this goal.
Record of activities:
April 5-7: Visit of Guillaume Aupy in Lyon to discuss with Anne Benoit, Yves Robert and Loic Pottier and initiate the work on co-scheduling algorithms for cache-partitioned systems. This visit was financed by Vanderbilt University.
May 16-17: Visit of Yves Robert to Vanderbilt University, just before the workshop of May 18-20.
May 18-20: Guillaume Aupy, Padma Raghavan and Yves Robert jointly organized a workshop, the 11th Scheduling for Large Scale Systems Workshop, held at Vanderbilt University in Nashville, Tennessee. This workshop, by invitation only, was structured as a set of thematic half-day sessions, mainly focused on scheduling and algorithms for large-scale systems. It was organized in the context of the Keystone Associate Team, and was a great opportunity for members of the team to discuss current work within the Associate Team.
June 12 - July 1: Visit of Guillaume Aupy in Lyon (financed by Keystone), during which we were able to better formalize the co-scheduling problem and make great progress on it.
August 15-20: ICPP conference in Philadelphia, USA, which Guillaume Aupy, Anne Benoit, Loic Pottier and Yves Robert all attended, spending some time working on Keystone together. Travel was financed from other funding sources.
September 1-2: Visit of Yves Robert to Vanderbilt University to work with Guillaume Aupy and Padma Raghavan.
October 3-6: Anne Benoit, Padma Raghavan and Yves Robert attended the Workshop on Clusters, Clouds, and Data for Scientific Computing at La Maison des Contes, Dareizé, France. Travel was financed from other funding sources.
November 14-20: Guillaume Aupy, Anne Benoit, Padma Raghavan and
Yves Robert all attended SC'16 in Salt Lake City and made progress on
the Keystone project. Travel was financed from other funding sources.
January 16-18: Guillaume Aupy, Anne Benoit, Padma Raghavan
and Yves Robert attended a meeting for the Supercomputing
conference SC'17 in Charleston, USA, and they took this opportunity
to discuss current and future plans for Keystone for 2017.
May 25-27: Most of the Keystone participants gathered at the
Scheduling at scale workshop organized in Knoxville, TN (Y. Robert
was co-organizing the workshop). Three presentations were given by Keystone members.
May 29 - June 2: Most of the Keystone participants gathered at
the IPDPS'17 conference, Orlando, USA, where the first cache
partitioning paper was presented at the APDCM workshop.
June 5-7: Guillaume Aupy, Anne Benoit and Padma Raghavan attended
the SC'17 Program Committee meeting in Denver, USA, where they
pursued discussions on cache partitioning and resilience.
June 26-30: Aurélien Cavelan and Hongyang Sun both attended the
FTXS'17 workshop in conjunction with HPDC'17 in Washington DC, USA,
and presented the Keystone paper on resilience. They gathered a few
days after the workshop to work on the extension, which was
subsequently submitted to JPDC.
July 3-7: Guillaume Aupy visited the Inria team in Lyon to
actively discuss the cache-partitioning experimental work, refine
the models, and establish a detailed roadmap, after we obtained the
first preliminary results.
November 14-20: Guillaume Aupy, Anne Benoit, Padma Raghavan and
Yves Robert will all attend SC'17 in Denver, CO, and look forward to
other opportunities to work on our Keystone project.