17th Scheduling for large-scale systems workshop
Aussois, France, June 24 - 27, 2024
Slides and abstracts
- Day 1: Monday, June 24
- 16.30-18.00: Session 1
- Bruno Gaujal - The optimal frequency of a processor - Slides
Abstract: In this talk, we consider the execution of a single task with unknown size on top of a service system that offers a limited number of processing speeds, say N, and investigate the problem of finding a speed profile that minimizes the resulting energy consumption subject to a deadline constraint. Existing works mainly investigated this problem assuming unbounded speed profiles. In contrast, we consider discontinuous speed profiles, i.e., a case that arises naturally when the underlying computational platform offers a finite number of speeds. We show that the computation of an optimal speed profile boils down to solving a convex optimization problem. Under mild assumptions, we prove some structural results that yield the formulation of an extremely efficient solution algorithm. Specifically, we show that the optimal speed profile can be computed by solving O(log N) one-dimensional equations.
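The structure behind this efficiency is easiest to see in the simpler, classical setting where the task size is known: with a convex power function and discrete speeds, an optimal profile mixes only the two available speeds adjacent to the ideal continuous speed. A minimal Python sketch of that known-size case (illustrative only; not the algorithm from the talk, which handles unknown task sizes):

```python
import bisect

def two_speed_profile(speeds, work, deadline):
    """Energy-minimal schedule of a task of known size under a deadline,
    using at most two of the available discrete speeds. Assumes a convex
    power function and `speeds` sorted in increasing order."""
    s_star = work / deadline                  # ideal continuous speed
    if s_star <= speeds[0]:                   # slowest speed meets the deadline
        return [(speeds[0], work / speeds[0])]
    if s_star > speeds[-1]:
        raise ValueError("deadline infeasible even at maximum speed")
    i = bisect.bisect_left(speeds, s_star)
    s_lo, s_hi = speeds[i - 1], speeds[i]
    # Split the deadline so that exactly `work` units get processed:
    # t_hi * s_hi + t_lo * s_lo = work  and  t_hi + t_lo = deadline.
    t_hi = (work - s_lo * deadline) / (s_hi - s_lo)
    return [(s_hi, t_hi), (s_lo, deadline - t_hi)]

print(two_speed_profile([0.5, 1.0, 1.5, 2.0], work=1.7, deadline=1.0))
# ~[(2.0, 0.4), (1.5, 0.6)]
```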
- Olivier Beaumont - An Approximation Algorithm for Scheduling with Rejection Costs Proportional to Processing Times - Slides
Abstract: We address an offline job scheduling problem where jobs can either be processed on a limited supply of energy-efficient machines, or offloaded to energy-inefficient machines (with an unlimited supply), and the goal is to minimize the total energy consumed in processing all tasks. This scheduling problem can be formulated as a problem of scheduling with rejection, where rejecting a job corresponds to processing it on an energy-inefficient machine and has a cost directly proportional to the processing time of the job. To solve this scheduling problem, we introduce a novel (5/4)(1+ε)-approximation algorithm, BEKP, by relating it to a Multiple Subset Sum problem. Our algorithm improves over the existing literature, which provides a (3/2 - 1/(2m))-approximation for scheduling with arbitrary rejection costs. We evaluate and discuss the effectiveness of our approach through a series of experiments, comparing it to existing algorithms.
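Since rejection costs are proportional to processing times, minimizing the total cost amounts to packing as much processing time as possible onto the energy-efficient machines, which is where the Multiple Subset Sum connection comes from. A single-machine toy version of that packing step (a plain subset-sum computation over a hypothetical capacity, not the BEKP algorithm itself):

```python
def max_packed_time(proc_times, capacity):
    """Subset-sum: the largest total processing time that fits on one
    energy-efficient machine of the given capacity. Jobs left out are
    'rejected', i.e. offloaded at a cost proportional to their length."""
    reachable = {0}
    for p in proc_times:
        reachable |= {t + p for t in reachable if t + p <= capacity}
    return max(reachable)

jobs = [4, 7, 3, 5]                    # processing times
packed = max_packed_time(jobs, capacity=10)
print(packed, sum(jobs) - packed)      # 10 packed, 9 offloaded
```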
- Alix Munier - Design of a pipelined architecture for the execution of a Neural Network - Slides
Abstract: In this presentation, we introduce an optimization method for designing a pipelined architecture for inference execution of a feed-forward neural network. The computational elements are AI accelerators whose performance (latency, area, power) we have evaluated using High Level Synthesis tools. The problem consists of setting the architecture parameters for a fixed neural network so as to guarantee a minimum throughput while minimizing area or power. The problem can be solved in polynomial time using a classical dynamic programming approach. We also present experimental results that illustrate the benefits of the method.
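The abstract does not spell the dynamic program out; the sketch below shows a classical formulation of this kind of design problem under assumptions of our own: consecutive layers are grouped into pipeline stages, each candidate stage has an HLS-derived latency and area, throughput is bounded by the slowest stage, and the goal is to minimize total area.

```python
def min_area(n_layers, latency, area, max_stage_latency):
    """dp[j] = minimal total area to map layers 0..j-1 onto pipeline stages.
    A stage covering layers i..j-1 is feasible only if its latency stays
    below the target (i.e., throughput >= 1 / max_stage_latency)."""
    INF = float("inf")
    dp = [0.0] + [INF] * n_layers
    for j in range(1, n_layers + 1):
        for i in range(j):
            if latency(i, j) <= max_stage_latency:
                dp[j] = min(dp[j], dp[i] + area(i, j))
    return dp[n_layers]

# Toy costs: a stage's latency/area is the sum over its layers, with a
# made-up 10% area saving when layers are fused into one stage.
lat = [2.0, 1.0, 3.0, 1.5]
ar = [5.0, 2.0, 8.0, 3.0]
print(min_area(4, lambda i, j: sum(lat[i:j]),
               lambda i, j: 0.9 * sum(ar[i:j]), max_stage_latency=3.5))
```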
- Day 2: Tuesday, June 25
- 9.00-10.30: Session 2
- Mathieu Faverge - Dynamic Tasks Scheduling with Multiple Priorities on Heterogeneous Computing Systems - Slides
Abstract: The efficient utilization of heterogeneous computing systems is crucial for scientists and industrial organizations to execute computationally intensive applications. Task-based programming has emerged as an effective approach for harnessing the processing power of these systems. However, effective scheduling of task-based applications is critical to achieving high performance. Typically, these applications are represented as directed acyclic graphs (DAGs), which can be optimized through careful scheduling to minimize execution time and maximize resource utilization. In this talk, we introduce MultiPrio, a dynamic task scheduler that aims to minimize the overall completion time of parallelized task-based applications. The goal is to find a trade-off between resource affinity, task criticality, and workload balancing on the resources. To this end, we compute scores for each task and manage the available tasks in the system with a data structure based on a set of priority queues. Tasks are assigned to available resources according to these scores, which are dynamically computed by heuristics based on task affinity and criticality. We also consider workload balancing across resources and data locality awareness. To evaluate the scheduler, we study the performance of dense and sparse linear algebra task-based applications and a task-based FMM application using the StarPU runtime system on heterogeneous nodes. Our scheduler compares favorably with other state-of-the-art schedulers in StarPU and achieves performance gains for irregular workloads.
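As a rough illustration of the data structure described above (a toy, not the actual MultiPrio implementation), one can keep one priority queue of ready tasks per resource class and pop the highest-score task when a worker goes idle; the score weighting here is arbitrary:

```python
import heapq

class MultiQueueScheduler:
    """One max-heap of ready tasks per resource class, ordered by a score
    mixing affinity and criticality. For brevity, a task popped from one
    queue is not removed from the others, as a real scheduler must do."""
    def __init__(self, resource_classes):
        self.queues = {rc: [] for rc in resource_classes}

    def submit(self, task, affinity, criticality):
        for rc, aff in affinity.items():
            score = 0.7 * aff + 0.3 * criticality   # made-up weighting
            heapq.heappush(self.queues[rc], (-score, task))

    def next_task(self, resource_class):
        q = self.queues[resource_class]
        return heapq.heappop(q)[1] if q else None

sched = MultiQueueScheduler(["cpu", "gpu"])
sched.submit("gemm", affinity={"cpu": 0.2, "gpu": 0.9}, criticality=0.8)
sched.submit("scan", affinity={"cpu": 0.8, "gpu": 0.1}, criticality=0.3)
print(sched.next_task("gpu"))   # gemm
```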
- Abdou Guermouche - Dynamic Task Graph Adaptation with Recursive Tasks - Slides
Abstract: The efficiency of both heterogeneous and homogeneous parallel systems can be significantly improved by using task-based programming models. Among these models, the Sequential Task Flow (STF) model is widely embraced since it efficiently handles task graphs while offering ample optimization perspectives. However, STF is constrained to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. For instance, in heterogeneous systems, the optimal task size varies across different processing units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into subgraphs dynamically at runtime. We introduce a new component called the splitter, positioned between task submission and scheduling, which decides when to split tasks into subgraphs. Its splitting policy relies on an area-based bound aimed at minimizing execution time. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on GPUs. We present an evaluation on both homogeneous and heterogeneous architectures. These results show that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.
- Hatem Ltaief - HPC in Healthcare - Slides
Abstract: We exploit the widening margin in peak tensor core performance between FP64 and FP16/INT8 on NVIDIA A100 GPUs, and between FP64 and FP8/INT8 on H100 GPUs, to boost the performance of output accuracy-preserving mixed-precision computation of multivariate Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebra techniques, motivated by reducing data motion, gain enhanced significance with new low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate the KRR matrix generation by redesigning the norm computation for Euclidean distances to engage INT8 tensor cores while exploiting its symmetry. We accelerate the solution of regularized KRR systems of equations by deploying a new four-precision Cholesky-based solver, which exceeds the FP64-only KRR throughput by 11X on A100s and outperforms the state-of-the-art REGENIE software by 133X.
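The norm-computation redesign builds on the standard identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, which turns all pairwise Euclidean distances into a single symmetric matrix product plus cheap corrections. A NumPy sketch of just that identity (the INT8 tensor-core and four-precision Cholesky machinery from the talk is not reproduced here):

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X via one GEMM:
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y"""
    G = X @ X.T                        # Gram matrix: the GEMM/SYRK part
    sq = np.diag(G)                    # ||x_i||^2
    D = sq[:, None] + sq[None, :] - 2.0 * G
    return np.maximum(D, 0.0)          # clamp tiny negatives from rounding

X = np.random.default_rng(0).standard_normal((5, 3))
ref = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
print(np.allclose(pairwise_sq_dists(X), ref))   # True
```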
- 11.00-12.00: Session 3
- Valentin Le Fèvre - Stencil computations on Tensor Cores - Slides
Abstract: In this talk, we take a look at the possibilities offered by Tensor Cores, hardware specialized in matrix-matrix computations embedded in recent GPUs, to enhance the performance of scientific applications. We focus on stencil computations, a common pattern used in seismic imaging, for example. We first review the evolution of Tensor Cores in recent years, both in terms of performance and of supported data types. Then, we review two recent publications that optimize stencil computations on Tensor Cores through a hand-written kernel. Finally, we present a new formulation of the stencil algorithm based exclusively on GEMM operations. Promising results show that this approach 1) is feasible, 2) is portable to any GPU architecture by relying on vendor-optimized libraries such as cuBLAS, and 3) can take advantage of mixed-precision Tensor Core instructions to increase performance.
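To make the GEMM formulation concrete, here is a bare-bones NumPy version of the idea for a 1-D 3-point stencil: the whole sweep over many independent lines becomes one matrix product with a banded operator (our own toy, not the formulation from the talk, which would exploit the band structure and tensor cores rather than a dense operator):

```python
import numpy as np

def stencil_as_gemm(U, coeffs=(0.25, 0.5, 0.25)):
    """Apply a 1-D 3-point stencil to every column of U as a single GEMM:
    build the banded stencil operator A and return A @ U. Boundary rows
    are left untouched for simplicity."""
    n = U.shape[0]
    a, b, c = coeffs
    A = np.eye(n)
    for i in range(1, n - 1):
        A[i, i - 1 : i + 2] = (a, b, c)
    return A @ U

U = np.random.default_rng(1).standard_normal((8, 4))
ref = U.copy()
ref[1:-1] = 0.25 * U[:-2] + 0.5 * U[1:-1] + 0.25 * U[2:]
print(np.allclose(stencil_as_gemm(U), ref))   # True
```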
- Maxime Gonthier - D-Rex: Dynamic Data Replication for Unreliable Heterogeneous Nodes - Slides
Abstract: Data replication is essential for preserving critical data in today's HPC systems. Erasure coding is used as a way to reduce storage overhead while maintaining a high replication factor. However, replication comes at a significant cost in terms of both storage space and time. Moreover, current replication schemes such as HDFS or GlusterFS are static and do not take into account the potential heterogeneity of storage nodes. We propose D-Rex, a dynamic data replication algorithm that uses information about system saturation and storage node failure rates to reduce the overall storage overhead and replication time when storing a set of data on a heterogeneous set of nodes.
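As a toy illustration of the kind of trade-off involved (a hypothetical model, not the D-Rex algorithm): given per-node failure probabilities and free space, pick a replica set whose joint survival probability meets a reliability target.

```python
from itertools import combinations

def pick_nodes(nodes, size, k, target_reliability):
    """Return the first k-node combination with enough free space whose
    probability that at least one replica survives meets the target.
    `nodes` maps name -> (failure_probability, free_space). Brute force
    for clarity; a real system would also minimize storage and time."""
    for combo in combinations(nodes, k):
        if any(nodes[n][1] < size for n in combo):
            continue                              # not enough free space
        p_all_fail = 1.0
        for n in combo:
            p_all_fail *= nodes[n][0]
        if 1.0 - p_all_fail >= target_reliability:
            return combo
    return None

nodes = {"a": (0.10, 50), "b": (0.05, 20), "c": (0.20, 80), "d": (0.01, 5)}
print(pick_nodes(nodes, size=30, k=2, target_reliability=0.97))  # ('a', 'c')
```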
- 15.00-16.30: Session 4
- Jean-Marc Nicod - DATAZERO: Powering a Data Center Disconnected from the Grid - Slides
Abstract: Given the sheer size of the world's data centers and their huge energy consumption, the approaches traditionally proposed optimize the scheduling of the tasks performed in order to minimize the energy consumed. To meet the challenges of climate change, other initiatives are needed. These include the implementation of green energy sources to limit fossil fuel consumption and CO2 emissions. As part of the ANR DATAZERO and DATAZERO2 projects, several teams are working to define the main concepts of a completely green data center, powered solely by renewable energies. To achieve this goal, it is necessary to focus on the efficient management of an autonomous hybrid power system made up of solar panels, wind turbines, batteries and hydrogen systems. In this talk, we propose a model based on a mixed integer linear program to optimize the engagement of energy sources, and in particular storage sources. The approach takes into account seasonality and weather forecasts at the time of optimization. The model presented is an improved version of one originally published in the Journal of Scheduling.
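A heavily simplified, single-horizon flavor of such a source-engagement program, written with the PuLP modeling library (all numbers, bounds, and the binary engagement variable are illustrative; this is not the DATAZERO model):

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

T = 4                                    # time steps
demand = [10, 14, 8, 12]                 # kW drawn by the IT load
solar = [6, 12, 10, 2]                   # kW of forecast production
batt = [LpVariable(f"batt_{t}", 0, 5) for t in range(T)]      # kW discharged
h2 = [LpVariable(f"h2_{t}", 0, 20) for t in range(T)]         # kW from fuel cell
on = [LpVariable(f"on_{t}", cat="Binary") for t in range(T)]  # engage battery?

prob = LpProblem("engagement", LpMinimize)
prob += lpSum(3 * h2[t] + batt[t] + 0.5 * on[t] for t in range(T))  # cost
for t in range(T):
    prob += solar[t] + batt[t] + h2[t] >= demand[t]  # cover the load
    prob += batt[t] <= 5 * on[t]                     # discharge only if engaged
prob += lpSum(batt) <= 12                            # crude capacity budget

prob.solve()
print([v.value() for v in batt], [v.value() for v in h2])
```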
- Georges Da Costa - Power and Performance measures for realistic HPC models - Slides
Abstract: Energy usage is now a major concern in high performance computing (HPC) due to the ecological impact of electricity production, but also to its cost. To optimize the energy efficiency of supercomputers, researchers usually base their work on models, as accessing actual platforms is complex and costly. One of the most studied methods consists in changing the processor frequency, but it impacts power, performance and energy in a complex way. We propose to bridge the gap between the theoretical and the practical approaches. To do so, we propose a multi-cluster, multi-application model that accurately describes, from a theoretical point of view, the power and performance of applications subject to DVFS (dynamic voltage and frequency scaling). We also show how to use it in a runtime system with minimal overhead, using only a few hardware performance counters and RAPL. We also investigate the complexity of reproducible experiments, the intrinsic heterogeneity of homogeneous datacenters, and how to realistically simulate datacenters using the Replay with Feedback methodology.
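A minimal flavor of the kind of analytical model at stake, under assumptions of our own: a roofline-style runtime whose compute part scales with frequency while the memory-bound part does not, plus a static-plus-cubic power term; the energy-optimal frequency then lies strictly inside the available range.

```python
def exec_time(f, compute_s=8.0, memory_s=2.0, f_ref=2.0):
    """Runtime (s) at frequency f (GHz): the compute part scales with f,
    the memory-bound part does not (a crude roofline-style split)."""
    return compute_s * f_ref / f + memory_s

def power(f, p_static=20.0, c=4.0):
    """Node power (W): static part plus a cubic dynamic term in f."""
    return p_static + c * f**3

freqs = [1.0 + 0.1 * i for i in range(13)]           # 1.0 .. 2.2 GHz
energy = {f: power(f) * exec_time(f) for f in freqs}
best = min(energy, key=energy.get)
print(f"energy-optimal frequency: {best:.1f} GHz ({energy[best]:.0f} J)")
```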
- Fanny Dufossé - Environmental impact of HPC - Slides
- 17.00-18.15: Session 5
- Adrien Obrecht - Scheduling task graphs for average memory consumption reduction - Slides
- Alix Tremodeux - Fault-tolerant numerical iterative algorithms at scale - Slides
- Damien Lesens - Greedy algorithms for computing the Birkhoff decomposition - Slides
- Day 3: Wednesday, June 26
- 9.00-10.30: Session 6
- Valentin Honoré - Pallas: HPC Trace Analysis at Scale - Slides
Abstract: Traces are used in HPC for post-mortem performance analysis; they are a useful tool for investigating performance problems in applications. However, identifying a performance bottleneck often requires collecting lots of information, which causes traces to become huge. This problem gets worse for large-scale applications that run many threads for a long time. In addition to the problem of storing these large traces, another problem arises when analyzing them: the analysis tool needs to process gigabytes, or even terabytes, of data, which is time-consuming. However, it has been shown that many HPC applications have recurring patterns, that timing data is the heaviest part of a trace, and that similar events have similar durations, meaning they can be efficiently compressed. We propose a new trace format named Pallas, which uses the regularity of HPC applications to provide both quick and efficient post-mortem analysis and lightweight traces. Pallas is a library that provides tracing tools with event storage functionalities. When writing a trace, Pallas automatically detects patterns and stores statistical data for later analysis. The trace is then stored by separating the timestamps from the structure, which allows loading and analyzing the structure separately from the timestamps and grants near-instantaneous analysis when the timestamps are not needed. Our implementation provides an OTF2-like API, which allows tracing tools such as EZTrace to use it transparently. We evaluate Pallas by comparing it with OTF2 and Pilgrim on several applications, and by running several types of performance analysis on execution traces. Our experiments show that Pallas reduces the size of traces in most cases and allows executing several types of performance analysis in near-constant time.
- Julien Langou - The I/O Requirements of Various Numerical Linear Algebra Kernels - Slides
Abstract: When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that strongly impacts performance and energy consumption. The objective of I/O complexity analysis is to compute, for a given program, its minimal I/O requirement among all valid schedules. We consider a sequential execution model with two memories: an infinite one, and a small one of size S on which the computations retrieve and produce data. The I/O cost is the number of reads and writes between the two memories. Within this model, we review various Numerical Linear Algebra kernels of increasing complexity: from matrix-matrix multiplication to LU factorization, then symmetric rank-k update and Cholesky factorization, and finally Modified Gram-Schmidt and Householder QR factorization. We will also show practical examples of these results.
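For the first kernel in this progression, classical (non-Strassen) matrix-matrix multiplication C = AB with A of size m x k and B of size k x n, the tight data-movement lower bound on a two-level memory with fast capacity S is, up to lower-order terms that vary across papers:

```latex
% I/O lower bound for classical matrix multiplication with a fast
% memory of size S. The leading constant 2 is known to be tight;
% the lower-order terms differ between derivations.
Q(m,n,k,S) \;\geq\; \frac{2\,mnk}{\sqrt{S}} \;-\; O(S)
```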
- Jonas Müller Korndörfer - DaphneSched: A Scheduler for Integrated Data Analysis Pipelines - Slides
Abstract: DAPHNE is a new open-source software infrastructure designed to address the increasing demands of integrated data analysis (IDA) pipelines, comprising data management (DM), high performance computing (HPC), and machine learning (ML) systems. Efficiently executing IDA pipelines is challenging due to their diverse computing characteristics and demands; IDA pipelines executed with the DAPHNE infrastructure therefore require an efficient and versatile scheduler. This presentation introduces DaphneSched, the task-based scheduler at the core of DAPHNE. DaphneSched is versatile, incorporating twelve task-partitioning schemes, two queue organizations, and three task-assignment techniques, bringing state-of-the-art task scheduling closer to the state of practice.
- 11.00-12.00: Session 7
- Ana Gainaru - Strategies for querying large-scale scientific data - Slides
Abstract: Scientific data analysis often involves complex queries across distributed datasets, requiring manipulation of multiple primary variables and generating derived data that needs to be handled efficiently. We investigate in this talk the performance of different approaches where applications define derived variables as quantities of interest (QoIs) and offload the computation and transfer of these QoIs to the I/O library. We look at a detailed analysis of the performance-storage trade-offs associated with different solutions and highlight opportunities for modeling the impact of read/write patterns on the performance of the query operation. We showcase results for our study on two large-scale datasets created from climate and combustion simulations.
- Anthony Dugois - Solving the Restricted Assignment Problem to Schedule Multi-Get Requests in Key-Value Stores - Slides
Abstract: Modern distributed key-value stores, such as Apache Cassandra, enhance performance through multi-get requests, minimizing network round-trips between the client and the database. However, partitioning these requests for appropriate storage-server distribution is non-trivial and may result in imbalances. This study addresses this optimization challenge as the Restricted Assignment problem on Intervals (RAI). We propose an efficient (2-1/m)-approximation algorithm, where m is the number of machines. Then, we generalize the problem to the Restricted Assignment problem on Circular Intervals (RACI), which matches key-value store implementations, and we present an optimal O(n log n) algorithm for RACI with fixed machines and unitary jobs. Additionally, we obtain a (4-2/m)-approximation for arbitrary jobs and introduce new heuristics whose solutions are very close to optimal in practice. Finally, we show that optimizing multi-get requests individually also leads to global improvements, increasing achieved throughput by 27%-34% in realistic cases compared to a state-of-the-art strategy.
- Day 4: Thursday, June 27
- 9.30-10.30: Session 8
- Ahmad Tarraf - Frequency Techniques for I/O - Slides
Abstract: Characterizing the temporal I/O behavior of an application is a challenging task. Yet, many HPC applications perform their I/O in bursts that follow a periodic pattern. System providers can leverage knowledge about the periodicity of I/O phases to reduce file-system contention by actively scheduling I/O bandwidth or for burst buffer management. The effectiveness of such approaches, however, depends on the ability to detect and quantify the period of the I/O patterns online. To find the period of the I/O phases, we use a well-known technique from the signal processing domain, namely the discrete Fourier transform, combined with outlier detection methods. In this talk, we will examine the approach, its applicability, and its limitations. We illustrate the approach with extensive experiments on a large production system and demonstrate its applicability to I/O scheduling.
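A compact sketch of the core signal-processing step, in plain NumPy and without the outlier-handling refinements discussed in the talk: sample the I/O bandwidth at a fixed rate, take the discrete Fourier transform, and read the period off the strongest non-DC peak.

```python
import numpy as np

def dominant_period(bandwidth, dt):
    """Estimate the period (s) of a bursty I/O signal sampled every `dt`
    seconds from the strongest non-DC peak of its DFT."""
    x = np.asarray(bandwidth) - np.mean(bandwidth)  # remove the DC offset
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=dt)
    k = 1 + np.argmax(spectrum[1:])                 # skip the zero-frequency bin
    return 1.0 / freqs[k]

# Synthetic trace: a 3 s I/O burst every 30 s, sampled once per second.
t = np.arange(0, 600, 1.0)
trace = (t % 30 < 3).astype(float) + 0.05 * np.random.default_rng(2).random(t.size)
print(f"estimated period: {dominant_period(trace, 1.0):.1f} s")   # ~30.0
```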
- Robin Boezennec - Scheduling Strategies for HPC Batch Scheduler with Disaggregated Memory - Slides
Abstract: In this work, we consider scheduling strategies to deal with disaggregated memory for HPC systems. Disaggregated memory is an implementation of storage management that provides flexibility by giving the option to allocate storage based on system-defined parameters. In this case, we consider a memory hierarchy that allows the memory resources to be partitioned arbitrarily among several nodes depending on need. This memory can be dynamically reconfigured at a cost. We provide algorithms that pre-allocate or dynamically reconfigure the disaggregated memory based on estimated needs, and we provide theoretical performance results for these algorithms. An important contribution of our work is to show that the system can design allocation algorithms even if user memory estimates are not accurate, and for dynamic memory patterns; these algorithms rely on the statistical behavior of jobs. We observe the impact of parameters of interest, such as the reconfiguration cost, on performance.
- 10.50-11.50: Session 9
- Joachim Cendrier - Scheduling jobs under a variable number of processors - Slides
Abstract: Because of the growing concern about their energy consumption and impact on the environment, more and more data centers aim at using renewable energies in order to improve their sustainability. This amounts to letting the capacity of the HPC platform, hence the number of available processors, evolve over time. In this work, we go beyond previous approaches by allowing jobs to be checkpointed before a resource change, thus avoiding losing work. We assume that the resource provider warns the user before a change in the number of processors, so it is possible to anticipate and take checkpoints before the change happens. The goal is then to maximize the goodput and/or the minimum yield of jobs within the next period (the time between two changes in the number of processors). We model the problem and design both greedy solutions and sophisticated dynamic programming algorithms. A comprehensive set of simulations building on real-life job sets demonstrates the performance of the proposed algorithms. The best solution maximizes platform utilization while ensuring a high level of fairness.
- Guillaume Pallez - Model Design and Accuracy for Resource Management in HPC - Slides
Abstract: In this talk, I am revisiting how one could design models for resource management in HPC. Through a longitudinal analysis of my past research, I propose and study several design hypotheses.
I discuss and illustrate the fact that having a model that is too accurate may be counter-productive: it complicates the algorithm design without necessarily improving the final performance.
Going further, I try to demonstrate that models should start taking into account the fact that the input information may be inaccurate. This is, I believe, an important paradigm shift for scheduling in HPC resource management, where previous studies have mostly focused on trying to obtain accurate data.
Then, I discuss the fact that the optimization objectives should also be reevaluated: by focusing on optimizing a single objective, there is a risk of losing sight of what is actually happening in terms of the schedule.