Scheduling for large-scale systems
Knoxville, USA, May 13-15, 2009
Detailed program
- Day 1: Wednesday, May 13
- 9.00-9.20: Workshop introduction
- 9.20-11.40: Session 1 - Scheduling
- 9.20: Erik Saule - A Moldable Online Scheduling Algorithm and Its Application to Parallel Short Sequence Mapping. Abstract. Slides.
- 10.00: Domingo Gimenez - Application of metaheuristics to task-to-processors assignation problems. Abstract. Slides.
- 10.40: Coffee break.
- 11.00: Pierre-Francois Dutot - Oracle-based approximation algorithms for the discrete resource sharing scheduling problem. Abstract. Slides.
- 11.40-13.40: Lunch.
- 13.40-15.00: Session 2 - Stochastic scheduling
- 13.40: Emmanuel Jeannot - Bi-Objective Approximation Scheme for Makespan and Reliability Optimization on Uniform Parallel Machine. Abstract. Slides.
- 14.20: Arnold Rosenberg - On "exploiting" node-heterogeneous clusters optimally. Abstract. Slides.
- 15.00-17.20: PhD Session 1
- 15.00: Louis-Claude Canon - A Meta-Greedy Approach applied to a Multiobjectives Scheduling Problem. Abstract. Slides.
- 15.20: Nicolas Gast - An asymptotic method for optimization in stochastic systems and applications to resource allocation problems. Abstract. Slides.
- 15.40: Coffee break.
- 16.00: Fanny Dufosse - On the Complexity of Mapping Pipelined Filtering Services on Heterogeneous Platforms. Abstract. Slides.
- 16.20: Loic Magnan - Scheduling algorithms for workflow optimization. Abstract. Slides.
- 16.40: Alexandru Dobrila - Throughput optimization for micro-factories subject to failures. Abstract. Slides.
- 17.00: Ron Chiang - Toward Understanding Heterogeneity in Computing. Abstract. Slides.
- 18.00: Dinner in room 233.
- 19.15: Social event: Knoxville's Sunsphere.
- 20.30: More social event: drinks at Maplehurst Inn.
- Day 2: Thursday, May 14
- 9.20-11.40: Session 3 - Multicriteria scheduling
- 9.20: Anne Benoit - Mapping Filtering Streaming Applications with Communication Costs. Abstract. Slides.
- 10.00: Coffee break.
- 10.20: Uwe Schwiegelshohn - Online Scheduling with QoS Constraints. Abstract. Slides.
- 11.00: Lionel Eyraud-Dubois - Allocation of Clients to Multiple Servers on Large Scale Heterogeneous Platforms. Abstract. Slides.
- 11.40-13.20: Lunch.
- 13.20-14.40: PhD Session 2
- 13.20: Fengguang Song - A Scalable Multicast Scheme for Distributed DAG Scheduling. Abstract. Slides.
- 13.40: Benjamin Depardon - Self-Stabilizing k-clustering for weighted graphs. Abstract. Slides.
- 14.00: Matthew Nabity - Look ahead technique for reduction to Hessenberg form: design of the algorithm and applicability on current hardware. Abstract. Slides.
- 14.20: Mathias Jacquelin - Steady-State Scheduling on CELL. Abstract. Slides.
- 14.40: Coffee break.
- 15.00-17.00: Session 4 - Multicore
- 15.00: Emmanuel Agullo - Comparative Study of One-Sided Linear Algebra Factorizations with Open-Source and Equivalent Vendor Software on Various Multi-Core Hardware. Abstract. Slides.
- 15.40: Piotr Luszczek - Generic Dynamic Scheduler for Numerical Libraries on Multicore Processors. Abstract. Slides.
- 16.20: Richard Vuduc - Asynchronous Parallel Dense Linear Algebra in the Concurrent Collections Programming Model. Abstract. Slides.
- 17.30: Bus departure for the surprise dinner (from ICL).
- Day 3: Friday, May 15
- 9.00-12.00: Session 5 - Numerical Linear Algebra
- 9.00: Bora Ucar - Towards parallel bipartite matching algorithms. Abstract. Slides.
- 9.40: Abdou Guermouche - Memory-Aware Scheduling for Sparse Direct Methods. Abstract. Slides.
- 10.20: Coffee break.
- 10.40: Padma Raghavan - Energy-Aware Scheduling for Scalable Matrix Computations. Abstract. Slides.
- 11.20: Julien Langou - Fault Tolerant Linear Algebra: goals and methods. Abstract. Slides.
- 12.00: Workshop conclusion and lunch.
List of presentations
- Wednesday, May 13
- 9.20: Erik Saule
- A Moldable Online Scheduling Algorithm and Its Application to
Parallel Short Sequence Mapping.
A crucial step in DNA sequence analysis is mapping sequences generated by next-generation sequencing instruments to a reference genome. To respond to the ever-increasing demand, we are building a Grid Service to serve multi-user parallel short sequence mapping queries. We focus on efficient online scheduling of such queries on a multiprocessor system. With the availability of parallel execution models, the problem at hand becomes a moldable task scheduling problem where the number of processors needed to execute a task is determined by the scheduler. We propose an online scheduling algorithm to minimize the stretch of the tasks in the system. This metric provides better fairness to small tasks than the flow time metric and is well suited to the nature of the problem. Results on two workload scenarios indicate that the algorithm achieves significantly smaller stretch than a recent algorithm and is fairer to small tasks.
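For readers unfamiliar with the metric: the stretch of a task is its time spent in the system divided by the time it would need if executed alone. A minimal sketch of how stretch would be evaluated for an already-computed moldable schedule; the numbers and the linear speed-up model are invented for illustration and are not from the talk:
    # Evaluating the stretch metric for a set of moldable tasks that have
    # already been scheduled (hypothetical values).
    tasks = [
        # (release, finish, sequential work, processors allotted)
        (0.0, 10.0, 10.0, 1),
        (2.0, 12.0, 8.0, 4),    # a small task delayed behind a large one
    ]

    def stretch(release, finish, work, procs):
        # Ideal time: sequential work divided by the allotment, assuming
        # perfect (linear) speed-up -- an assumption of this sketch only.
        ideal = work / procs
        return (finish - release) / ideal

    values = [stretch(*t) for t in tasks]
    print("per-task stretch:", values)
    print("max stretch:", max(values))   # the quantity the online algorithm keeps small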
- 10.00: Domingo Gimenez - Application of metaheuristics to
task-to-processors assignation problems.
The optimization of the execution time of a parallel algorithm can be
achieved through the use of a function representing the running
time. The cost function includes a set of parameters that model
the behaviour of the system and the algorithm. In order to reach
an optimal execution, some of these parameters must be fitted
according to the input problem and to the target
architecture. An optimization problem is stated where the
modelled execution time for the algorithm is used to estimate
the parameters. Due to the large number of parameters in the
model, analytical minimization and exhaustive search techniques
are discarded. The problem becomes particularly difficult in complex systems hosting a large number of heterogeneous processors solving non-trivial scientific applications. Metaheuristics make it possible to develop valid approaches to such general problems with a large number of parameters. In this presentation we summarize our experience in solving task-to-processors assignation problems in which the analytical function is optimized with metaheuristic methods. A common algorithmic scheme and class hierarchy ease the development, tuning and hybridization of, and experimentation with, different metaheuristics.
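As an illustration of the kind of search this enables, here is a hedged sketch of simulated annealing over task-to-processor assignments; the cost model (makespan of per-processor loads), workloads and speeds are invented for the example and are not the authors' model:
    import random, math

    # Hypothetical cost model: modelled execution time of an assignment of
    # tasks to processors (makespan of per-processor loads).
    work  = [5.0, 3.0, 8.0, 2.0, 7.0, 4.0]     # task workloads (invented)
    speed = [1.0, 2.0, 1.5]                    # processor speeds (invented)

    def modelled_time(assign):
        load = [0.0] * len(speed)
        for task, proc in enumerate(assign):
            load[proc] += work[task] / speed[proc]
        return max(load)

    def simulated_annealing(steps=5000, temp=1.0, cooling=0.999):
        assign = [random.randrange(len(speed)) for _ in work]
        best, best_cost = assign[:], modelled_time(assign)
        cost = best_cost
        for _ in range(steps):
            cand = assign[:]                       # move: reassign one task
            cand[random.randrange(len(work))] = random.randrange(len(speed))
            c = modelled_time(cand)
            if c < cost or random.random() < math.exp((cost - c) / temp):
                assign, cost = cand, c
                if c < best_cost:
                    best, best_cost = cand[:], c
            temp *= cooling
        return best, best_cost

    print(simulated_annealing())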
- 11.00: Pierre-Francois Dutot - Oracle-based approximation
algorithms for the discrete resource sharing scheduling
problem.
In this talk I will introduce both the discrete resource sharing scheduling problem and the methodology we used to find approximation algorithms for this problem. This methodology has been formalized in order to provide a new way to look for approximation algorithms. In some cases a PTAS can be designed using this methodology.
- 13.40: Emmanuel Jeannot - Bi-Objective Approximation
Scheme for Makespan and Reliability Optimization on Uniform
Parallel Machine.
We study the problem of scheduling independent tasks on a set of related processors which have a probability of failure governed by an exponential law. We are interested in the bi-objective analysis, namely the simultaneous optimization of the makespan and the reliability. We study its complexity and provide an exact algorithm in some cases, an approximation algorithm in others, and a heuristic in the general case.
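To make the two objectives concrete, the sketch below evaluates both the makespan and the reliability of a given assignment under an exponential failure law, assuming task failures are independent; the speeds, failure rates and task sizes are invented:
    import math

    # Hypothetical instance: related processors with speeds and exponential
    # failure rates; tasks are independent with given sizes.
    speeds = [2.0, 1.0]          # processor speeds (invented)
    lam    = [1e-3, 5e-4]        # failure rates per time unit (invented)
    tasks  = [4.0, 6.0, 3.0]     # task sizes (invented)
    assign = [0, 1, 0]           # task -> processor

    def makespan_and_reliability(assign):
        load = [0.0] * len(speeds)
        log_rel = 0.0
        for t, p in zip(tasks, assign):
            exec_time = t / speeds[p]
            load[p] += exec_time
            # A task run on processor p for exec_time units succeeds with
            # probability exp(-lam[p] * exec_time); the whole schedule
            # succeeds if every task does (independence assumed here).
            log_rel += -lam[p] * exec_time
        return max(load), math.exp(log_rel)

    print(makespan_and_reliability(assign))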
- 14.20: Arnold Rosenberg - On "exploiting"
node-heterogeneous clusters optimally.
Joint work with Micah Adler and Ying Gong.
We prove that FIFO worksharing protocols provide asymptotically
optimal solutions to
the Cluster-Exploitation Problem (CEP), wherein one seeks to
accomplish as much work as
possible on a node-heterogeneous cluster
C during a prespecified fixed period of L time
units. The worksharing protocols that we study send work to each of
C's computers in
a single "bundle," and they require each computer to return the
results from its work in a single "bundle." The protocols are
crafted within a model that characterizes C via parameters that
measure C's computers' computational and communicational powers
and that allow only one message at a time to be in transit. A
protocol observes a FIFO regimen if it has
C's computers finish their assigned work, and return their results, in the
same order in which they are supplied with work. FIFO protocols solve
the CEP optimally
in that they accomplish at least as much work as any other protocol
during all sufficiently
long worksharing episodes. Simulation experiments illustrate that the
superiority of FIFO
protocols is often observed during worksharing episodes of only a few
minutes' duration.
Protocols that solve the CEP optimally can easily be adapted to
provide optimal solutions
to the Cluster-Rental Problem, wherein one seeks to complete W units
of work by "renting" C for as short a time as necessary.
- 15.00: Louis-Claude Canon - A Meta-Greedy Approach applied to a Multiobjectives Scheduling Problem.
Scheduling problems can often be tackled with respect to several objectives. Makespan, reliability and stretch are examples of such objectives. In this case, there can be several incomparable Pareto-optimal solutions. We propose an approach that refines an EFT scheduling heuristic by considering every objective at each step. We show that the modified method is able to generate, in a single execution, a set of non-dominated solutions that approximates the set of Pareto-optimal solutions.
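The baseline being refined is an Earliest-Finish-Time (EFT) list-scheduling heuristic. A minimal single-objective EFT sketch on an invented task graph (communication costs ignored) shows the per-step decision that the meta-greedy approach extends to several objectives:
    # Minimal Earliest-Finish-Time (EFT) list scheduling on a small task
    # graph with precedence constraints; all values are invented.
    tasks = ["A", "B", "C", "D"]                 # already in topological order
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    cost  = {"A": [3, 4], "B": [2, 5], "C": [4, 2], "D": [3, 3]}  # cost[task][processor]

    proc_ready = [0, 0]          # when each processor becomes free
    finish = {}                  # finish time of each scheduled task
    for t in tasks:
        data_ready = max((finish[p] for p in preds[t]), default=0)
        # Pick the processor on which the task finishes earliest.
        options = [(max(data_ready, proc_ready[p]) + cost[t][p], p)
                   for p in range(len(proc_ready))]
        f, p = min(options)
        finish[t], proc_ready[p] = f, f
    print("makespan:", max(finish.values()))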
- 15.20: Nicolas Gast - An asymptotic method for
optimization in stochastic systems and applications to
resource allocation problems.
We present a generic framework for optimization in large-scale
stochastic systems under mean field interactions. In particular,
we show that when the size of the system grows, the average
behavior of the system goes to a deterministic limit. This
provides a way to compute optimal policies for various
stochastic problems. We further provide insights into the speed of convergence with an explicit formula.
Then, our framework is applied to a brokering problem in grid computing.
The optimal policy for the deterministic limit is computed. We compare
the performance of this policy applied to the initial stochastic
system with classical policies (such as Join the Shortest Queue). This
example shows that our model's performance is good even for systems of
small size.
- 16.00: Fanny Dufosse - On the Complexity of Mapping
Pipelined Filtering Services on Heterogeneous
Platforms.
In this paper, we explore the problem of mapping filtering services on large-scale
heterogeneous platforms. Two important optimization criteria should be considered
in such a framework. The period, which is the inverse of the throughput, measures
the rate at which data sets can enter the system. The latency measures the response
time of the system in order to process one single data set entirely. Both criteria are
antagonistic. For homogeneous platforms, the complexity of period minimization is
already known; we derive an algorithm to solve the latency minimization problem
in the general case with service precedence constraints; for independent services we
also show that the bi-criteria problem (latency minimization without exceeding a
prescribed value for the period) is of polynomial complexity. However, when adding
heterogeneity to the platform, we prove that minimizing the period or the latency
becomes NP-hard, and that these problems cannot be approximated by any constant
factor (unless P=NP). The latter results hold true even for independent services. We
provide an integer linear program to solve both problems in the heterogeneous case
with independent services.
For period minimization on heterogeneous platforms, we design some
efficient polynomial time heuristics and we assess their
relative and absolute performance through
a set of experiments. For small problem instances, the results are very close to the
optimal solution returned by the integer linear program.
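For intuition about the two criteria, the sketch below computes the period and latency of a small linear chain of filtering services, assuming one service per identical processor and that a service's time scales with the volume of data reaching it; the costs and selectivities are invented:
    # Invented linear chain of filtering services: each service i has a base
    # cost c[i] and a selectivity sel[i] (output size / input size).
    c   = [10.0, 6.0, 8.0, 4.0]
    sel = [0.5, 1.2, 0.25, 1.0]

    # Assumption of this sketch: each service runs on its own processor and
    # its time is its base cost scaled by the data volume it receives, i.e.
    # the product of the selectivities of the services before it.
    volume = 1.0
    times = []
    for ci, si in zip(c, sel):
        times.append(ci * volume)   # time to process one data set at this stage
        volume *= si                # the stage shrinks or expands the data set

    period  = max(times)            # governs the rate at which data sets enter
    latency = sum(times)            # time for one data set to traverse the chain
    print("stage times:", times, "period:", period, "latency:", latency)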
- 16.20: Loic Magnan - Scheduling algorithms for workflow
optimization.
Pipelined workflows are a popular programming paradigm for parallel
applications. In these workflows, the computation is divided into
several stages and these stages are connected to each other through
first-in first-out channels. In order to execute these workflows on
a parallel machine, we must first find the mapping of these stages onto the various processors of the machine. After computing the mapping,
we must compute the schedule, that is, the order in which the various
stages execute on their assigned processors.
In this talk, we present results on scheduling of linear workflows,
given the mapping. Linear workflows are a special case of workflows
where the stages can be represented by a linear graph. The objective
of the scheduling algorithms is to either maximize throughput or
minimize latency or both. We consider two realistic models: the
one-port model and the multi-port model. In both models, finding a
schedule to minimize latency is trivial. However, computing the
schedule to minimize period (maximize throughput) is NP-Hard in the
one-port model, but can be computed in polynomial time in the
multi-port model. We also present an approximation algorithm
minimizing period in the one-port model.
- 16.40: Alexandru Dobrila - Throughput optimization for
micro-factories subject to failures.
We discuss the problem of optimizing the throughput for
micro-factories subject to failures. The challenge consists in
mapping several tasks onto a set of machines. The originality of
our approach is the failure model for such applications in which
tasks are subject to failures rather than machines.
If there is exactly one task per machine in the mapping, then we
prove that the optimal solution can be computed in polynomial
time. However, the problem becomes NP-hard if several tasks can be
assigned to the same machine. Several polynomial time heuristics are presented for the most realistic specialized setting, in which tasks of the same type can be mapped onto the same machine. Experimental results show that the best heuristics obtain a good throughput, much better than the throughput obtained with a random mapping. Moreover, we obtain a throughput close to the optimal solution in the particular cases in which the optimal throughput can be computed.
- 17.00: Ron Chiang - Toward Understanding Heterogeneity in
Computing.
Heterogeneity certainly complicates the efficient use of computing
platforms, but does it enhance their performance? their cost
effectiveness? How can one measure the power of a heterogeneous
assemblage of computers ("cluster," for short), both in absolute and
relative terms (i.e., comparing the power of two clusters)? Is one
better off with a cluster that has one super-fast computer and the
rest just "average" in speed, or with a cluster all of whose
computers are moderately fast? If you could afford to speed up
just one computer in your cluster, which one would you choose: the
fastest? the slowest? How does one even ask questions such as these
in a formal, yet tractable framework? A framework is proposed, and
some answers are derived, a few rather surprising.
- Thursday, May 14
- 9.20: Anne Benoit - Mapping Filtering Streaming
Applications with Communication Costs.
In this talk, we explore the problem of mapping filtering
streaming applications on large-scale homogeneous platforms, with
a particular emphasis on communication models and their impact.
Filtering applications are streaming applications where each node also has a selectivity, which either increases or decreases the size of its input data set. This selectivity makes the problem of scheduling these applications more challenging than the more widely studied problem of scheduling "non-filtering" streaming workflows.
We identify three significant realistic communication models. For
each of them, we address the complexity of the following important
problems:
(i) Given an execution graph, how can one compute the
period and latency? A solution to this problem is an operation list
which provides the time-steps at which each computation and each
communication occurs in the system.
(ii) Given a filtering workflow problem, how can one compute the schedule that
minimizes the period or latency? A solution to this problem requires
generating both the execution graph and the associated operation list.
Altogether, with three models, two problems and two objectives, we
present 12 complexity results, thereby providing solid theoretical
foundations for the study of filtering streaming applications.
- 10.20: Uwe
Schwiegelshohn - Online Scheduling with QoS
Constraints.
We discuss
an online scheduling problem with parallel identical machines and a
quality-of-service (QoS) constraint. The QoS constraint is represented with the
help of a special form of machine eligibility constraints: These constraints
must satisfy the condition that for any two machine eligibility sets Mi
and Mj, one of the sets is a subset of the other. It is well known
that the least flexible job first rule (LFJ) produces an optimal solution for
the problem Pm|Mj|Cmax if the machine
eligibility constraints satisfy the inclusion property as in the QoS case.
However, LFJ scheduling cannot guarantee optimality in an online scenario, as newly submitted jobs may be less flexible than previously scheduled jobs. Therefore, we address the performance of arbitrary list scheduling. First, we assume a very large number of machines and a very large number of jobs with unique processing times, that is, we consider a continuous version of this scheduling problem. If jobs can be allocated to all eligible machines, there is no constant competitive ratio for list scheduling. Therefore, we suggest allocating a job only to a fixed percentage of its eligible machines and prohibiting its allocation to the machines with the highest QoS property. This approach leads to a first-order linear differential equation. By solving this equation, we show that 50% eligibility guarantees a competitive ratio of 4, which is the best result for this approach. With the help of the online results for Pm||Cmax, we can then obtain a solution for discrete scheduling problems.
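A minimal sketch of the least flexible job first (LFJ) dispatching rule with nested eligibility sets, purely to illustrate the rule; the jobs, processing times and machines are invented, and the sketch does not reproduce the talk's competitive analysis:
    # LFJ with nested machine-eligibility sets (the QoS case); data invented.
    machines = ["m0", "m1", "m2"]            # m0 = highest QoS
    jobs = [
        # (name, processing_time, eligible machines -- nested sets)
        ("j1", 4, {"m0"}),
        ("j2", 3, {"m0", "m1"}),
        ("j3", 5, {"m0", "m1", "m2"}),
        ("j4", 2, {"m0", "m1", "m2"}),
    ]

    free_at = {m: 0 for m in machines}
    schedule = []
    # LFJ: always pick the job with the fewest eligible machines first.
    for name, p, elig in sorted(jobs, key=lambda j: len(j[2])):
        m = min(elig, key=lambda mm: free_at[mm])   # earliest available eligible machine
        start = free_at[m]
        free_at[m] = start + p
        schedule.append((name, m, start, start + p))

    print(schedule)
    print("Cmax:", max(free_at.values()))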
- 11.00: Lionel Eyraud-Dubois - Allocation of Clients to
Multiple Servers on Large Scale Heterogeneous
Platforms.
We consider the problem of allocating a large number of
independent, equal-sized tasks to a heterogeneous large scale
computing platform. We model the platform using a set of
servers (masters) that initially hold (or generate) the tasks to
be processed by a set of clients (slaves). All resources have
different speeds of communication and computation and we model
contentions using the bounded multi-port model. Under this model,
a processor can be involved simultaneously in several
communications, provided that its incoming and outgoing
bandwidths are not exceeded. This model corresponds well to
modern networking technologies, but for the sake of realism, we
introduce another parameter that bounds the number of
simultaneous connections that can be opened at a server node. We
prove this additional parameter makes the problem of maximizing
the overall throughput (the fractional number of tasks that
can be processed within one time-unit) NP-Complete. This result
is closely related to results on bin packing with splittable
items and cardinality constraints. We also propose a polynomial time algorithm based on a small resource augmentation: it achieves optimal throughput while using only one more connection per server node. This algorithm also provides a very good approximation for the dual problem of minimizing the maximal number of connections that need to be opened in order to achieve a given throughput. Finally, we present extensive simulations to assess the performance of the proposed algorithm.
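Without the bound on simultaneously open connections, steady-state throughput maximization reduces to a small linear program. The sketch below (scipy, invented capacities) shows that relaxed problem; the connection bound that makes the real problem NP-complete is deliberately left out:
    import numpy as np
    from scipy.optimize import linprog

    # Invented instance: 2 servers, 3 clients. Variable x[s][c] = rate of
    # tasks sent from server s to client c (tasks per time unit).
    out_bw  = [10.0, 6.0]         # server outgoing capacities, in tasks/unit
    cli_cap = [4.0, 5.0, 3.0]     # per-client min(incoming bandwidth, speed)

    n_s, n_c = len(out_bw), len(cli_cap)
    n = n_s * n_c                 # one variable per (server, client) pair

    A_ub, b_ub = [], []
    for s in range(n_s):          # sum_c x[s][c] <= out_bw[s]
        row = np.zeros(n); row[s * n_c:(s + 1) * n_c] = 1.0
        A_ub.append(row); b_ub.append(out_bw[s])
    for c in range(n_c):          # sum_s x[s][c] <= cli_cap[c]
        row = np.zeros(n); row[c::n_c] = 1.0
        A_ub.append(row); b_ub.append(cli_cap[c])

    # Maximize total throughput sum(x)  <=>  minimize -sum(x).
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(0, None)] * n, method="highs")
    print("optimal (relaxed) throughput:", -res.fun)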
- 13.20: Fengguang Song - A Scalable Multicast Scheme for
Distributed DAG Scheduling.
We have designed an application-level non-blocking multicast scheme for
dynamic DAG scheduling on large-scale distributed-memory systems. The
multicast scheme takes into account both network topology and space
requirement of routing tables to achieve scalability. Specifically, we
prove that the scheme is deadlock-free and takes at most log N steps to complete. The routing table chooses appropriate neighbors to store based on topology IDs and has a small space requirement of O(log N). Our experimental results show that the scheme is significantly better than the simple flat-tree method and is close to the vendor's collective MPI operations.
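The log N bound follows the usual doubling argument: in each step, every node that already holds the message forwards it to one node that does not. A small sketch of that argument; it illustrates only the step count, not the paper's topology-aware routing tables:
    import math

    # Binomial-tree multicast: at every step, each node that already has the
    # message sends it to one node that does not, so N nodes are covered in
    # ceil(log2(N)) steps.
    def multicast_rounds(n_nodes, root=0):
        have = [root]
        rounds = []
        while len(have) < n_nodes:
            sends = []
            for s in list(have):
                nxt = len(have)
                if nxt >= n_nodes:
                    break
                sends.append((s, nxt))   # node s forwards to node nxt
                have.append(nxt)
            rounds.append(sends)
        return rounds

    r = multicast_rounds(10)
    print(len(r), "steps for 10 nodes; ceil(log2(10)) =", math.ceil(math.log2(10)))
    for i, sends in enumerate(r):
        print("step", i, ":", sends)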
- 13.40: Benjamin Depardon - Self-Stabilizing k-clustering
for weighted graphs.
Efficiently scheduling jobs on a platform often requires partitioning the platform according to a given metric, such as the latency between nodes. Latency costs within such an infrastructure can be improved, or at least bounded, by using a k-clustering. A k-clustering of a graph is a partition of the nodes into disjoint sets, called clusters, in which every node is at distance at most k from a designated node in its cluster, called the clusterhead. As Grids are distributed, changing, and error-prone environments, we designed a self-stabilizing asynchronous distributed algorithm for constructing a k-clustering of a connected network of processes with unique IDs and weighted edges. The algorithm is comparison-based, takes O(nk) time, and uses O(log(n) + log(k)) space per process, where n is the size of the network. To the best of our knowledge, this is the first distributed solution to the k-clustering problem on weighted graphs.
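To illustrate the k-clustering definition itself, here is a centralized greedy sketch on a small weighted graph (pick an uncovered node as clusterhead, claim everything within distance k, repeat); unlike the talk's algorithm it is neither distributed nor self-stabilizing, and the graph is invented:
    import heapq

    graph = {  # adjacency lists with edge weights (invented)
        "a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 2, "d": 5},
        "c": {"a": 4, "b": 2, "d": 1}, "d": {"b": 5, "c": 1, "e": 3},
        "e": {"d": 3},
    }

    def within_k(src, k):
        # Dijkstra restricted to distance <= k.
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in graph[u].items():
                if d + w <= k and d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return set(dist)

    def k_clustering(k):
        uncovered, clusters = set(graph), {}
        while uncovered:
            head = min(uncovered)                   # arbitrary deterministic choice
            clusters[head] = within_k(head, k) & uncovered
            uncovered -= clusters[head]
        return clusters                             # clusterhead -> cluster members

    print(k_clustering(k=3))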
- 14.00: Matthew Nabity - Look ahead technique for
reduction to Hessenberg form: design of the algorithm and
applicability on current hardware.
The classic approach to solve the nonsymmetric eigenvalue problem is to first reduce a nonsymmetric (square) matrix to Hessenberg form and then to perform the Hessenberg QR algorithm. This talk is concerned with the first part: the reduction to Hessenberg form.
Reduction to Hessenberg form is a challenging task on modern computers. This operation belongs to the class of two-sided transformations. The algorithm has n steps where n is the size of the initial matrix and, at step k, a linear transformation needs to be applied on the left and on the right of a matrix of order (n-k). This implies huge constraints on the memory bandwidth of the machine.
Dongarra, Sorensen and Hammarling [1] proposed a block algorithm for the reduction to Hessenberg form. The algorithm consists of a loop of panel factorizations and updates of the trailing matrix. Updates of the trailing matrix are performed with Level-3 BLAS and encompass a major part of the operations, which gives a somewhat efficient algorithm. A slight (more economical) variant is used in LAPACK. Unfortunately, the panel factorization is based on Level-2 BLAS and can consume a considerable amount of time on a multi-core platform.
In this presentation we extend the idea of Dongarra, Sorensen and Hammarling from blocking to look-ahead. The idea of look-ahead is to overlap the panel factorization and the update. We trade flops, memory copies and memory space for this overlap.
It is understood that look-ahead does not bring a tremendous speed-up. Since there are only two tasks (panel and update), the best speed-up we can target is a factor of two. Moreover, on multi-core architectures, the panel factorization takes considerably longer than the update, and we therefore predict that this technique will bring very little speedup, if any, in that basic setup. However, recent experiments by Tomov [2] show that the panel factorization can be performed efficiently on a GPU. In this hybrid context, where one unit is designed for the panel factorization (the GPU) and another for the update (the multi-core), the look-ahead technique offers an interesting alternative to the current LAPACK algorithm.
Further development is needed to make this approach effective on a multi-core-only architecture. For example, one can think of changing the data structure of the matrix: one data structure designed for the update and another for the panel factorization, yielding, in some sense, a hybrid data structure.
Overall, we do not have a magic solution for the reduction to Hessenberg form on multi-core platforms. We indeed think that the basic algorithm is as good as it can get in the big-O sense. In this work, we improve data movement efficiency by exploiting the data structure and/or the characteristics of the computing components, but unfortunately the amount of data to be moved (the bandwidth strain) to go from dense to Hessenberg form is, in the big-O sense, the same as for the classic algorithm or the LAPACK algorithm.
[1] J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of
matrices to condensed forms for eigenvalue computations. J. Comput. Appl.
Math., 27(1-2):215-227, 1989. Reprinted in Parallel algorithms for numerical
linear algebra, 215-227, North-Holland, Amsterdam, 1990.
[2] S. Tomov. Personal communication.
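For reference, a minimal unblocked Householder reduction to Hessenberg form, i.e. the textbook algorithm that the blocked and look-ahead variants reorganize; this sketch is illustrative only and contains none of the blocking, look-ahead or hybrid data structures discussed above:
    import numpy as np

    # Unblocked Householder reduction to Hessenberg form.
    def hessenberg(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(n - 2):
            x = A[k + 1:, k].copy()
            alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
            v = x.copy()
            v[0] -= alpha
            norm_v = np.linalg.norm(v)
            if norm_v == 0:
                continue
            v /= norm_v
            # Two-sided update A <- H A H with H = I - 2 v v^T acting on the
            # trailing rows/columns; eigenvalues are preserved by similarity.
            A[k + 1:, k:] -= 2.0 * np.outer(v, v @ A[k + 1:, k:])
            A[:, k + 1:]  -= 2.0 * np.outer(A[:, k + 1:] @ v, v)
        return A

    A = np.random.rand(6, 6)
    H = hessenberg(A)
    print(np.allclose(np.tril(H, -2), 0))   # everything below the first subdiagonal is zero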
- 14.20: Mathias Jacquelin - Steady-State Scheduling on
CELL.
During this talk, we will present ongoing work on steady-state
scheduling for the CELL processor.
We focus on stream processing applications that consist of many different pipelined tasks organized as a directed graph, and we allocate these tasks to the different processing elements of the CELL in order to ensure an optimal throughput.
We have implemented such an application (a vocoder).
We will present the theoretical foundations of this work as well as
preliminary performance results.
- 15.00: Emmanuel Agullo - Comparative Study of One-Sided
Linear Algebra Factorizations with
Open-Source and Equivalent Vendor Software on Various Multi-Core
Hardware.
The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) project aims at achieving both high performance and portability across a wide range of multi-core architectures. I will present a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), new approaches to parallel execution (Task Based Linear Algebra Subroutines - TBLAS), as well as equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on three one-sided linear algebra factorizations (LU, QR and Cholesky) and used two multi-core architectures (based on Intel Xeon EMT64 and IBM Power6).
The performance results show improvements brought by new algorithms on up
to 32 cores - the largest system we could access.
- 15.40: Piotr Luszczek - Generic Dynamic Scheduler for
Numerical Libraries on Multicore Processors.
Different approaches are briefly discussed for implementing numerical
software on multicore processors, including the static
scheduling employed currently by the PLASMA project, as well as
dynamic scheduling using Cilk and SMPSs. Declarative programming
is also considered. Performance comparisons are presented. A
presentation of ICL's dynamic scheduler follows. The scheduler closely follows SMPSs in methodology: the DAG of the computation is explored through a sliding window, and tasks are scheduled according to data dependencies. RAW, WAR and WAW data hazards are considered. Unlike SMPSs, the scheduler relies exclusively on an API, without any use of compiler technology. Finally,
opportunities are discussed for customizations specific to
numerical computing.
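A toy sketch of the dependency analysis such a scheduler performs: tasks declare the data they read and write, edges are inferred from RAW, WAR and WAW hazards, and tasks are released as their predecessors complete. The task names and the API below are invented, not ICL's actual interface:
    # Invented tasks with their read/write sets (loosely Cholesky-flavoured).
    tasks = [
        ("potrf1", {"reads": set(),   "writes": {"A11"}}),
        ("trsm1",  {"reads": {"A11"}, "writes": {"A21"}}),
        ("syrk1",  {"reads": {"A21"}, "writes": {"A22"}}),
        ("potrf2", {"reads": set(),   "writes": {"A22"}}),
    ]

    deps = {name: set() for name, _ in tasks}
    for i, (ni, ti) in enumerate(tasks):
        for nj, tj in tasks[i + 1:]:
            raw = ti["writes"] & tj["reads"]     # read-after-write
            war = ti["reads"]  & tj["writes"]    # write-after-read
            waw = ti["writes"] & tj["writes"]    # write-after-write
            if raw or war or waw:
                deps[nj].add(ni)                 # nj must wait for ni

    # "Execute" tasks as their dependencies complete (sequentially here).
    done, order = set(), []
    while len(done) < len(tasks):
        for name, _ in tasks:
            if name not in done and deps[name] <= done:
                order.append(name)
                done.add(name)
    print(order)    # a valid execution order respecting all hazards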
- 16.20: Richard Vuduc - Asynchronous Parallel Dense Linear
Algebra in the Concurrent Collections Programming
Model.
We apply a novel and general programming model, the Intel Concurrent Collections (CnC), to the implementation of asynchronous parallel dense linear algebra algorithms on current multicore and future manycore systems. In CnC, the programmer expresses her
computation in terms of application-specific operations,
partially-ordered by semantic scheduling constraints. We
implement and evaluate CnC versions of both a recently proposed asynchronous parallel algorithm for computing a Cholesky factorization by Buttari et al. and a novel
higher-level asynchronous eigensolver for dense symmetric
matrices. Given a well-tuned sequential BLAS, our
implementations match or exceed competing hand-tuned codes by up
to 2x. We include experimental evaluations against predominantly
bulk-synchronous approaches in alternative models, including
ScaLAPACK with a shared memory MPI, OpenMP, and Cilk++. Looking
forward, we identify new opportunities to improve the CnC
language and run-time scheduling and execution. This talk is
based on the work of Aparna Chandramowlishwaran (Georgia Tech),
in collaboration with myself and Kathleen Knobe (Intel).
- Friday, May 15
- 9.00: Bora Ucar - Towards parallel bipartite matching
algorithms.
We consider matchings yielding maximum diagonal products in sparse matrices. An optimal matching is invariant under suitable scaling. We use scaling algorithms to obtain at least one entry of magnitude 1.0 in each row and column. We then find a maximum cardinality matching using only the entries of magnitude 1.0. If the matching is perfect, we are done. If not, we select a special entry and rescale the matrix. We repeat this process until a perfect matching is found. Such an approach is guaranteed to terminate with an optimal matching, but it can be painfully slow. We will discuss alternatives where optimality is sacrificed for practicality.
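A rough sketch of the scaling idea (not the talk's full algorithm): scale rows and columns so that the largest entry in each row and column becomes 1.0, then search for a perfect matching among the entries equal to 1.0; the matrix and the simple augmenting-path matching below are only for illustration:
    import numpy as np

    A = np.array([[4.0, 2.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [5.0, 0.0, 2.0]])

    B = np.abs(A)
    for _ in range(50):                       # alternate row/column max-scaling
        B = B / B.max(axis=1, keepdims=True)  # largest entry of each row -> 1
        B = B / B.max(axis=0, keepdims=True)  # largest entry of each column -> 1

    ones = np.isclose(B, 1.0)                 # candidate (row, column) pairs

    def augment(r, seen, match_col):          # simple augmenting-path matching
        for c in range(ones.shape[1]):
            if ones[r, c] and c not in seen:
                seen.add(c)
                if match_col[c] is None or augment(match_col[c], seen, match_col):
                    match_col[c] = r
                    return True
        return False

    match_col = [None] * ones.shape[1]
    size = sum(augment(r, set(), match_col) for r in range(ones.shape[0]))
    print("matching size:", size, "perfect:", size == ones.shape[0])
    print("column -> row:", match_col)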
- 9.40: Abdou Guermouche - Memory-Aware Scheduling for
Sparse Direct Methods.
Improving the memory behaviour is critical to solve huge
systems of sparse linear equations using parallel direct methods.
Indeed, the limiting factor of such methods to treat arbitrarily
large problems is always memory (rather than performance). To
tackle this problem, many studies and algorithms have been proposed
in the past. The main approach that allows a solver to process
very large problems is the use of disks to extend the memory
(out-of-core execution scheme). Even then, the memory scalability of what remains in core memory is critical and deserves special attention. In this talk, we propose a new way of mapping tasks to processors that achieves high memory scalability. We present some preliminary results using our new algorithm within the MUMPS parallel sparse direct solver. Finally, we give some hints about our current and future work, focusing on the use of multicore architectures and of specific computing devices such as GPUs.
- 10.40: Padma Raghavan - Energy-Aware Scheduling for
Scalable Matrix Computations.
As we approach parallel computing at the petascale and beyond with ensembles of many-core chip multiprocessors, a key challenge concerns sustainable energy-aware performance scaling of matrix
computations. We consider how this challenge can be addressed by
scheduling to explicitly reduce energy while improving
performance. We discuss initial results on the energy and
performance impacts of task-to-processor node mapping and
thread-to-core mapping at a processor node for parallel sparse
linear system solvers.
- 11.20: Julien Langou - Fault Tolerant Linear Algebra:
goals and methods.
Our goal is to develop algorithms for dense linear algebra that are efficient and fault tolerant. Given a dense linear algebra computation, we would like to detect, locate and correct any given set of errors. We will compare the computational cost of two approaches. The first approach (RCRE) is a combination of two known techniques: "Computing Errors" are handled by "Residual Checking and Rollback", and "Data Errors" are handled by "Encoding". These techniques have been shown to be the most effective for their respective tasks. The second approach is based on Algorithm-Based Fault Tolerance (ABFT). To make ABFT practical for our goal, we add two key ingredients: the first is "small steps"; the second is the simultaneous use of row and column checksums. We conclude our study with a comparison of the two aforementioned methods (RCRE vs. ABFT). We compare how the methods behave in the presence of Data Errors and Computing Errors and discuss their computational costs.
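To illustrate the row/column checksum ingredient, the hedged sketch below augments a matrix product with checksum rows and columns, then detects, locates and corrects a single injected error; the scheme shown is the classical ABFT construction for matrix multiplication, not necessarily the talk's exact method:
    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))

    A_c = np.vstack([A, A.sum(axis=0)])                  # append checksum row to A
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])   # append checksum column to B
    C_full = A_c @ B_r                                   # 5x5: data block plus checksums

    # Inject a fault into the data part of C.
    C_full[1, 2] += 0.5

    row_err = C_full[:4, :4].sum(axis=0) - C_full[4, :4]   # column checksum mismatch
    col_err = C_full[:4, :4].sum(axis=1) - C_full[:4, 4]   # row checksum mismatch
    bad_col = int(np.argmax(np.abs(row_err)))
    bad_row = int(np.argmax(np.abs(col_err)))
    print("fault located at", (bad_row, bad_col))           # -> (1, 2)

    # The erroneous entry can then be corrected from either checksum.
    C_full[bad_row, bad_col] -= row_err[bad_col]
    print("corrected:", np.isclose(C_full[1, 2], (A @ B)[1, 2]))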
Contact
For more information, please contact Anne Benoit.