Scheduling for large-scale systems
Knoxville, USA, May 13-15, 2009
Detailed program
- Day 1: Wednesday, May 13
- 9.00-9.20: Workshop introduction
- 9.20-11.40: Session 1 - Scheduling
- 9.20: Erik Saule - A Moldable Online Scheduling Algorithm and Its Application to Parallel Short Sequence Mapping. Abstract. Slides.
- 10.00: Domingo Gimenez - Application of metaheuristics to task-to-processors assignation problems. Abstract. Slides.
- 10.40: Coffee break.
- 11.00: Pierre-Francois Dutot - Oracle-based approximation algorithms for the discrete resource sharing scheduling problem. Abstract. Slides.
- 11.40-13.40: Lunch.
- 13.40-15.00: Session 2 - Stochastic scheduling
- 13.40: Emmanuel Jeannot - Bi-Objective Approximation Scheme for Makespan and Reliability Optimization on Uniform Parallel Machine. Abstract. Slides.
- 14.20: Arnold Rosenberg - On "exploiting" node-heterogeneous clusters optimally. Abstract. Slides.
- 15.00-17.20: PhD Session 1
- 15.00: Louis-Claude Canon - A Meta-Greedy Approach applied to a Multiobjectives Scheduling Problem. Abstract. Slides.
- 15.20: Nicolas Gast - An asymptotic method for optimization in stochastic systems and applications to resource allocation problems. Abstract. Slides.
- 15.40: Coffee break.
- 16.00: Fanny Dufosse - On the Complexity of Mapping Pipelined Filtering Services on Heterogeneous Platforms. Abstract. Slides.
- 16.20: Loic Magnan - Scheduling algorithms for workflow optimization. Abstract. Slides.
- 16.40: Alexandru Dobrila - Throughput optimization for micro-factories subject to failures. Abstract. Slides.
- 17.00: Ron Chiang - Toward Understanding Heterogeneity in Computing. Abstract. Slides.
- 18.00: Dinner in room 233.
- 19.15: Social event: Knoxville's Sunsphere.
- 20.30: More social event: drinks at Maplehurst Inn.
- Day 2: Thursday, May 14
- 9.20-11.40: Session 3 - Multicriteria scheduling
- 9.20: Anne Benoit - Mapping Filtering Streaming Applications with Communication Costs. Abstract. Slides.
- 10.00: Coffee break.
- 10.20: Uwe Schwiegelshohn - Online Scheduling with QoS Constraints. Abstract. Slides.
- 11.00: Lionel Eyraud-Dubois - Allocation of Clients to Multiple Servers on Large Scale Heterogeneous Platforms. Abstract. Slides.
- 11.40-13.20: Lunch.
- 13.20-14.40: PhD Session 2
- 13.20: Fengguang Song - A Scalable Multicast Scheme for Distributed DAG Scheduling. Abstract. Slides.
- 13.40: Benjamin Depardon - Self-Stabilizing k-clustering for weighted graphs. Abstract. Slides.
- 14.00: Matthew Nabity - Look ahead technique for reduction to Hessenberg form: design of the algorithm and applicability on current hardware. Abstract. Slides.
- 14.20: Mathias Jacquelin - Steady-State Scheduling on CELL. Abstract. Slides.
- 14.40: Coffee break.
- 15.00-17.00: Session 4 - Multicore
- 15.00: Emmanuel Agullo - Comparative Study of One-Sided Linear Algebra Factorizations with Open-Source and Equivalent Vendor Software on Various Multi-Core Hardware. Abstract. Slides.
- 15.40: Piotr Luszczek - Generic Dynamic Scheduler for Numerical Libraries on Multicore Processors. Abstract. Slides.
- 16.20: Richard Vuduc - Asynchronous Parallel Dense Linear Algebra in the Concurrent Collections Programming Model. Abstract. Slides.
- 17.30: Bus departure for the surprise dinner (from ICL).
- Day 3: Friday, May 15
- 9.00-12.00: Session 5 - Numerical Linear Algebra
- 9.00: Bora Ucar - Towards parallel bipartite matching algorithms. Abstract. Slides.
- 9.40: Abdou Guermouche - Memory-Aware Scheduling for Sparse Direct Methods. Abstract. Slides.
- 10.20: Coffee break.
- 10.40: Padma Raghavan - Energy-Aware Scheduling for Scalable Matrix Computations. Abstract. Slides.
- 11.20: Julien Langou - Fault Tolerant Linear Algebra: goals and methods. Abstract. Slides.
- 12.00: Workshop conclusion and lunch.
List of presentations
- Wednesday, May 13
- 9.20: Erik Saule
- A Moldable Online Scheduling Algorithm and Its Application to
Parallel Short Sequence Mapping.
A crucial step in DNA sequence analysis is mapping sequences generated by next-generation sequencing instruments to a reference genome. To respond to the ever-increasing demand, we are building a Grid Service to serve multi-user parallel short sequence mapping queries. We focus on efficient online scheduling of such queries on a multiprocessor system. With the availability of parallel execution models, the problem at hand becomes a moldable task scheduling problem where the number of processors needed to execute a task is determined by the scheduler. We propose an online scheduling algorithm to minimize the stretch of the tasks in the system. This metric provides better fairness to small tasks than the flow time metric and is well suited to the nature of the problem. Results on two workload scenarios indicate that the algorithm achieves significantly smaller stretch than a recent algorithm and is fairer to small tasks.
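For readers unfamiliar with the metric: the stretch of a task is its time spent in the system divided by the time it would need if executed alone. A minimal sketch of how stretch would be evaluated for an already-computed moldable schedule; the numbers and the linear speed-up model are invented for illustration and are not from the talk:
    # Evaluating the stretch metric for a set of moldable tasks that have
    # already been scheduled (hypothetical values).
    tasks = [
        # (release, finish, sequential work, processors allotted)
        (0.0, 10.0, 10.0, 1),
        (2.0, 12.0, 8.0, 4),    # a small task delayed behind a large one
    ]

    def stretch(release, finish, work, procs):
        # Ideal time: sequential work divided by the allotment, assuming
        # perfect (linear) speed-up -- an assumption of this sketch only.
        ideal = work / procs
        return (finish - release) / ideal

    values = [stretch(*t) for t in tasks]
    print("per-task stretch:", values)
    print("max stretch:", max(values))   # the quantity the online algorithm keeps small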
- 10.00: Domingo Gimenez - Application of metaheuristics to
task-to-processors assignation problems.
The optimization of the execution time of a parallel algorithm can be
achieved through the use of a function representing the running
time. The cost function includes a set of parameters that model
the behaviour of the system and the algorithm. In order to reach
an optimal execution, some of these parameters must be fitted
according to the input problem and to the target
architecture. An optimization problem is stated where the
modelled execution time for the algorithm is used to estimate
the parameters. Due to the large number of parameters in the
model, analytical minimization and exhaustive search techniques
are discarded. The problem becomes particularly difficult in complex systems hosting a large number of heterogeneous processors solving non-trivial scientific applications. Metaheuristics make it possible to develop valid approaches to such general problems with a large number of parameters. In this presentation we summarize our experience in solving task-to-processors assignation problems in which the analytical function is optimized with metaheuristic methods. A common algorithmic scheme and class hierarchy ease the development, tuning and hybridization of, and experimentation with, different metaheuristics.
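As an illustration of the kind of search this enables, here is a hedged sketch of simulated annealing over task-to-processor assignments; the cost model (makespan of per-processor loads), workloads and speeds are invented for the example and are not the authors' model:
    import random, math

    # Hypothetical cost model: modelled execution time of an assignment of
    # tasks to processors (makespan of per-processor loads).
    work  = [5.0, 3.0, 8.0, 2.0, 7.0, 4.0]     # task workloads (invented)
    speed = [1.0, 2.0, 1.5]                    # processor speeds (invented)

    def modelled_time(assign):
        load = [0.0] * len(speed)
        for task, proc in enumerate(assign):
            load[proc] += work[task] / speed[proc]
        return max(load)

    def simulated_annealing(steps=5000, temp=1.0, cooling=0.999):
        assign = [random.randrange(len(speed)) for _ in work]
        best, best_cost = assign[:], modelled_time(assign)
        cost = best_cost
        for _ in range(steps):
            cand = assign[:]                       # move: reassign one task
            cand[random.randrange(len(work))] = random.randrange(len(speed))
            c = modelled_time(cand)
            if c < cost or random.random() < math.exp((cost - c) / temp):
                assign, cost = cand, c
                if c < best_cost:
                    best, best_cost = cand[:], c
            temp *= cooling
        return best, best_cost

    print(simulated_annealing())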
- 11.00: Pierre-Francois Dutot - Oracle-based approximation
algorithms for the discrete resource sharing scheduling
problem.
In this talk I will introduce both the discrete resource sharing scheduling problem and the methodology we used to find approximation algorithms for this problem. This methodology has been formalized in order to provide a new way to look for approximation algorithms. In some cases a PTAS can be designed using this methodology.
- 13.40: Emmanuel Jeannot - Bi-Objective Approximation
Scheme for Makespan and Reliability Optimization on Uniform
Parallel Machine.
We study the problem of scheduling independent tasks on a set of related processors which have a probability of failure governed by an exponential law. We are interested in the bi-objective analysis, namely the simultaneous optimization of the makespan and the reliability. We study its complexity and provide an exact algorithm in some cases, an approximation algorithm in others, and a heuristic in the general case.
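To make the two objectives concrete, the sketch below evaluates both the makespan and the reliability of a given assignment under an exponential failure law, assuming task failures are independent; the speeds, failure rates and task sizes are invented:
    import math

    # Hypothetical instance: related processors with speeds and exponential
    # failure rates; tasks are independent with given sizes.
    speeds = [2.0, 1.0]          # processor speeds (invented)
    lam    = [1e-3, 5e-4]        # failure rates per time unit (invented)
    tasks  = [4.0, 6.0, 3.0]     # task sizes (invented)
    assign = [0, 1, 0]           # task -> processor

    def makespan_and_reliability(assign):
        load = [0.0] * len(speeds)
        log_rel = 0.0
        for t, p in zip(tasks, assign):
            exec_time = t / speeds[p]
            load[p] += exec_time
            # A task run on processor p for exec_time units succeeds with
            # probability exp(-lam[p] * exec_time); the whole schedule
            # succeeds if every task does (independence assumed here).
            log_rel += -lam[p] * exec_time
        return max(load), math.exp(log_rel)

    print(makespan_and_reliability(assign))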
- 14.20: Arnold Rosenberg - On "exploiting"
node-heterogeneous clusters optimally.
Joint work with Micah Adler and Ying Gong.
We prove that FIFO worksharing protocols provide asymptotically
optimal solutions to
the Cluster-Exploitation Problem (CEP), wherein one seeks to
accomplish as much work as
possible on a node-heterogeneous cluster
C during a prespecified fixed period of L time
units. The worksharing protocols that we study send work to each of
C's computers in
a single "bundle," and they require each computer to return the
results from its work in a single "bundle." The protocols are
crafted within a model that characterizes C via parameters that
measure C's computers' computational and communicational powers
and that allow only one message at a time to be in transit. A
protocol observes a FIFO regimen if it has
C's computers finish their assigned work, and return their results, in the
same order in which they are supplied with work. FIFO protocols solve
the CEP optimally
in that they accomplish at least as much work as any other protocol
during all sufficiently
long worksharing episodes. Simulation experiments illustrate that the
superiority of FIFO
protocols is often observed during worksharing episodes of only a few
minutes' duration.
Protocols that solve the CEP optimally can easily be adapted to
provide optimal solutions
to the Cluster-Rental Problem, wherein one seeks to complete W units
of work by "renting" C for as short a time as necessary.
- 15.00: Louis-Claude Canon - A Meta-Greedy Approach applied to a Multiobjectives Scheduling Problem.
Scheduling problems can often be tackled with respect to several objectives. Makespan, reliability and stretch are examples of such objectives. In this case, there can be several incomparable Pareto-optimal solutions. We propose an approach that refines an EFT scheduling heuristic by considering every objective at each step. We show that the modified method is able to generate, in a single execution, a set of non-dominated solutions that approximates the set of Pareto-optimal solutions.
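The baseline being refined is an Earliest-Finish-Time (EFT) list-scheduling heuristic. A minimal single-objective EFT sketch on an invented task graph (communication costs ignored) shows the per-step decision that the meta-greedy approach extends to several objectives:
    # Minimal Earliest-Finish-Time (EFT) list scheduling on a small task
    # graph with precedence constraints; all values are invented.
    tasks = ["A", "B", "C", "D"]                 # already in topological order
    preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    cost  = {"A": [3, 4], "B": [2, 5], "C": [4, 2], "D": [3, 3]}  # cost[task][processor]

    proc_ready = [0, 0]          # when each processor becomes free
    finish = {}                  # finish time of each scheduled task
    for t in tasks:
        data_ready = max((finish[p] for p in preds[t]), default=0)
        # Pick the processor on which the task finishes earliest.
        options = [(max(data_ready, proc_ready[p]) + cost[t][p], p)
                   for p in range(len(proc_ready))]
        f, p = min(options)
        finish[t], proc_ready[p] = f, f
    print("makespan:", max(finish.values()))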
- 15.20: Nicolas Gast - An asymptotic method for
optimization in stochastic systems and applications to
resource allocation problems.
We present a generic framework for optimization in large-scale
stochastic systems under mean field interactions. In particular,
we show that when the size of the system grows, the average
behavior of the system goes to a deterministic limit. This
provides a way to compute optimal policies for various
stochastic problems. We further provide insights into the speed of convergence with an explicit formula.
Then, our framework is applied to a brokering problem in grid computing.
The optimal policy for the deterministic limit is computed. We compare
the performance of this policy applied to the initial stochastic
system with classical policies (such as Join the Shortest Queue). This
example shows that our model's performance is good even for systems of
small size.
- 16.00: Fanny Dufosse - On the Complexity of Mapping
Pipelined Filtering Services on Heterogeneous
Platforms.
In this paper, we explore the problem of mapping filtering services on large-scale
heterogeneous platforms. Two important optimization criteria should be considered
in such a framework. The period, which is the inverse of the throughput, measures
the rate at which data sets can enter the system. The latency measures the response
time of the system in order to process one single data set entirely. Both criteria are
antagonistic. For homogeneous platforms, the complexity of period minimization is
already known; we derive an algorithm to solve the latency minimization problem
in the general case with service precedence constraints; for independent services we
also show that the bi-criteria problem (latency minimization without exceeding a
prescribed value for the period) is of polynomial complexity. However, when adding
heterogeneity to the platform, we prove that minimizing the period or the latency
becomes NP-hard, and that these problems cannot be approximated by any constant
factor (unless P=NP). The latter results hold true even for independent services. We
provide an integer linear program to solve both problems in the heterogeneous case
with independent services.
For period minimization on heterogeneous platforms, we design some
efficient polynomial time heuristics and we assess their
relative and absolute performance through
a set of experiments. For small problem instances, the results are very close to the
optimal solution returned by the integer linear program.
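For intuition about the two criteria, the sketch below computes the period and latency of a small linear chain of filtering services, assuming one service per identical processor and that a service's time scales with the volume of data reaching it; the costs and selectivities are invented:
    # Invented linear chain of filtering services: each service i has a base
    # cost c[i] and a selectivity sel[i] (output size / input size).
    c   = [10.0, 6.0, 8.0, 4.0]
    sel = [0.5, 1.2, 0.25, 1.0]

    # Assumption of this sketch: each service runs on its own processor and
    # its time is its base cost scaled by the data volume it receives, i.e.
    # the product of the selectivities of the services before it.
    volume = 1.0
    times = []
    for ci, si in zip(c, sel):
        times.append(ci * volume)   # time to process one data set at this stage
        volume *= si                # the stage shrinks or expands the data set

    period  = max(times)            # governs the rate at which data sets enter
    latency = sum(times)            # time for one data set to traverse the chain
    print("stage times:", times, "period:", period, "latency:", latency)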
- 16.20: Loic Magnan - Scheduling algorithms for workflow
optimization.
Pipelined workflows are a popular programming paradigm for parallel
applications. In these workflows, the computation is divided into
several stages and these stages are connected to each other through
first-in first-out channels. In order to execute these workflows on
a parallel machine, we must first find the mapping of these stages onto the various processors of the machine. After computing the mapping,
we must compute the schedule, that is, the order in which the various
stages execute on their assigned processors.
In this talk, we present results on scheduling of linear workflows,
given the mapping. Linear workflows are a special case of workflows
where the stages can be represented by a linear graph. The objective
of the scheduling algorithms is to either maximize throughput or
minimize latency or both. We consider two realistic models: the
one-port model and the multi-port model. In both models, finding a
schedule to minimize latency is trivial. However, computing the
schedule to minimize period (maximize throughput) is NP-Hard in the
one-port model, but can be computed in polynomial time in the
multi-port model. We also present an approximation algorithm
minimizing period in the one-port model.
- 16.40: Alexandru Dobrila - Throughput optimization for
micro-factories subject to failures.
We discuss the problem of optimizing the throughput for
micro-factories subject to failures. The challenge consists in
mapping several tasks onto a set of machines. The originality of
our approach is the failure model for such applications in which
tasks are subject to failures rather than machines.
If there is exactly one task per machine in the mapping, then we
prove that the optimal solution can be computed in polynomial
time. However, the problem becomes NP-hard if several tasks can be
assigned to the same machine. Several polynomial time heuristics are presented for the most realistic specialized setting, in which tasks of the same type can be mapped onto the same machine. Experimental results show that the best heuristics obtain a good throughput, much better than the throughput obtained with a random mapping. Moreover, we obtain a throughput close to the optimal solution in the particular cases in which the optimal throughput can be computed.
- 17.00: Ron Chiang - Toward Understanding Heterogeneity in
Computing.
Heterogeneity certainly complicates the efficient use of computing
platforms, but does it enhance their performance? their cost
effectiveness? How can one measure the power of a heterogeneous
assemblage of computers ("cluster," for short), both in absolute and
relative terms (i.e., comparing the power of two clusters)? Is one
better off with a cluster that has one super-fast computer and the
rest just "average" in speed, or with a cluster all of whose
computers are moderately fast? If you could afford to speed up
just one computer in your cluster, which one would you choose: the
fastest? the slowest? How does one even ask questions such as these
in a formal, yet tractable framework? A framework is proposed, and
some answers are derived, a few rather surprising.
- Thursday, May 14
- 9.20: Anne Benoit - Mapping Filtering Streaming
Applications with Communication Costs.
In this talk, we explore the problem of mapping filtering
streaming applications on large-scale homogeneous platforms, with
a particular emphasis on communication models and their impact.
Filtering applications are streaming applications where each node also has a selectivity, which either increases or decreases the size of its input data set. This selectivity makes the problem of scheduling these applications more challenging than the more widely studied problem of scheduling "non-filtering" streaming workflows.
We identify three significant realistic communication models. For
each of them, we address the complexity of the following important
problems:
(i) Given an execution graph, how can one compute the
period and latency? A solution to this problem is an operation list
which provides the time-steps at which each computation and each
communication occurs in the system.
(ii) Given a filtering workflow problem, how can one compute the schedule that
minimizes the period or latency? A solution to this problem requires
generating both the execution graph and the associated operation list.
Altogether, with three models, two problems and two objectives, we
present 12 complexity results, thereby providing solid theoretical
foundations for the study of filtering streaming applications.
- 10.20: Uwe
Schwiegelshohn - Online Scheduling with QoS
Constraints.
We discuss
an online scheduling problem with parallel identical machines and a
quality-of-service (QoS) constraint. The QoS constraint is represented with the
help of a special form of machine eligibility constraints: These constraints
must satisfy the condition that for any two machine eligibility sets Mi
and Mj, one of the sets is a subset of the other. It is well known
that the least flexible job first rule (LFJ) produces an optimal solution for
the problem Pm|Mj|Cmax if the machine
eligibility constraints satisfy the inclusion property as in the QoS case.
However, LFJ scheduling cannot guarantee optimality in an online scenario, as newly submitted jobs may be less flexible than previously scheduled jobs. Therefore, we address the performance of arbitrary list scheduling. First, we assume a very large number of machines and a very large number of jobs with unique processing times, that is, we consider a continuous version of this scheduling problem. If jobs can be allocated to all eligible machines, there is no constant competitive ratio for list scheduling. Therefore, we suggest allocating a job only to a fixed percentage of its eligible machines and prohibiting its allocation to the machines with the highest QoS property. This approach leads to a first-order linear differential equation. By solving this equation, we show that 50% eligibility guarantees a competitive ratio of 4, which is the best result for this approach. With the help of the online results for Pm||Cmax, we can then obtain a solution for discrete scheduling problems.
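A minimal sketch of the least flexible job first (LFJ) dispatching rule with nested eligibility sets, purely to illustrate the rule; the jobs, processing times and machines are invented, and the sketch does not reproduce the talk's competitive analysis:
    # LFJ with nested machine-eligibility sets (the QoS case); data invented.
    machines = ["m0", "m1", "m2"]            # m0 = highest QoS
    jobs = [
        # (name, processing_time, eligible machines -- nested sets)
        ("j1", 4, {"m0"}),
        ("j2", 3, {"m0", "m1"}),
        ("j3", 5, {"m0", "m1", "m2"}),
        ("j4", 2, {"m0", "m1", "m2"}),
    ]

    free_at = {m: 0 for m in machines}
    schedule = []
    # LFJ: always pick the job with the fewest eligible machines first.
    for name, p, elig in sorted(jobs, key=lambda j: len(j[2])):
        m = min(elig, key=lambda mm: free_at[mm])   # earliest available eligible machine
        start = free_at[m]
        free_at[m] = start + p
        schedule.append((name, m, start, start + p))

    print(schedule)
    print("Cmax:", max(free_at.values()))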
- 11.00: Lionel Eyraud-Dubois - Allocation of Clients to
Multiple Servers on Large Scale Heterogeneous
Platforms.
We consider the problem of allocating a large number of
independent, equal-sized tasks to a heterogeneous large scale
computing platform. We model the platform using a set of
servers (masters) that initially hold (or generate) the tasks to
be processed by a set of clients (slaves). All resources have
different speeds of communication and computation and we model
contentions using the bounded multi-port model. Under this model,
a processor can be involved simultaneously in several
communications, provided that its incoming and outgoing
bandwidths are not exceeded. This model corresponds well to
modern networking technologies, but for the sake of realism, we
introduce another parameter that bounds the number of
simultaneous connections that can be opened at a server node. We
prove this additional parameter makes the problem of maximizing
the overall throughput (the fractional number of tasks that
can be processed within one time-unit) NP-Complete. This result
is closely related to results on bin packing with splittable
items and cardinality constraints. We also propose a polynomial time algorithm based on a small resource augmentation: it achieves optimal throughput while using only one more connection per server node. This algorithm also provides a very good approximation for the dual problem of minimizing the maximal number of connections that need to be opened in order to achieve a given throughput. Finally, we present extensive simulations to assess the performance of the proposed algorithm.
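Without the bound on simultaneously open connections, steady-state throughput maximization reduces to a small linear program. The sketch below (scipy, invented capacities) shows that relaxed problem; the connection bound that makes the real problem NP-complete is deliberately left out:
    import numpy as np
    from scipy.optimize import linprog

    # Invented instance: 2 servers, 3 clients. Variable x[s][c] = rate of
    # tasks sent from server s to client c (tasks per time unit).
    out_bw  = [10.0, 6.0]         # server outgoing capacities, in tasks/unit
    cli_cap = [4.0, 5.0, 3.0]     # per-client min(incoming bandwidth, speed)

    n_s, n_c = len(out_bw), len(cli_cap)
    n = n_s * n_c                 # one variable per (server, client) pair

    A_ub, b_ub = [], []
    for s in range(n_s):          # sum_c x[s][c] <= out_bw[s]
        row = np.zeros(n); row[s * n_c:(s + 1) * n_c] = 1.0
        A_ub.append(row); b_ub.append(out_bw[s])
    for c in range(n_c):          # sum_s x[s][c] <= cli_cap[c]
        row = np.zeros(n); row[c::n_c] = 1.0
        A_ub.append(row); b_ub.append(cli_cap[c])

    # Maximize total throughput sum(x)  <=>  minimize -sum(x).
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=b_ub,
                  bounds=[(0, None)] * n, method="highs")
    print("optimal (relaxed) throughput:", -res.fun)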
- 13.20: Fengguang Song - A Scalable Multicast Scheme for
Distributed DAG Scheduling.
We have designed an application-level non-blocking multicast scheme for
dynamic DAG scheduling on large-scale distributed-memory systems. The
multicast scheme takes into account both network topology and space
requirement of routing tables to achieve scalability. Specifically, we
prove that the scheme is deadlock-free and takes at most log N steps to complete. The routing table chooses appropriate neighbors to store based on topology IDs and has a small space requirement of O(log N). Our experimental results show that the scheme is significantly better than the simple flat-tree method and is close to the vendor's collective MPI operations.
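The log N bound follows the usual doubling argument: in each step, every node that already holds the message forwards it to one node that does not. A small sketch of that argument; it illustrates only the step count, not the paper's topology-aware routing tables:
    import math

    # Binomial-tree multicast: at every step, each node that already has the
    # message sends it to one node that does not, so N nodes are covered in
    # ceil(log2(N)) steps.
    def multicast_rounds(n_nodes, root=0):
        have = [root]
        rounds = []
        while len(have) < n_nodes:
            sends = []
            for s in list(have):
                nxt = len(have)
                if nxt >= n_nodes:
                    break
                sends.append((s, nxt))   # node s forwards to node nxt
                have.append(nxt)
            rounds.append(sends)
        return rounds

    r = multicast_rounds(10)
    print(len(r), "steps for 10 nodes; ceil(log2(10)) =", math.ceil(math.log2(10)))
    for i, sends in enumerate(r):
        print("step", i, ":", sends)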
- 13.40: Benjamin Depardon - Self-Stabilizing k-clustering
for weighted graphs.
Efficiently scheduling jobs on a platform often requires partitioning the platform according to a given metric, such as the latency between nodes. Latency costs within such an infrastructure can be improved, or at least bounded, by using a k-clustering. A k-clustering of a graph is a partition of the nodes into disjoint sets, called clusters, in which every node is at distance at most k from a designated node in its cluster, called the clusterhead. As Grids are distributed, changing, and error-prone environments, we designed a self-stabilizing asynchronous distributed algorithm for constructing a k-clustering of a connected network of processes with unique IDs and weighted edges. The algorithm is comparison-based, takes O(nk) time, and uses O(log(n) + log(k)) space per process, where n is the size of the network. To the best of our knowledge, this is the first distributed solution to the k-clustering problem on weighted graphs.
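To illustrate the k-clustering definition itself, here is a centralized greedy sketch on a small weighted graph (pick an uncovered node as clusterhead, claim everything within distance k, repeat); unlike the talk's algorithm it is neither distributed nor self-stabilizing, and the graph is invented:
    import heapq

    graph = {  # adjacency lists with edge weights (invented)
        "a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 2, "d": 5},
        "c": {"a": 4, "b": 2, "d": 1}, "d": {"b": 5, "c": 1, "e": 3},
        "e": {"d": 3},
    }

    def within_k(src, k):
        # Dijkstra restricted to distance <= k.
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in graph[u].items():
                if d + w <= k and d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return set(dist)

    def k_clustering(k):
        uncovered, clusters = set(graph), {}
        while uncovered:
            head = min(uncovered)                   # arbitrary deterministic choice
            clusters[head] = within_k(head, k) & uncovered
            uncovered -= clusters[head]
        return clusters                             # clusterhead -> cluster members

    print(k_clustering(k=3))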
- 14.00: Matthew Nabity - Look ahead technique for
reduction to Hessenberg form: design of the algorithm and
applicability on current hardware.
The classic approach to solve the nonsymmetric eigenvalue problem is to first reduce a nonsymmetric (square) matrix to Hessenberg form and then to perform the Hessenberg QR algorithm. This talk is concerned with the first part: the reduction to Hessenberg form.
Reduction to Hessenberg form is a challenging task on modern computers. This operation belongs to the class of two-sided transformations. The algorithm has n steps where n is the size of the initial matrix and, at step k, a linear transformation needs to be applied on the left and on the right of a matrix of order (n-k). This implies huge constraints on the memory bandwidth of the machine.
Dongarra, Sorensen and Hammarling [1] proposed a block algorithm for the reduction to Hessenberg form. The algorithm consists of a loop of panel factorizations and updates of the trailing matrix. Updates of the trailing matrix are performed with Level-3 BLAS and encompass a major part of the operations, which gives a somewhat efficient algorithm. A slight (more economical) variant is used in LAPACK. Unfortunately, the panel factorization is based on Level-2 BLAS and can consume a considerable amount of time on a multi-core platform.
In this presentation we extend the idea of Dongarra, Sorensen and Hammarling from blocking to look-ahead. The idea of look-ahead is to overlap the panel factorization and the update. We trade flops, memory copies and memory space for this overlap.
It is understood that look-ahead does not bring a tremendous speed-up. Since there are only two tasks (panel and update), the best speed-up we can target is a factor of two. Moreover, on multi-core architectures, the panel factorization takes considerably longer than the update, and we therefore predict that this technique will bring very little speedup, if any, in that basic setup. However, recent experiments by Tomov [2] show that the panel factorization can be performed efficiently on a GPU. In this hybrid context, where one unit is designed for the panel factorization (the GPU) and another for the update (the multi-core), the look-ahead technique offers an interesting alternative to the current LAPACK algorithm.
Further development is needed to make this approach effective on a multi-core-only architecture. For example, one can think of changing the data structure of the matrix: one data structure designed for the update and another for the panel factorization, yielding, in some sense, a hybrid data structure.
Overall, we do not have a magic solution for the reduction to Hessenberg form on multi-core platforms. We indeed think that the basic algorithm is as good as it can get in the big-O sense. In this work, we improve data movement efficiency by exploiting the data structure and/or the characteristics of the computing components, but unfortunately the amount of data to be moved (the bandwidth strain) to go from dense to Hessenberg form is, in the big-O sense, the same as for the classic algorithm or the LAPACK algorithm.
[1] J. J. Dongarra, D. C. Sorensen, and S. J. Hammarling. Block reduction of
matrices to condensed forms for eigenvalue computations. J. Comput. Appl.
Math., 27(1-2):215-227, 1989. Reprinted in Parallel algorithms for numerical
linear algebra, 215-227, North-Holland, Amsterdam, 1990.
[2] S. Tomov. Personal communication.
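For reference, a minimal unblocked Householder reduction to Hessenberg form, i.e. the textbook algorithm that the blocked and look-ahead variants reorganize; this sketch is illustrative only and contains none of the blocking, look-ahead or hybrid data structures discussed above:
    import numpy as np

    # Unblocked Householder reduction to Hessenberg form.
    def hessenberg(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        for k in range(n - 2):
            x = A[k + 1:, k].copy()
            alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
            v = x.copy()
            v[0] -= alpha
            norm_v = np.linalg.norm(v)
            if norm_v == 0:
                continue
            v /= norm_v
            # Two-sided update A <- H A H with H = I - 2 v v^T acting on the
            # trailing rows/columns; eigenvalues are preserved by similarity.
            A[k + 1:, k:] -= 2.0 * np.outer(v, v @ A[k + 1:, k:])
            A[:, k + 1:]  -= 2.0 * np.outer(A[:, k + 1:] @ v, v)
        return A

    A = np.random.rand(6, 6)
    H = hessenberg(A)
    print(np.allclose(np.tril(H, -2), 0))   # everything below the first subdiagonal is zero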
- 14.20: Mathias Jacquelin - Steady-State Scheduling on
CELL.
During this talk, we will present ongoing work on steady-state
scheduling for the CELL processor.
We focus on stream processing applications that consist of many different pipelined tasks organized as a directed graph, and we allocate these tasks to the different processing elements of the CELL in order to ensure an optimal throughput.
We have implemented such an application (a vocoder).
We will present the theoretical foundations of this work as well as
preliminary performance results.
- 15.00: Emmanuel Agullo - Comparative Study of One-Sided
Linear Algebra Factorizations with
Open-Source and Equivalent Vendor Software on Various Multi-Core
Hardware.
The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of the established algorithms in order to take advantage of the now prevailing parallelism. The Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) project aims at achieving both high performance and portability across a wide range of multi-core architectures. I will present a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), new approaches to parallel execution (Task Based Linear Algebra Subroutines - TBLAS), as well as equivalent commercial software offerings (MKL, ESSL and PESSL). Our experiments were conducted on three one-sided linear algebra factorizations (LU, QR and Cholesky) and used two multi-core architectures (based on Intel Xeon EMT64 and IBM Power6).
The performance results show improvements brought by new algorithms on up
to 32 cores - the largest system we could access.
- 15.40: Piotr Luszczek - Generic Dynamic Scheduler for
Numerical Libraries on Multicore Processors.
Different approaches are briefly discussed for implementing numerical
software on multicore processors, including the static
scheduling employed currently by the PLASMA project, as well as
dynamic scheduling using Cilk and SMPSs. Declarative programming
is also considered. Performance comparisons are presented. A
presentation of ICL's dynamic scheduler follows. The scheduler closely follows SMPSs in methodology: the DAG of the computation is explored through a sliding window, and tasks are scheduled according to data dependencies. RAW, WAR and WAW data hazards are considered. Unlike SMPSs, the scheduler relies exclusively on an API, without any use of compiler technology. Finally,
opportunities are discussed for customizations specific to
numerical computing.
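A toy sketch of the dependency analysis such a scheduler performs: tasks declare the data they read and write, edges are inferred from RAW, WAR and WAW hazards, and tasks are released as their predecessors complete. The task names and the API below are invented, not ICL's actual interface:
    # Invented tasks with their read/write sets (loosely Cholesky-flavoured).
    tasks = [
        ("potrf1", {"reads": set(),   "writes": {"A11"}}),
        ("trsm1",  {"reads": {"A11"}, "writes": {"A21"}}),
        ("syrk1",  {"reads": {"A21"}, "writes": {"A22"}}),
        ("potrf2", {"reads": set(),   "writes": {"A22"}}),
    ]

    deps = {name: set() for name, _ in tasks}
    for i, (ni, ti) in enumerate(tasks):
        for nj, tj in tasks[i + 1:]:
            raw = ti["writes"] & tj["reads"]     # read-after-write
            war = ti["reads"]  & tj["writes"]    # write-after-read
            waw = ti["writes"] & tj["writes"]    # write-after-write
            if raw or war or waw:
                deps[nj].add(ni)                 # nj must wait for ni

    # "Execute" tasks as their dependencies complete (sequentially here).
    done, order = set(), []
    while len(done) < len(tasks):
        for name, _ in tasks:
            if name not in done and deps[name] <= done:
                order.append(name)
                done.add(name)
    print(order)    # a valid execution order respecting all hazards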
- 16.20: Richard Vuduc - Asynchronous Parallel Dense Linear
Algebra in the Concurrent Collections Programming
Model.
We apply a novel and general programming model, the Intel Concurrent Collections (CnC), to the implementation of asynchronous parallel dense linear algebra algorithms on current multicore and future manycore systems. In CnC, the programmer expresses her
computation in terms of application-specific operations,
partially-ordered by semantic scheduling constraints. We
implement and evaluate CnC versions of both a recently proposed asynchronous parallel algorithm for computing a Cholesky factorization by Buttari et al. and a novel
higher-level asynchronous eigensolver for dense symmetric
matrices. Given a well-tuned sequential BLAS, our
implementations match or exceed competing hand-tuned codes by up
to 2x. We include experimental evaluations against predominantly
bulk-synchronous approaches in alternative models, including
ScaLAPACK with a shared memory MPI, OpenMP, and Cilk++. Looking
forward, we identify new opportunities to improve the CnC
language and run-time scheduling and execution. This talk is
based on the work of Aparna Chandramowlishwaran (Georgia Tech),
in collaboration with myself and Kathleen Knobe (Intel).
- Friday, May 15
- 9.00: Bora Ucar - Towards parallel bipartite matching
algorithms.
We consider matchings yielding maximum diagonal products in sparse matrices. An optimal matching is invariant under suitable scaling. We use scaling algorithms to obtain at least one entry of magnitude 1.0 in each row and column. We then find a maximum cardinality matching using only the entries of magnitude 1.0. If the matching is perfect, we are done. If not, we select a special entry and rescale the matrix. We repeat this process until a perfect matching is found. Such an approach is guaranteed to terminate with an optimal matching, but it can be painfully slow. We will discuss alternatives where optimality is sacrificed for practicality.
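A rough sketch of the scaling idea (not the talk's full algorithm): scale rows and columns so that the largest entry in each row and column becomes 1.0, then search for a perfect matching among the entries equal to 1.0; the matrix and the simple augmenting-path matching below are only for illustration:
    import numpy as np

    A = np.array([[4.0, 2.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [5.0, 0.0, 2.0]])

    B = np.abs(A)
    for _ in range(50):                       # alternate row/column max-scaling
        B = B / B.max(axis=1, keepdims=True)  # largest entry of each row -> 1
        B = B / B.max(axis=0, keepdims=True)  # largest entry of each column -> 1

    ones = np.isclose(B, 1.0)                 # candidate (row, column) pairs

    def augment(r, seen, match_col):          # simple augmenting-path matching
        for c in range(ones.shape[1]):
            if ones[r, c] and c not in seen:
                seen.add(c)
                if match_col[c] is None or augment(match_col[c], seen, match_col):
                    match_col[c] = r
                    return True
        return False

    match_col = [None] * ones.shape[1]
    size = sum(augment(r, set(), match_col) for r in range(ones.shape[0]))
    print("matching size:", size, "perfect:", size == ones.shape[0])
    print("column -> row:", match_col)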
- 9.40: Abdou Guermouche - Memory-Aware Scheduling for
Sparse Direct Methods.
Improving the memory behaviour is critical to solve huge
systems of sparse linear equations using parallel direct methods.
Indeed, the limiting factor of such methods to treat arbitrarily
large problems is always memory (rather than performance). To
tackle this problem, many studies and algorithms have been proposed
in the past. The main approach that allows a solver to process
very large problems is the use of disks to extend the memory
(out-of-core execution scheme). Even then, the memory scalability of what remains in core memory is critical and deserves special attention. In this talk, we propose a new way of mapping tasks to processors that achieves high memory scalability. We present some preliminary results using our new algorithm within the MUMPS parallel sparse direct solver. Finally, we give some hints about our current and future work, focusing on the use of multicore architectures and of specific computing devices such as GPUs.
- 10.40: Padma Raghavan - Energy-Aware Scheduling for
Scalable Matrix Computations.
As we approach parallel computing at the petascale and beyond with ensembles of many-core chip multiprocessors, a key challenge concerns sustainable energy-aware performance scaling of matrix
computations. We consider how this challenge can be addressed by
scheduling to explicitly reduce energy while improving
performance. We discuss initial results on the energy and
performance impacts of task-to-processor node mapping and
thread-to-core mapping at a processor node for parallel sparse
linear system solvers.
- 11.20: Julien Langou - Fault Tolerant Linear Algebra:
goals and methods.
Our goal is to develop algorithms for dense linear algebra that are efficient and fault tolerant. Given a dense linear algebra computation, we would like to detect, locate and correct any given set of errors. We will compare the computational cost of two approaches. The first approach (RCRE) is a combination of two known techniques: "Computing Errors" are handled by "Residual Checking and Rollback", and "Data Errors" are handled by "Encoding". These techniques have been shown to be the most effective for their respective tasks. The second approach is based on Algorithm-Based Fault Tolerance (ABFT). To make ABFT practical for our goal, we add two key ingredients: the first is "small steps"; the second is the simultaneous use of row and column checksums. We conclude our study with a comparison of the two aforementioned methods (RCRE vs. ABFT). We compare how the methods behave in the presence of Data Errors and Computing Errors and discuss their computational costs.
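To illustrate the row/column checksum ingredient, the hedged sketch below augments a matrix product with checksum rows and columns, then detects, locates and corrects a single injected error; the scheme shown is the classical ABFT construction for matrix multiplication, not necessarily the talk's exact method:
    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))

    A_c = np.vstack([A, A.sum(axis=0)])                  # append checksum row to A
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])   # append checksum column to B
    C_full = A_c @ B_r                                   # 5x5: data block plus checksums

    # Inject a fault into the data part of C.
    C_full[1, 2] += 0.5

    row_err = C_full[:4, :4].sum(axis=0) - C_full[4, :4]   # column checksum mismatch
    col_err = C_full[:4, :4].sum(axis=1) - C_full[:4, 4]   # row checksum mismatch
    bad_col = int(np.argmax(np.abs(row_err)))
    bad_row = int(np.argmax(np.abs(col_err)))
    print("fault located at", (bad_row, bad_col))           # -> (1, 2)

    # The erroneous entry can then be corrected from either checksum.
    C_full[bad_row, bad_col] -= row_err[bad_col]
    print("corrected:", np.isclose(C_full[1, 2], (A @ B)[1, 2]))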
Contact
For more information, please contact Anne Benoit.