slides

slides

For parallelism, the central problem is to partition sparse matrices, graphs, or hypergraphs into nearly equal-sized parts while reducing inter-processor communication. For better cache performance, the problem is to reorder sparse matrices by suitable row and column permutations. Common approaches to such problems involve multilevel methods based on coarsening and uncoarsening (hyper)graphs, matching of similar vertices, searching for good separator sets and good splittings, and two-dimensional matrix-splitting methods such as those incorporated in the software package Mondriaan.
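As a toy illustration of the partitioning objective (not Mondriaan's actual multilevel algorithm), the sketch below greedily assigns matrix rows to two parts while balancing nonzero counts, then counts "cut" columns, i.e. columns with nonzeros in more than one part, a simple proxy for communication volume. All function names and the example matrix are hypothetical.

```python
def partition_rows(rows, nparts=2):
    """Greedily assign each row to the currently lightest part (by nonzero count)."""
    loads = [0] * nparts
    assign = {}
    for r, cols in sorted(rows.items(), key=lambda kv: -len(kv[1])):
        p = loads.index(min(loads))
        assign[r] = p
        loads[p] += len(cols)
    return assign

def cut_columns(rows, assign):
    """Columns whose nonzeros span more than one part require communication."""
    owners = {}
    for r, cols in rows.items():
        for c in cols:
            owners.setdefault(c, set()).add(assign[r])
    return sum(1 for ps in owners.values() if len(ps) > 1)

# 4x4 "arrow" matrix: row 0 is dense, the others touch column 0 and the diagonal
A = {0: {0, 1, 2, 3}, 1: {0, 1}, 2: {0, 2}, 3: {0, 3}}
assign = partition_rows(A)
print(cut_columns(A, assign))  # 3
```

Real partitioners minimize a hypergraph cut metric that counts the exact communication volume, and Mondriaan additionally splits in both matrix dimensions; this greedy row scheme only conveys the balance-versus-cut tradeoff.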

We will discuss new algorithms and features included in version 3.0 of the Mondriaan package, to be released soon. Using this package and its subprograms MondriaanPlot and MondriaanMovie, we can visualise the partitioning of a sparse matrix by various algorithms; this can also be done in Matlab. Mondriaan has now been turned into a library that can be called from other programs, such as Matlab or Mathematica, or run as a standalone program. New reordering methods such as Separated Block Diagonal (SBD) have been included, alongside well-known methods such as Bordered Block Diagonal; doubly separated and doubly bordered versions are also included.

slides. In case the pdf file does not display the movies, you can download the files here as a tar archive.

slides

slides

In this talk we briefly demonstrate the benefits of Tarragon on two model applications: finite-difference stencil computations and DNA sequence alignment. Then we discuss in depth the implementation of parallel sparse LU factorization with Tarragon. To this end, we propose a novel task decomposition scheme and discuss the challenges that we face and the tradeoffs that arise: specifically, how to expose a high degree of parallelism, to reduce the space used for representing the graph, to hide the communication latency, and to preserve locality. Finally, we present a performance analysis of our preliminary results.
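To give a feel for the task-graph execution model that libraries like Tarragon build on, here is a minimal sequential executor: each task runs once all of its predecessors have completed. This is only a sketch of the dataflow idea; Tarragon itself overlaps ready tasks with communication and runs them in parallel. All names here are hypothetical.

```python
from collections import deque

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> set of prerequisite names.
    Fires each task as soon as its in-degree drops to zero."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    children = {}
    for t, ds in deps.items():
        for d in ds:
            children.setdefault(d, []).append(t)
    ready = deque(t for t, n in indeg.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()
        order.append(t)
        for c in children.get(t, []):
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

# diamond-shaped dependency graph: a -> {b, c} -> d
log = []
tasks = {n: (lambda n=n: log.append(n)) for n in "abcd"}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
order = run_dag(tasks, deps)
print(order)  # ['a', 'b', 'c', 'd']
```

In a factorization like sparse LU, the tasks would be block updates and the dependency edges the data flow between them; the decomposition scheme determines how much of this graph is ready (parallel) at any moment.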

slides

slides (pdf)

slides

We will present a Markovian model for performance evaluation of a well-known load balancing technique, namely work stealing, in a large-scale heterogeneous system. Using mean field theory, we show that when the size of the system grows, this model converges to a system of deterministic ordinary differential equations that allows us to efficiently compute average performance indicators (such as average response times) as well as the distributions of these indicators.

We make a full study of this mean field model in the case where all resources are homogeneous, showing in particular that work stealing is very efficient, even when the latency of steals is large.

We also consider the case where geometry plays a role: the system is made of several clusters, and stealing within one cluster is faster than stealing between clusters. In such cases, we compare different work stealing policies based on stealing probabilities, and we show that the main factor in deciding where to steal from is the load rather than the stealing latency.
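To make the mean-field idea concrete, the sketch below Euler-integrates the classic mean-field ODEs for a large pool of independent M/M/1 queues, without any stealing term (the talk's model adds steal rates and latencies on top of such a system). Here x_i(t) is the fraction of queues holding at least i jobs, and in steady state x_i converges to rho^i. The function and parameter names are our own.

```python
def mean_field(lam, mu, levels=50, dt=0.01, T=200.0):
    """Euler integration of dx_i/dt = lam*(x_{i-1} - x_i) - mu*(x_i - x_{i+1}),
    with x_0 pinned to 1 and the state truncated at `levels` jobs."""
    x = [1.0] + [0.0] * levels
    for _ in range(int(T / dt)):
        new = x[:]
        for i in range(1, levels):
            arrive = lam * (x[i - 1] - x[i])   # a queue with i-1 jobs receives an arrival
            serve = mu * (x[i] - x[i + 1])     # a queue with i jobs completes a service
            new[i] = x[i] + dt * (arrive - serve)
        x = new
    return x

x = mean_field(lam=0.5, mu=1.0)
mean_jobs = sum(x[1:])          # E[jobs per queue] = sum_i P(queue >= i)
print(round(mean_jobs, 2))      # 1.0, matching rho/(1 - rho) for rho = 0.5
```

The point of the mean-field limit is exactly this: one small deterministic ODE system replaces simulating thousands of stochastic queues, so performance indicators such as the mean response time fall out of a cheap integration.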

slides

slides

slides

slides

slides

slides (pdf)

slides

slides

Frequency and voltage scheduling is a common technique for managing power in multiprocessor systems in general, and chip multiprocessors in particular. It minimizes energy consumption by dynamically matching the computation capacity of the system to the needs of the executing applications. It is crucial, however, to design power-aware scheduling algorithms that do not increase response time or reduce throughput, since users will not accept any degradation in the system's performance. In this talk, I will discuss frequency/voltage scheduling in a range of application models, including the Amdahl parallel model, the streaming model and the unstructured computation model.
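The core tension can be illustrated with a tiny sketch under the common power ~ f^3 model: energy for a fixed amount of work then grows like f^2, so the energy-optimal choice is the lowest frequency that still meets the performance constraint. This toy example (hypothetical P-states and a deadline constraint) is not any specific algorithm from the talk.

```python
def pick_frequency(work, deadline, freqs):
    """Return the lowest available frequency that finishes `work` by `deadline`.
    With power ~ f^3 and time = work/f, energy ~ f^2 * work, so slower is cheaper
    as long as the deadline (the performance guarantee) is still met."""
    feasible = [f for f in freqs if work / f <= deadline]
    return min(feasible) if feasible else None

# 10 units of work due in 4 time units: any f >= 2.5 works, so pick the 3.0 P-state
print(pick_frequency(work=10.0, deadline=4.0, freqs=[1.0, 2.0, 3.0, 4.0]))  # 3.0
```

The application models in the talk change what "the performance constraint" is: under the Amdahl model the serial fraction limits how far the parallel phase can be slowed, and in the streaming model the throughput of the slowest stage sets the bound.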

slides

On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency), classifying polynomial versus NP-complete instances. Furthermore, we derive an integer linear program that provides the optimal solution in the most general case. On the experimental side, we design polynomial-time heuristics, and assess their absolute performance thanks to the linear program. One main goal is to assess the impact of processor sharing on the quality of the solution.

slides

slides

We first analyze the First-Come, First-Served (FCFS) algorithm and show that it is a (2D + 1)-approximation algorithm, where D is the ratio of the largest task's size to the smallest task's size. We propose a dual-approximation based algorithm and show that it is locally a 2-approximation. Using the adversary technique, we also show that the D factor can hardly be improved on. Furthermore, using resource augmentation, we demonstrate that having rho additional machines can improve the approximation ratio up to the rho-th root of D. Based on this observation, we propose a heuristic that reserves a small portion of the cluster for tasks with large stretch values. Experimental results indicate that our dual-approximation based algorithm is an order of magnitude better than FCFS, and that the machine reservation heuristic can improve performance significantly.
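To see why FCFS degrades with the size ratio D, the sketch below computes stretch (flow time divided by task size) under FCFS on a single machine, a simplification of the cluster setting in the talk. A long task released just before a short one inflates the short task's stretch by roughly D. The names and the example instance are our own.

```python
def fcfs_stretches(tasks):
    """tasks: list of (release_time, size); single machine, served in release order.
    Stretch of a task = (completion - release) / size."""
    t = 0.0
    out = []
    for r, p in sorted(tasks):
        t = max(t, r) + p            # start when the machine is free and the task released
        out.append((t - r) / p)
    return out

# size ratio D = 10: the short task waits behind the long one
s = fcfs_stretches([(0.0, 10.0), (0.1, 1.0)])
print([round(v, 1) for v in s])  # [1.0, 10.9]
```

This is the behaviour that the reservation heuristic targets: by setting aside a few machines for tasks that would otherwise suffer large stretch, short tasks avoid queueing behind long ones.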

slides

abstract and speaker info (pdf) and slides (pdf)

slides

slides

slides

Our survey is based on job deadlines. When a job arrives in the waiting queue, the scheduler forecasts its placement using simple first-come, first-served scheduling and assigns it a deadline. The main idea is to use fewer resources than a job has requested in order to schedule it sooner, while still respecting its deadline. In practice, cluster node virtualization will be used to create "virtual nodes". When a resource is released, jobs in the waiting queue are rescheduled, taking into account the previously calculated deadlines. We use heuristics to choose the best job to schedule with the available resources and to allocate an appropriate number of processors to it.
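The shrink-to-fit idea above can be sketched as a small feasibility check: given the current time, a job's remaining work, its requested width and its deadline, compute the fewest processors (possibly virtual nodes) on which it still finishes on time. This assumes linear speedup, which is a toy model, and all names are hypothetical.

```python
import math

def procs_needed(now, job, free_procs):
    """job = (work, requested_procs, deadline). Return the smallest processor
    count that meets the deadline (never more than requested), or None if the
    job cannot be started now within its deadline on the free processors."""
    work, requested, deadline = job
    if deadline <= now:
        return None
    k = math.ceil(work / (deadline - now))  # fewest procs under linear speedup
    k = max(1, min(k, requested))
    return k if k <= free_procs else None

# a job requesting 8 procs for 16 units of work, due at t = 4: 4 procs suffice
print(procs_needed(now=0.0, job=(16.0, 8, 4.0), free_procs=6))  # 4
```

Starting the job on 4 of the 6 free processors instead of waiting for 8 is exactly the tradeoff the heuristics arbitrate: earlier start and better utilization against a longer run, bounded by the precomputed deadline.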

slides

slides

In this talk, we present a new vision of Cloud Computing that addresses these shortcomings. We propose the use of edge computers in two novel ways. For problem (1), we propose Nebulas, a Cloud comprised of edge computers located across the Internet. We show how several classes of applications may be better suited to this new Cloud model. For problem (2), we describe how a proxy network comprised of edge computers can accelerate applications that access one or more Clouds. In both cases, we present the resource management challenges associated with each model.

slides