| E. Maisonnave | CERFACS | maisonna@cerfacs.fr |
World's climate is currently changing due to the increase of the greenhouse gases. Climate fluctuations are forecasted for the years to come. For a proper study of the upcoming changes, numerical simulations are needed. Imperfection of the models and global insufficiency of observations make it difficult to tune model parametrization with precision. Uncertainty on climate response to greenhouse gases can be investigated by performing an ensemble prediction. Within this approach, a set of simulations with varying parameters must be launched. Each simulation models the evolution of the present climate followed by the 21st century.
Comparing these simulations, we expect to better understand the relations between the variation in parametrization with the variation in climate sensitivity to greenhouse gases.
Our goal regarding the application is to continue the work of (1). Once a proper model of the application is derived, appropriate scheduling heuristics are proposed, tested, compared, and then implemented within the DIET Grid middleware. Once the implementation done, we compare the simulations with real experiments realized on the Grid.
The proposed climate modeling application consists of executing simulations of present climate followed by the 21st century, for a total of 150 years (scenario). A scenario combines 1800 simulations of one month each (150x12), launched one after the other. The results from the nth monthly simulation are the starting point of the (n+1)th monthly simulation. For the whole experiment, several scenarios must be performed. The number of months per scenario and the number of scenarios are chosen by the user.
A monthly simulation can be divided into a pre-processing phase, a main-processing parallel task, and a post-processing phase of analysis. Figure 1 shows the different tasks during the execution of a month simulation (nodes) and the data dependencies between two consecutive months (edges). The number after the name of each task represents a possible duration of the tasks in seconds.
During the pre-processing phase, input files are updated and gathered in a single working directory by concatenate_atmospheric_input_files (caif) and the model parametrization is modified by modify_parameters (mp).
The main task process_coupled_run (pcr) performs a one month integration of the climate model. This model is composed by an atmosphere (ARPEGE (2)), an ocean and its sea-ice (OPA/NEMO (3)), and a river runoff model (TRIP (4)). The OASIS coupler (5) ensures simultaneous run of each element and synchronizes data exchanges. ARPEGE code is parallel (using MPI), while OPA, TRIP, and the OASIS coupler are sequential applications. With more than 8 processors allocated to \arpege the speedup stops. OPA, TRIP and OASIS each need a processor, so pcr needs from 4 to 11 processors to be able to work.
The post-processing phase consists of 3 tasks: a conversion task convert_output_format (cof) where each diagnostic file coming from the different elements of the climate model is standardized in a self-describing format; an analysis task extract_minimum_information (emi) where global or regional means on key regions are processed; and a compression task compress_diags (cd) where the volume of model diagnostic files is drastically reduced.
Data exchanges between two consecutive monthly simulations belonging to the same scenario reach 1 GB. The post-processing phase creates an archive with results of the month execution that can be interpreted to predict the climate. The size of this archive is almost 120 MB. Simulations are independent, so no other data is used.
Given the really short duration of the pre-processing tasks compared to the execution time of the main-processing task, we made the decision to merge them all in a single task. The same decision was taken for the 3 post-processing tasks. So, in regard of the model, there are now 2 tasks: the main-processing task and the post-processing task. Figure (2) presents the new dependencies between tasks after merging them together.

Figure 1 Two consecutive monthly simulations.

Figure 2 Merged tasks.
Grid'5000(6) is a grid composed of several clusters. Each cluster is composed of homogeneous resources but differs from one another. The clusters either have 2 or 4 cores per node. The speedup of the application is superlinear when adding resources. E.g. between 4 and 8 resources for a main-task, the speedup is almost 2.8.
The goal we want to achieve here is to minimize the overall makespan and also to keep some fairness, meaning we want all scenarios to progress in their execution at almost the same speed. We consider an homogeneous platform composed of R resources and that data on a cluster are available to all of its nodes. The idea of this scheduling algorithm is to divide the resources of the platform into disjoint sets on which multiprocessor tasks will be executed such that the overall makespan will be minimal.
In (7), (8), the authors propose to give more and more processors to the critical path of the application. Since all our DAGS are the same, and tasks inside the DAGS are also the same, we will not choose how many resources to give to a DAG, but choose which DAG will go on a given group of resources. To obtain fairness, when a group of resources becomes idle, we will schedule the next task of the less advanced DAG on this group.
To compute the grouping of resources, we use some of the notations defined in (1):
The optimal repartition of the R processors in groups on which the multiprocessor tasks should be executed can be viewed as an instance of the Knapsack problem with an extra constraint (no more than NS simulations can be executed simultaneously). Given a set of items with a cost and a value it is required to determine the number of each item to include in a collection such that the cost is less than some given cost and the total value is as large as possible.
In our case, there are 8 possible items (groups of 4 to 11 processors). The cost of an item is represented by the number of resources of that grouping. The value of a specific grouping G is given by 1 / T[G], which represents the fraction of a multiprocessor task being executed during a time unit for that specific group of processors. We have n_i unknowns (i from 4 to 11) representing the number of groups with i resources which will be taken in the final solution. The portion of code executed at each time step

has to be maximized knowing that we can not use more than R resources

and that we can not have more than NS groups

Such a linear program is solved quite fast. The n_i are integers and can only be between 0 and NS. Even with NS really higher than 10 (the number of scenarios we want to schedule), the resolution of the program is instantaneous. Furthermore, the grouping given by the linear program is the optimal one, except for the last set of main-tasks.
We need to add

to the makespan obtained for the main tasks to obtain the total makespan.
The scheduling heuristic presented in the previous section is designed for homogeneous platforms. We intend to deploy Ocean-Atmosphere on Grid'5000 so we need to adapt the algorithm to be able to work on heterogeneous platforms. In order to reduce the computation time of NS scenarios, the best way is to divide the set of scenarios into subsets and execute each subset on a different cluster.
Algorithm (1) describes the way the repartition is done between clusters. Input parameters are: n, the number of clusters, and "performance" an array. This array has been initialized by the SeDs (running on each cluster) using the performance evaluation. The performance evaluation fills a vector with the execution time to execute from 1 to NS scenarios. To know the time needed by cluster C_i for X scenarios, we just have to read the value in performance[i, X]. The algorithm behaves as follows: first, the number of scenarios attributed to each cluster is set to 0. Then, each scenario is scheduled on the cluster on which the total makespan increases the less. When all the simulations are scheduled, this scheduling is returned. This algorithm is realistic because the number of simulations (NS) and clusters (n) are quite low in our case.

Algorithm 1
If the number of scenarios or the number of clusters becomes really large, the "performance" array may become very large. The size of the matrix is n x NS. If the client has not enough memory, the algorithm can not run. Another drawback of this algorithm, with a load increase, is that each cluster must compute estimations of the makespan NS times before the algorithm can start. This estimation could take some time. If this case occurs, the repartition should be done by another heuristic. This did not occur during our experiments, so we did not investigate this problem further.
On heterogeneous platforms, the scheduling is done at two different levels. At a local level (the cluster), the resources grouping is chosen, and at a global level (the grid), the number of scenarios to send to each cluster is chosen.
The different steps of the execution on several clusters are the following: (1) the client sends a request to the MA; (2) the MA looks for available SeDs through the DIET hierarchy; (3) each SeD computes its estimation vector; (4) the estimation vector is send back to the MA; (5) the client receives the estimation vectors; (6) it computes the repartition of the scenarios; (7) it sends multiple requests to the SeDs; (8) Finally, each SeD computes its assigned scenarios.
To test the heuristics given in the previous sections, we performed simulations to analyze their efficiency. The comparisons of makespans are done by simulating a real execution. Performance used to compute the makespans come from benchmarks performed on different clusters of Grid'5000. To see an evaluation of the heuristic presented for homogeneous platforms the reader is referred to (1).
To analyze the repartition algorithm, we used the execution time of benchmarks made on 5 clusters of Grid'5000. Each cluster is given the same number of processors. Figure 3 presents the comparison of Algorithm 1, defined in the section about heterogenous scheduling, to a Round Robin.

Figure 3 : Repartition VS Round Robin.
We can see that gains are more and more important when more resources are available. Another point is that there are some steps. In such a case, the same cluster stays the slowest for some time. The gain improves later when it becomes better to map one of the scenarios on a faster cluster. When the number of resources is high enough, the gain stays constant. In the example, the gain is around 25%. This number corresponds exactly to the difference of execution time between the fastest and the slowest cluster when computing one monthly simulation on 11 processors. A 25% gain in the figure corresponds almost to 80 000 seconds (22 hours).
No website yet.
Not distributed.
Fortran 77/90 for the ocean atmosphere application and C for the client and servers
gcc, gfortran
BLAS, LAPACK, LAM-MPI, PSMILE-OASIS, OpenMPI, NetCDF, NCO
Linux
Results and restart files: 120MB per month; Postrpocessing files: 1GB
35 minutes per month for the main task.
This is a sequential application
No.