Data Management in DIET


Big Data Management

GridRPC environments such as NetSolve, Ninf, and DIET are based on the client-server programming paradigm. However, this paradigm generally performs no data management: as in the standard RPC model, request parameters (input and output data) are sent back and forth between the client and the remote server. Once a step of the algorithm is finished, its data are not kept available on the server for the next step (a new RPC). This drawback can lead to extra overhead caused by useless communications over the network.
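The overhead argument can be made concrete with a toy cost model (not part of any GridRPC API; the function names are ours). Each step of a chained computation consumes and produces a datum of the same size; without data persistence, every intermediate result travels server-to-client and back again for the next call.

```c
#include <stddef.h>
#include <assert.h>

/* Toy cost model, for illustration only: `steps` chained RPCs, each moving
 * one datum of `size` bytes in and one out. */

/* No persistence: every step ships its input up and its output down. */
size_t bytes_without_persistence(size_t size, int steps)
{
    return (size_t)steps * 2 * size;
}

/* With persistence: only the first input and the last output cross the
 * network; intermediate results stay on the server. */
size_t bytes_with_persistence(size_t size, int steps)
{
    (void)steps; /* the chain length no longer matters */
    return 2 * size;
}
```

For a three-step chain on a 100-byte datum, the naive model moves 600 bytes while the persistent one moves 200, and the gap grows linearly with the number of steps.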

This problem has been identified by the NetSolve and Ninf projects as a major source of performance loss, and NetSolve has proposed several ways to keep data in place. The first approach is called request sequencing. It consists in scheduling a sequence of NetSolve calls on one server: the sequence of requests written between the two delimiters <netsl_sequence_begin> and <netsl_sequence_end> is analyzed and a dataflow graph is computed, which allows useless data transfers to be avoided. However, this feature is only available on a single server, without redistribution between servers.

Another approach is the Distributed Storage Infrastructure (DSI). The DSI helps the user control the placement of data that will be accessed by a server. Instead of transmitting the same data several times, DSI transfers the data once from the client to a storage server; a data handle is then used at the request level, so DSI acts as a data cache. One instance of the DSI is based on IBP (Internet Backplane Protocol, http://loci.cs.utk.edu/). This approach is interesting but is not connected to the choice of computational servers.

A last optimization allows the redistribution of data between servers and the persistence of data: a new API lets a client manage its data locally and remotely between request calls. Ninf offers similar solutions built on other data management systems.
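The dataflow analysis behind request sequencing can be sketched as follows. This is a simplification we wrote for illustration, not NetSolve code: each call in a sequence produces one output, and the analysis marks which outputs are consumed only by later calls in the same sequence, since those never have to leave the server.

```c
#include <assert.h>

/* consumer[i] >= 0 means the output of call i feeds call consumer[i] inside
 * the same sequence; -1 means the client itself needs that result.
 * Returns the number of client transfers the dataflow analysis avoids:
 * every result consumed only inside the sequence stays server-side. */
int transfers_avoided(const int *consumer, int ncalls)
{
    int saved = 0;
    for (int i = 0; i < ncalls; i++)
        if (consumer[i] >= 0)
            saved++;
    return saved;
}
```

For a three-call sequence where call 0 feeds call 1, call 1 feeds call 2, and only the final result returns to the client, two intermediate transfers are eliminated.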

Usually, such environments have five different components: clients that submit problems to servers; servers that solve the problems sent by clients; a database that contains information about software and hardware resources; a scheduler that chooses an appropriate server depending on the problem submitted and the information contained in the database; and finally monitors that gather information about the status of the computational resources. In DIET, a server is built upon a Server Daemon (SeD), and scheduling is handled by a hierarchical set of agents comprising Master Agents (MA) and Local Agents (LA). We now detail the basic functionality of these different components.
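The scheduler's role of matching a request against the resource database can be sketched with a minimal toy selector (our own illustration, with hypothetical names; the real DIET agents use a richer performance model and a CORBA hierarchy): among the servers that offer the requested service, pick the least loaded one.

```c
#include <string.h>
#include <assert.h>

/* A hypothetical resource-database entry: one service offered by one server,
 * with its current load as reported by the monitors. */
struct server {
    const char *service;
    double load; /* lower is better */
};

/* Return the index of the least-loaded server offering `service`,
 * or -1 if no server in the database offers it. */
int choose_server(const struct server *db, int n, const char *service)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (strcmp(db[i].service, service) != 0)
            continue;
        if (best < 0 || db[i].load < db[best].load)
            best = i;
    }
    return best;
}
```

In DIET this decision is distributed: each SeD estimates its own fitness for a request, and the agent hierarchy aggregates those estimates instead of consulting one central database.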