Cloud Duplication strategy for XWHEP

Simon Delamare - INRIA JRA2

Purpose of the document
=======================

This document describes the Cloud Duplication strategy in SpeQuloS and how it is implemented in XtremWeb-HEP. It is intended to help SpeQuloS administrators use the Cloud Duplication strategy in their deployments.

1. Cloud Duplication Strategy Principles
========================================

Cloud Duplication is a strategy used by SpeQuloS to deploy Cloud resources to support batch execution. Compared with the default ("Flat") strategy, Cloud Duplication is more advanced and more efficient.

With the Flat strategy, Cloud workers started by SpeQuloS to support a batch connect to the Desktop Grid (DG) server where the batch is executed, and ask for uncompleted jobs. In this strategy, Cloud workers are not treated differently from other workers and are not given priority by the DG scheduler. As a consequence, if all jobs are already assigned to regular workers, the DG scheduler will not assign any job to Cloud workers. This may waste Cloud resources and lead to a poor QoS level if regular workers are slow or fail.

To avoid this problem and obtain the full QoS benefits, two strategies may be considered:

- Rescheduling of the remaining jobs to Cloud workers. This requires modifying the DG scheduler so that Cloud workers are treated specially.
- Cloud Duplication.

The principle of the Cloud Duplication strategy is to duplicate uncompleted jobs to a new DG server (called the Cloud server) that only serves Cloud workers. When Cloud workers complete jobs, the results are copied back from the Cloud server to the original DG server. This strategy ensures that Cloud workers do not compete with regular workers and keep receiving jobs to compute until all jobs are completed.

When QoS is triggered, the Cloud Duplication strategy proceeds through the following steps:

- The Cloud server is started if necessary.
- All uncompleted jobs (waiting or under execution by regular workers) from the DG server are copied to the Cloud server.
- Cloud workers are started and connect to the Cloud server to process the copied jobs (jobs that were waiting on the original DG server are processed first).
- Periodically, all completed jobs on the Cloud server are copied back to the original DG server with their results.
- When all jobs on the Cloud server or on the DG server are completed, the Cloud workers and the Cloud server are shut down.

2. Cloud Duplication Strategy for XWHEP
=======================================

The Cloud Duplication strategy is implemented for the XtremWeb-HEP (XWHEP) middleware.

The Cloud Duplication strategy requires a dedicated XWHEP server for Cloud workers to connect to. However, only a single server is needed to handle the Cloud duplication of several batches, even when they come from different original DGs. This is achieved using the XWHEP batchid option in the Cloud worker configuration file, which ensures that a worker only fetches jobs belonging to the indicated batch.

In addition, an XWHEP client must be installed on the server to manage it, as well as the "dg2cloud.sh" and "cloud2dg.sh" scripts. These scripts are used, respectively, to copy uncompleted jobs from a Desktop Grid to the local Cloud server and to copy the jobs completed on the Cloud server back to the original Desktop Grid. The scripts use the XWHEP client, whose role is to get and submit applications, jobs and results.

Both scripts are run every 15 minutes using cron, to ensure that jobs newly submitted to the DG are copied to the Cloud server, and that jobs completed on the Cloud server are quickly copied back to the DG.

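As an illustration of the periodic execution described above, the following sketch prints crontab entries that run both scripts every 15 minutes for one batch. The function name cron_entries, the way arguments are passed to the scripts on the command line, and the example paths and batch identifier are assumptions made for this example; they are not taken from the SpeQuloS package.

```shell
#!/bin/sh
# Sketch only: print crontab entries running both scripts every 15 minutes.
# cron_entries and the argument layout of the scripts are assumptions.
cron_entries() {
    script_dir=$1    # directory holding dg2cloud.sh and cloud2dg.sh
    batch_id=$2      # batch (XWHEP group) identifier
    dg_client=$3     # path to the XWHEP client of the original DG
    cloud_client=$4  # path to the XWHEP client of the Cloud server
    for script in dg2cloud.sh cloud2dg.sh; do
        # "*/15" in the minute field means: every 15 minutes
        printf '*/15 * * * * %s/%s %s %s %s\n' \
            "$script_dir" "$script" "$batch_id" "$dg_client" "$cloud_client"
    done
}

# Example with hypothetical paths and batch identifier
cron_entries /root batch-1234 /tmp/xwhep-client-batch-1234 /root/cloud-client
```

The printed lines are meant to be reviewed before being installed in the crontab of the Cloud server.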
The algorithms implemented by the scripts are as follows.

dg2cloud.sh
-----------

    Parameters: BATCH_ID, DG_CLIENT, CLOUD_CLIENT
    Begin:
      Jobs <- DG_CLIENT.get_uncompleted_jobs_from_group(BATCH_ID)
      If not CLOUD_CLIENT.get_groups(BATCH_ID):
        CLOUD_CLIENT.create_group(BATCH_ID)
      If not CLOUD_CLIENT.get_apps(Jobs[0].get_app()):
        App <- DG_CLIENT.download_app(Jobs[0].get_app())
        CLOUD_CLIENT.upload_app(App)
      For J in Jobs:
        If not CLOUD_CLIENT.get_jobs(J):
          CLOUD_CLIENT.submit_job(J,App)
    End

cloud2dg.sh
-----------

    Parameters: BATCH_ID, DG_CLIENT, CLOUD_CLIENT
    Begin:
      Jobs <- CLOUD_CLIENT.get_completed_jobs_from_group(BATCH_ID)
      For J in Jobs:
        If not DG_CLIENT.is_completed(J):
          R <- CLOUD_CLIENT.get_result(J)
          DG_CLIENT.upload_result(R)
          DG_CLIENT.update_job(J,status="COMPLETED",result=R)
    End

To keep track of the relationship between an object (job, application or group) on the original Desktop Grid and its copy on the Cloud server, the same XWHEP object UID is used for both objects. This is implemented using the XML description of an XWHEP object, in which the object UID can be specified. For instance, to copy a job from the Desktop Grid server to the Cloud server while preserving the job UID, the following shell commands are used:

    DG_CLIENT_PATH/xwworks --xwformat xml $WORK_UID > job.xml
    CLOUD_CLIENT_PATH/xwsubmit --xwxml job.xml

The XML description is also used to update the status of a Desktop Grid job once it is completed on the Cloud server. The job XML description is downloaded from the DG server, modified to set its status to "COMPLETED" and to link to the results previously uploaded, and finally uploaded to the DG server.

The following shell commands perform these operations:

    # Uploading results to DG
    RESULT_UID=`$DG_CLIENT_PATH/xwsenddata $RES_FILE | grep ^xw`

    # Downloading job description from DG
    $DG_CLIENT_PATH/xwworks --xwformat xml $JOB_UID | grep ' dg_job.xml

    # Updating job description
    # (the loop and the final newline are grouped so that both are written to dgjob.xml)
    { for i in `cat dg_job.xml`; do
        echo -n $i" " | sed 's,status=\".*\",status="COMPLETED",g' \
          | sed 's,resulturi=\".*\",resulturi="'$RESULT_UID'",g' \
          | sed 's,error_msg=\".*\",,g' \
          | sed 's,/>,error_msg="Computed by SpeQuloS using Cloud Duplication" />,g'
      done; echo ""; } > dgjob.xml

    # Uploading job description on DG
    $DG_CLIENT_PATH/xwsendwork --xwxml dgjob.xml

3. Using Cloud Duplication Strategy in SpeQuloS
===============================================

SpeQuloS can use the Cloud Duplication strategy to support batches executed on XWHEP Desktop Grids.

The Cloud Duplication strategy is triggered by the SpeQuloS scheduler, for each batch supported by QoS, when Cloud resources are used. Cloud Duplication starts at the "configure_QoS" step, i.e. just before the scheduler starts Cloud workers to support a batch for the first time. Cloud Duplication ends at the "unconfigure_QoS" step, i.e. when Cloud resources stop supporting a batch (because the batch is completed or the provisioned credits are spent).

To enable Cloud Duplication in SpeQuloS, the SpeQuloS administrator must verify that the variable CLOUD_DEPLOYMENT["XWHEP"] inside the scheduler/dg_rpc.py file is set to "DUPLICATION".

Then, the administrator must edit the configuration files of the Desktop Grids registered with the scheduler. SpeQuloS executes the command given by the CMD_DUPLICATION_START variable at the "configure_QoS" step and the command given by CMD_DUPLICATION_STOP at the "unconfigure_QoS" step.

The following environment variables are set and accessible to these commands:

- $SQS_BATCH_ID: The identifier of the batch supported by SpeQuloS.
- $SQS_DG_ID: The identifier of the Desktop Grid where the batch is executed.
- $SQS_DG_CONF: The path to the Desktop Grid configuration file.

Whichever Cloud handler is used (libcloud or command), the parameters that define the Cloud workers' actions in the DG configuration file must also be adapted so that Cloud workers process jobs from the Cloud server (and not from the original DG server).

By setting the CMD_DUPLICATION_START and CMD_DUPLICATION_STOP variables, the administrator is free to implement their own duplication strategy. However, as this is a complex operation, a reference implementation is provided in the SpeQuloS package. It must be adapted to the infrastructure where SpeQuloS is deployed.

The reference implementation can be found in the SpeQuloS package, inside the DG-plugins/XWHEP/duplication/ directory. The directory contains:

- dg2cloud.sh and cloud2dg.sh, the scripts used to copy and merge jobs between the DG and Cloud servers, as described above.
- start_cloud_duplication.sh and stop_cloud_duplication.sh, the scripts used by SpeQuloS to trigger Cloud Duplication.

The reference implementation assumes that an XWHEP server exists to be used as the "Cloud server". It requires that this server is fully dedicated to Cloud duplication and runs permanently. The SpeQuloS administrator must have SSH access to the machine using a public/private key pair that does not require a password. An XWHEP client with "SUPER_USER" rights on the server must also be installed on it.

The files dg2cloud.sh and cloud2dg.sh must be copied onto the Cloud server.

The files start_cloud_duplication.sh and stop_cloud_duplication.sh must be copied onto the machine executing the SpeQuloS Scheduler module.

To use Cloud duplication, an XWHEP client for each XWHEP DG managed by SpeQuloS must also exist on the machine executing the SpeQuloS Scheduler module.

The start_cloud_duplication.sh script takes two arguments: the batch identifier (i.e., the XWHEP group UID) involved in Cloud Duplication, and the path, on the SpeQuloS scheduler host, to the XWHEP client for the XWHEP server where the batch is executed. The stop_cloud_duplication.sh script takes the batch identifier as its only argument.

The scripts start_cloud_duplication.sh and stop_cloud_duplication.sh must be edited to provide the Cloud server information. At the beginning of each file, the following variables must be set:

- CLOUD_SERVER_SSH: The SSH address of the Cloud server. For instance: "root@cloud-server"
- SSH_OPTS: The SSH options that ensure a non-interactive login (without prompting for a password). For instance: "-i ~/.ssh/id_rsa -o StrictHostKeyChecking=no"
- SCRIPT_DIR: The directory where dg2cloud.sh and cloud2dg.sh can be found on the Cloud server. For instance: "/root"
- CLOUD_CLIENT_DIR: The directory where the XWHEP client for the Cloud server is installed on the Cloud server. For instance: "/root/cloud-client"
- DG_CLIENT_DIR: The directory where the original DG client will be copied on the Cloud server. Should be "/tmp/xwhep-client-$BATCH_ID"

When executed, the start_cloud_duplication.sh script copies the XWHEP client of the original DG to the Cloud server, and creates a new cron task on the Cloud server that periodically executes dg2cloud.sh and cloud2dg.sh with the appropriate "BATCH_ID", "DG_CLIENT" and "CLOUD_CLIENT" arguments. The stop_cloud_duplication.sh script only removes the DG client and the cron task from the Cloud server.

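The start and stop scripts respectively create and remove a cron task on the Cloud server. One possible way to make both operations safe to repeat is to tag each entry with the batch identifier and to filter the crontab text. The sketch below works on crontab text passed on standard input, so it can be tried without touching a real crontab. The function names and the "# spequlos-batch:[...]" tag are hypothetical and do not come from the reference implementation.

```shell
#!/bin/sh
# Sketch only: add or remove the cron entry of one batch in crontab text.
# The tag format and the function names are hypothetical.

# Print standard input without the entry tagged for the given batch.
remove_batch_entries() {
    # -F matches the tag literally; "|| true" hides grep's non-zero
    # status when no line is left to print
    grep -v -F "# spequlos-batch:[$1]" || true
}

# Print standard input (minus any old entry of this batch), then a fresh
# entry. One entry is kept per batch, so the command may chain both scripts.
add_batch_entry() {
    batch_id=$1
    command=$2
    remove_batch_entries "$batch_id"
    printf '*/15 * * * * %s # spequlos-batch:[%s]\n' "$command" "$batch_id"
}

# Example on crontab text given on standard input (hypothetical values)
printf '0 3 * * * /usr/bin/backup\n' \
    | add_batch_entry b1 '/root/dg2cloud.sh b1; /root/cloud2dg.sh b1'
```

On systems whose crontab command accepts "-" for standard input, such a filter could be placed between "crontab -l" and "crontab -"; whether the reference scripts proceed this way is not stated here.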
When this reference implementation is used, the DG configuration file of the SpeQuloS scheduler should look like the following:

    DG_TYPE=XWHEP
    DG_PLUGIN_URL="http://spequlos:edgi@xw.server.fr:4330/XWHEP/spequlos/"
    CW_SSH_CMD="apt-get update > /dev/null; apt-get -y install openjdk-6-jre > /dev/null; wget http://cloud.server.eu/XWHEP/download/xwhep-worker.deb 2>/dev/null; dpkg -i xwhep-worker.deb; sed -i -e's/batchid=.*//' /opt/xwhep-worker/conf/xtremweb.worker.conf; echo batchid=xw://cloud.server.eu/$SQS_BATCH_ID >> /opt/xwhep-worker/conf/xtremweb.worker.conf; /etc/init.d/xtremweb.worker restart > /dev/null 2>&1 &"
    CMD_DUPLICATION_START="/root/spequlos/DG-plugins/XWHEP/duplication/start_cloud_duplication.sh $SQS_BATCH_ID /root/xwlriclient"
    CMD_DUPLICATION_STOP="/root/spequlos/DG-plugins/XWHEP/duplication/stop_cloud_duplication.sh $SQS_BATCH_ID"
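
The CW_SSH_CMD value above clears any existing batchid setting in the worker configuration and appends a new one before restarting the worker. These two steps can be sketched as a small function that can be tried on a scratch file. The name set_batchid is hypothetical, and the example edits a temporary file rather than the real worker configuration. Unlike the sed command above, which blanks the matching text, this sketch deletes the lines that start with "batchid=".

```shell
#!/bin/sh
# Sketch only: point a worker configuration file at one batch on the Cloud server.
# set_batchid is a hypothetical helper, not part of XWHEP or SpeQuloS.
set_batchid() {
    conf_file=$1     # worker configuration file to edit
    cloud_server=$2  # host name of the Cloud server
    batch_id=$3      # batch (XWHEP group) identifier
    tmp_file=$(mktemp) || return 1
    # Drop every previous batchid line, keep all other settings untouched
    grep -v '^batchid=' "$conf_file" > "$tmp_file"
    # Append the new setting, using the URI form shown in CW_SSH_CMD
    printf 'batchid=xw://%s/%s\n' "$cloud_server" "$batch_id" >> "$tmp_file"
    # Overwrite in place so that the permissions of the original file are kept
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Example on a scratch file (hypothetical content and batch identifier)
conf=$(mktemp)
printf 'some.option=value\nbatchid=xw://old.server/old-batch\n' > "$conf"
set_batchid "$conf" cloud.server.eu batch-1234
cat "$conf"
rm -f "$conf"
```

Running the function twice with the same arguments leaves a single batchid line, which is the property the sed step in CW_SSH_CMD is there to provide.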