How do I set up user-level checkpointing?

Unlike kernel-level checkpointing support, user-level checkpointing support does not require additional code to be installed into the kernel, and it therefore works cross-platform.

However, user-level checkpointing depends on either the process or its linked-in libraries to handle the checkpoint migration. This means that processes must either be written specifically for checkpointing (currently the only option) or be linked to special libraries that perform the checkpointing.

At some point in the future, free libraries with source code may be made available to the GNU Queue community to support user-level checkpoint migration. Until then, users who want to checkpoint-migrate their jobs must modify their code to support migration.

When the time comes to move from one host to another, GNU Queue will send the process a SIGUSR2 signal. The process is expected to trap this signal, save its state, and die gracefully. Later, GNU Queue will restart the job on another host, giving the process an additional command line option, the restartflag. The process is expected to see this flag and recover its state from the restartfile it previously wrote out.
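The job's side of this handshake can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from GNU Queue: the "state" is a single counter, and the function names and the restart file path passed in are hypothetical. The handler only sets a flag; the work loop notices the flag, writes the state out, and returns so the caller can exit cleanly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler. volatile sig_atomic_t is the type that is safe
   to write from inside a signal handler. */
static volatile sig_atomic_t checkpoint_requested = 0;

static void on_sigusr2(int signo)
{
    (void)signo;
    checkpoint_requested = 1;   /* do the real work outside the handler */
}

/* Trap SIGUSR2. Returns 0 on success, -1 on failure. */
static int install_checkpoint_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Write the job state (here just one counter) to the restart file.
   The file name and format are assumptions made for this sketch. */
static int save_state(const char *path, long counter)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fprintf(f, "%ld\n", counter) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Work from start up to limit. If a checkpoint was requested, save the
   current position and stop early.
   Returns 0 if the job finished, 1 if it was checkpointed,
   -1 if the state could not be saved. */
static int run_job(long start, long limit, const char *path)
{
    for (long i = start; i < limit; i++) {
        if (checkpoint_requested)
            return save_state(path, i) == 0 ? 1 : -1;
        /* ... one unit of real work would go here ... */
    }
    return 0;
}
```

A main() built on this would call install_checkpoint_handler() once at startup, then run_job(), and exit normally whatever run_job() returns, since the source only asks the process to save its state and die gracefully.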

To enable user-level checkpoint migration for a given job queue, be sure you are using GNU Queue 1.20.1 or higher, compiled with the ENABLE_CHECKPOINT compile option set. (This is the default in the development release.) Set the following options in the `profile' file for the job queue that is to run the jobs modified for checkpoint migration:

checkpointmode 2

restartmode 1

loadcheckpoint value

restartflag command_line_option

Setting checkpointmode to 2 selects the user-level API, and setting restartmode to 1 enables this host to accept incoming migrators. loadcheckpoint is the load average at which jobs may be migrated out (provided another host is willing to accept them). It should be set to an integer greater than or equal to loadsched, the load average at which the job queue refuses to accept new jobs, but less than loadstop, the load average at which all jobs in the queue are suspended.
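The constraint on loadcheckpoint amounts to loadsched <= loadcheckpoint < loadstop. A small C check makes the boundaries explicit. The function name and the threshold values used in the example are hypothetical, chosen only to illustrate the rule; GNU Queue itself is not claimed to perform this check.

```c
#include <assert.h>

/* Returns 1 when loadsched <= loadcheckpoint < loadstop, 0 otherwise. */
static int loadcheckpoint_is_valid(int loadsched, int loadcheckpoint,
                                   int loadstop)
{
    return loadsched <= loadcheckpoint && loadcheckpoint < loadstop;
}

/* Worked example with made-up thresholds: loadsched 2, loadstop 5. */
static void example_checks(void)
{
    assert(loadcheckpoint_is_valid(2, 2, 5));   /* equal to loadsched: allowed */
    assert(loadcheckpoint_is_valid(2, 4, 5));   /* strictly below loadstop: allowed */
    assert(!loadcheckpoint_is_valid(2, 5, 5));  /* equal to loadstop: too high */
    assert(!loadcheckpoint_is_valid(2, 1, 5));  /* below loadsched: too low */
}
```

With those thresholds, the acceptable loadcheckpoint values are 2, 3, and 4.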

The restartflag option specifies the additional command-line option that will be passed to the job on restart.
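On the job's side, recognizing the restart flag might look like the C sketch below. It assumes the flag can appear anywhere among the arguments (the position is not specified here), that the saved state is a single counter in a text file, and that the job knows its own restart file path. The function names, and the flag string and path the caller passes in, are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if any argument after argv[0] equals flag exactly. */
static int has_restart_flag(int argc, char *const argv[], const char *flag)
{
    assert(flag != NULL);
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], flag) == 0)
            return 1;
    return 0;
}

/* Read the saved counter from the restart file.
   Returns 0 on success, -1 if the file is missing or unreadable. */
static int load_state(const char *path, long *counter)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", counter) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Decide where the job should begin: at the saved position when the
   restart flag is present and the file is usable, otherwise at 0. */
static long starting_counter(int argc, char *const argv[],
                             const char *flag, const char *path)
{
    long counter = 0;
    if (has_restart_flag(argc, argv, flag) && load_state(path, &counter) != 0)
        counter = 0;   /* no usable restart file: start from scratch */
    return counter;
}
```

Falling back to a fresh start when the restart file is unusable is a design choice made for this sketch; a job whose partial results must not be recomputed may prefer to exit with an error instead.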
werner.krebs@yale.edu

This document is: http://bioinfo.mbb.yale.edu:80/cgi-bin/fom?file=44