Unlike kernel-level checkpointing support, user-level checkpointing support
does not require additional code to be installed into the kernel, and it
therefore works cross-platform.
However, user-level checkpointing depends on either the process itself or its
linked-in libraries to handle the checkpoint migration. This means that a
process must either be written specifically to support checkpointing or be
linked against special libraries that perform the checkpointing.
Currently, only the first approach is available: users wishing to
checkpoint-migrate their jobs must modify their code to support migration.
At some point in the future, freely available libraries with source code may
be provided to the GNU Queue community to support user-level checkpoint
migration.
When the time comes to move a job from one host to another, GNU Queue sends
the process a SIGUSR2 signal. The process is expected to trap this signal,
save its state, and exit gracefully. Later, GNU Queue restarts the job
on another host, passing the process an additional command-line option,
the restartflag. The process is expected to recognize this flag and
recover its state from the restart file it previously wrote out.
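
The protocol above can be sketched from the job's side as follows. This is a minimal illustration, not code from GNU Queue: the flag string `--restart`, the file name `job.ckpt`, the JSON state format, and all function names are assumptions made for the example. It also assumes the restart file lives on a filesystem visible from both hosts. The handler only records the request; the main loop saves the state at a safe point and returns so the process can exit.

```python
import json
import os
import signal

# Hypothetical names, chosen only for this sketch.
RESTART_FLAG = "--restart"   # must match the restartflag value in the queue profile
STATE_FILE = "job.ckpt"      # restart file written by the job itself

flags = {"migrate": False}


def on_sigusr2(signum, frame):
    """Signal handler: only record the request; the main loop does the saving."""
    flags["migrate"] = True


def save_state(path, state):
    """Write to a temporary file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path):
    """Read back the state written by save_state."""
    with open(path) as f:
        return json.load(f)


def run(argv, total_steps, state_file=STATE_FILE):
    """Sum 0..total_steps-1, resuming from the restart file if the flag is present."""
    if RESTART_FLAG in argv:
        state = load_state(state_file)
    else:
        state = {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if flags["migrate"]:
            save_state(state_file, state)
            return state  # caller exits; the job is restarted on another host
        state["acc"] += state["step"]
        state["step"] += 1
    return state


def main(argv):
    """Install the SIGUSR2 handler, run the job, and print the result if it finished."""
    signal.signal(signal.SIGUSR2, on_sigusr2)  # trap the migration signal
    total = 10_000_000
    state = run(argv, total_steps=total)
    if state["step"] == total:
        print(state["acc"])
    return 0
```

A real job would finish with `import sys; sys.exit(main(sys.argv[1:]))`. Saving from the main loop rather than inside the handler keeps the handler trivial and ensures the state is written only between complete steps.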
To enable user-level checkpoint migration for a given job queue, be sure
you are using GNU Queue 1.20.1 or higher, compiled with the ENABLE_CHECKPOINT
compile option set. (This is the default in the development release.) Set the
following options in the `profile' file of the job queue that is to run the
jobs modified for checkpoint migration:
    checkpointmode 2
    restartmode 1
    loadcheckpoint value
    restartflag command_line_option
The checkpointmode 2 setting selects the user-level API, and
restartmode 1 enables this host to accept incoming migrated jobs.
loadcheckpoint is the load average at which jobs may be
migrated out (provided another host is willing to accept them). It should be
set to an integer greater than or equal to loadsched,
the load average at which the job queue refuses to accept new jobs, but less
than loadstop, the load average at which all jobs in the queue are
suspended.
The restartflag option specifies the additional command-line option
that will be passed to the job on restart.
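
The constraint on loadcheckpoint can be checked mechanically. The sketch below is only an illustration: it assumes that loadsched and loadstop appear in the same profile in the same "name value" layout as the options listed above, and the numeric values, the `--restart` flag, and the function names are hypothetical.

```python
def parse_profile(text):
    """Parse 'name value' lines into a dict (assumed layout, see lead-in)."""
    options = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            options[parts[0]] = parts[1].strip()
    return options


def loadcheckpoint_ok(options):
    """True when loadcheckpoint is an integer with loadsched <= loadcheckpoint < loadstop."""
    try:
        ckpt = int(options["loadcheckpoint"])
        sched = float(options["loadsched"])
        stop = float(options["loadstop"])
    except (KeyError, ValueError):
        return False
    return sched <= ckpt < stop


# Hypothetical sample values for illustration only.
SAMPLE = """\
loadsched 2
loadstop 5
checkpointmode 2
restartmode 1
loadcheckpoint 3
restartflag --restart
"""

assert loadcheckpoint_ok(parse_profile(SAMPLE))
```

With these sample values, jobs stop being accepted at load 2, become eligible to migrate out at load 3, and are suspended at load 5.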
werner.krebs@yale.edu