Start your program like so:
setarch x86_64 -L -B -R ./a.out -_condor_ckpt ckpt
Replace a.out with your program name.
Replace "x86_64" with "i386" if you're on a 32-bit system. The -B flag (which limits memory to 32-bit addresses) may not be necessary in all cases, but the specific circumstances are not known. You can add arguments to the program itself before or after the arguments above.
Replace ckpt with the file name you want your checkpoint written into. You can omit all of the arguments, in which case the checkpoint file is your program name with .ckpt appended.
To checkpoint and exit, send SIGTSTP. To checkpoint and continue, send SIGUSR2.
Checkpoints should be atomic; they are written to ckpt.tmp and moved to ckpt upon completion.
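The write-then-rename pattern described above can be sketched in a few lines. This illustrates the general technique only; it is not HTCondor's checkpointing code, and the function name write_atomically is hypothetical.

```python
import os


def write_atomically(path: str, data: bytes) -> None:
    """Write data to path + ".tmp", then rename it over path."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    # On POSIX, a rename within one filesystem is atomic: readers see
    # either the old complete file or the new complete file.
    os.replace(tmp_path, path)
```

A reader that opens the final file name never sees a half-written checkpoint; an interrupted write leaves only a stale .tmp file behind.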
When checkpointing and exiting, the process will exit with the signal SIGUSR2 (not SIGTSTP, which is what you sent it!).
Resuming a checkpoint
To resume from a checkpoint, do this:
setarch x86_64 -L -B -R ./a.out -_condor_restart ckpt -_condor_relocatable
This is the same command line as before, but -_condor_ckpt is now -_condor_restart and -_condor_relocatable has been added.
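If you launch these from a script, the two command lines can be assembled by one helper. The following is a minimal sketch; build_command and its parameters are hypothetical names, and only the flags shown on this page are used.

```python
from typing import List, Optional


def build_command(program: str,
                  ckpt: Optional[str] = None,
                  restart: bool = False,
                  arch: str = "x86_64",
                  program_args: Optional[List[str]] = None) -> List[str]:
    """Return the argv for a fresh or resumed standalone-checkpointing run."""
    argv = ["setarch", arch, "-L", "-B", "-R", program]
    if restart:
        if ckpt is None:
            # Assumption: this sketch insists on an explicit file when resuming.
            raise ValueError("a checkpoint file is required to restart")
        argv += ["-_condor_restart", ckpt, "-_condor_relocatable"]
    elif ckpt is not None:
        argv += ["-_condor_ckpt", ckpt]
    # With no ckpt given, the program itself defaults to "<program>.ckpt".
    argv += list(program_args or [])
    return argv
```

The resulting list can be passed to subprocess.Popen as is. Use arch="i386" on a 32-bit system.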
Checkpointing in HTCondor's vanilla universe
- Register a signal handler for SIGTERM; it should send SIGTSTP to the running "real" job.
- Register a signal handler for SIGTSTP; it should send SIGTSTP to the running "real" job. (This will probably be unused, but is included to offer an option that is more similar to the standard universe.)
- Register a signal handler for SIGUSR2; it should send SIGUSR2 to the running "real" job. (As of July, 2012, this will never be used, but we anticipate adding support for periodic checkpoints in the vanilla universe.)
- Is the checkpoint file ("my-real-job.ckpt") present? Then the arguments are "-_condor_restart my-real-job.ckpt -_condor_relocatable". Otherwise the arguments are "-_condor_ckpt my-real-job.ckpt".
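The four wrapper steps above can be sketched in Python as follows. This is not the attached Perl wrapper; FORWARD, install_forwarders, and checkpoint_arguments are hypothetical names, and the kill parameter exists only so the forwarding can be exercised without a real job.

```python
import os
import signal
from typing import Callable, Dict, List

# Signal received by the wrapper -> signal forwarded to the "real" job.
FORWARD: Dict[int, int] = {
    signal.SIGTERM: signal.SIGTSTP,  # terminal checkpoint
    signal.SIGTSTP: signal.SIGTSTP,  # terminal checkpoint (standard-universe style)
    signal.SIGUSR2: signal.SIGUSR2,  # periodic checkpoint
}


def install_forwarders(job_pid: int,
                       kill: Callable[[int, int], None] = os.kill) -> None:
    """Register handlers that relay each wrapper signal to the job."""
    for received, forwarded in FORWARD.items():
        # Bind `forwarded` as a default argument; a plain closure would
        # see only the last value of the loop variable.
        signal.signal(
            received,
            lambda signum, frame, sig=forwarded: kill(job_pid, sig),
        )


def checkpoint_arguments(ckpt_path: str) -> List[str]:
    """Restart if a checkpoint file exists, otherwise start fresh."""
    if os.path.exists(ckpt_path):
        return ["-_condor_restart", ckpt_path, "-_condor_relocatable"]
    return ["-_condor_ckpt", ckpt_path]
```

Call install_forwarders only after the job has started and its PID is known, and call it from the main thread, since Python only allows signal handlers to be set there.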
Start your "real" job
- 32-bit: setarch i386 -L -B -R ./my-real-job <arguments>
- 64-bit: setarch x86_64 -L -B -R ./my-real-job <arguments>
- Note the PID, for use in the signal handlers mentioned above.
- Start under setarch to disable address randomization and similar that HTCondor checkpointing cannot cope with. The "-B" (limiting memory to 32 bit addresses) may not be necessary in all cases, but the specific circumstances are not known. The <arguments> are as determined above.
Wait for your "real" job to exit.
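- If the job exited with SIGUSR2 (the signal a checkpoint-and-exit ends with, as noted earlier; not the SIGTSTP that was sent to it), then this is a checkpoint.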
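Telling a checkpoint exit apart from an ordinary exit comes down to inspecting how the child died. The sketch below assumes the job was started with Python's subprocess module, which reports death by signal N as a return code of -N. The function names are hypothetical, and the default signal follows the statement earlier on this page that a checkpoint-and-exit ends with SIGUSR2; pass a different signal if your job behaves otherwise.

```python
import signal
from typing import Optional


def terminating_signal(returncode: int) -> Optional[int]:
    """Return the number of the signal that ended the child, or None."""
    # subprocess reports "killed by signal N" as a return code of -N.
    return -returncode if returncode < 0 else None


def is_checkpoint_exit(returncode: int,
                       checkpoint_signal: int = signal.SIGUSR2) -> bool:
    """True if the child was ended by the checkpoint signal."""
    return terminating_signal(returncode) == checkpoint_signal
```

A typical use is is_checkpoint_exit(proc.wait()) after the job has been started with subprocess.Popen.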
# Terminal checkpoint:
register signal SIGTERM handler: kill -TSTP $PID
# Terminal checkpoint:
register signal SIGTSTP handler: kill -TSTP $PID
# Periodic checkpoint:
register signal SIGUSR2 handler: kill -USR2 $PID

# Modern address randomization and related cause us grief. Disable.
$ARCH = i386 (or x86_64)
$COMPAT = setarch $ARCH -L -B -R

if a CKPT_FILENAME is NOT present
    $COMPAT ./$PROGRAM -_condor_ckpt $CKPT_FILENAME
    note the $PID
else
    $COMPAT ./$PROGRAM -_condor_relocatable -_condor_restart $CKPT_FILENAME
    note the $PID
Working Perl code: See checkpoint-wrapper, an attachment to this page.
A program that has been condor_compiled takes additional command line options. Generally speaking end users shouldn't need to know this; HTCondor will invoke them automatically. But for testing or doing checkpointing without the standard universe, this might be useful. These are officially undocumented and may change without warning!
- -_condor_cmd_fd <FD> - File descriptor that the caller will stream a command file in on. Also enables remote syscalls. Used by the starter. Incompatible with -_condor_cmd_file.
- -_condor_cmd_file <filename> - Identical to -_condor_cmd_fd, except that the command file is read from <filename> instead of an FD. Incompatible with -_condor_cmd_fd.
- -_condor_debug_wait - Upon starting up, the process will enter an infinite loop, allowing you to attach a debugger and break out of the loop from there.
- -_condor_nowarn - Disable notice messages. Obsolete, prefer "-_condor_warning UNSUP OFF"
- -_condor_warning <kind> <mode> - Set warning level.
  - <kind> can be: ALL, NOTICE, READWRITE, UNSUP, BADURL
  - <mode> can be: ON, OFF, ONCE
- -_condor_D_<LEVEL> - Add D_<LEVEL> to the debug levels being logged. For example, "-_condor_D_FULLDEBUG" turns on D_FULLDEBUG.
- -_condor_ckpt <filename> - File to checkpoint into.
- -_condor_restart <filename> - Restart from the checkpoint <filename>. Implies "-_condor_ckpt <filename>". For safety and clarity, specify -_condor_ckpt or -_condor_restart, not both. If you specify both, the <filename> specified in the last one will be used for both.
- -_condor_aggravate_bugs - Engage in behavior that should be safe, but will likely tickle bugs in the checkpointing code. At the moment the code avoids assigning virtual FDs between 2 and 32 inclusive in an attempt to ensure that the virtual FDs and the real FDs aren't the same, revealing bugs where the wrong one is used.
- -_condor_relocatable - Allow the working directory to change on restart, instead of forcing it to be the same as when last checkpointed.
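The interaction between -_condor_ckpt and -_condor_restart described above (restart implies ckpt, and the last filename wins for both) can be modeled like this. It is a sketch of the documented behavior for reasoning about a command line, not the library's actual argument parser; resolve_checkpoint_options is a hypothetical name.

```python
from typing import List, Optional, Tuple


def resolve_checkpoint_options(argv: List[str]) -> Tuple[Optional[str], bool]:
    """Return (checkpoint filename, restart requested) for the given argv."""
    filename: Optional[str] = None
    restart = False
    i = 0
    while i < len(argv):
        is_option = argv[i] in ("-_condor_ckpt", "-_condor_restart")
        if is_option and i + 1 < len(argv):
            filename = argv[i + 1]  # the last occurrence wins for both options
            restart = restart or argv[i] == "-_condor_restart"
            i += 2  # skip the option and its filename
        else:
            i += 1
    return filename, restart
```

For instance, passing both options with different filenames yields a restart that uses whichever filename came last, which is why specifying only one of them is the safer habit.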
You can pass a command file in through a file descriptor (-_condor_cmd_fd) or a file (-_condor_cmd_file). The command file can set several options, and also enables remote syscalls simply by being specified.
Commands in the file are one per line. The recognized commands are
- iwd <pathname> - Change working directory to <pathname>. This is intended to get the process into the working directory specified in the user's job description file.
- fd <n> <pathname> <open_mode> - Open the file <pathname> with the mode <open_mode>. Set things up (using dup2()), so that the file is available at file descriptor number <n>. This is intended for redirection of the standard files 0, 1, and 2.
- ckpt <pathname> - The process should write its state information to the file <pathname> so that it can be restarted at a later time. We don't actually do a checkpoint here; we just set things up so that when we checkpoint, the given file name will be used. The actual checkpoint is triggered by receipt of the signal SIGTSTP, or by the user code calling the ckpt() routine.
- restart <pathname> - The process should read its state information from the file <pathname> and do a restart. (In this case, main() will not get called, as we are jumping back to wherever the process left off at checkpoint time.)
- migrate_to <host_name> <port_number> - A process on host <host_name> is listening at <port_number> for a TCP connection. This process should connect to the given port, and write its state information onto the TCP connection exactly as it would for a ckpt command. It is intended that the other process is running the same a.out file that this process is, and will use the state information to effect a restart on the new machine (a migration).
- migrate_from <fd> - This process's parent (the condor_starter) has passed it an open file descriptor <fd> which is in reality a TCP socket. The remote process will write its state information onto the TCP socket, which this process will use to effect a restart. Since the restart is on a different machine, this is a migration.
- exit <status> - The process should exit now with status <status>. This is intended to be issued after a "ckpt" or "migrate_to" command. We don't want to assume that the process should always exit after one of these commands because we want the flexibility to create interim checkpoints. (It's not clear why we would want to send a copy of a process to another machine and continue running on the current machine, but we have that flexibility too if we want it...)
- end - We are temporarily at the end of commands. Now it is time to call main() or effect the restart as requested. Note that the process should not close the file descriptor at this point. Instead a signal handler will be set up to handle the signal SIGTSTP. The signal handler will invoke the command stream interpreter again. This is done so that the condor_starter can send a "ckpt" or a "migrate_to" command to the process after it has been running for some time. In the case of the "ckpt" command the name of the file could have been known in advance, but in the case of the "migrate_to" command the name and port number of the remote host are presumably not known until the migration becomes necessary.
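For testing, a command file in the format above can be generated rather than written by hand. The sketch below assumes that arguments are separated by single spaces and need no quoting, which this page does not confirm. render_command_file is a hypothetical name, and <open_mode> is passed through as an opaque string because its accepted values are not listed here.

```python
from typing import Iterable, List, Tuple

# Command name -> number of arguments, as listed above.
ARG_COUNTS = {
    "iwd": 1, "fd": 3, "ckpt": 1, "restart": 1,
    "migrate_to": 2, "migrate_from": 1, "exit": 1,
}


def render_command_file(commands: Iterable[Tuple[str, ...]]) -> str:
    """Render (name, arg, ...) tuples one per line, terminated by "end"."""
    lines: List[str] = []
    for name, *args in commands:
        if name not in ARG_COUNTS:
            raise ValueError(f"unknown command: {name}")
        if len(args) != ARG_COUNTS[name]:
            raise ValueError(f"{name} takes {ARG_COUNTS[name]} argument(s)")
        lines.append(" ".join([name, *args]))
    lines.append("end")  # hand control back to main() or the restart
    return "\n".join(lines) + "\n"
```

The text can be saved to a file and passed with -_condor_cmd_file. Keep in mind that specifying a command file also enables remote syscalls.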
- checkpoint-wrapper 4994 bytes added by adesmet on 2012-Nov-16 20:01:39 UTC.
Perl implementation of wrapper implementing Condor Syscall library checkpointing in the vanilla universe.