Experimental Support For Periodic Checkpointing In Vanilla Universe

This documentation is for EXPERIMENTAL features.

Both the feature and the documentation are subject to change without notice.

Introduction

The vanilla universe can now periodically send a signal to the user process. If the process exits successfully after receiving this signal, intermediate file transfer will occur, in order to ensure that progress to that point has been preserved.

Presently, "success" is defined to be "exit with code 0"; this will change in a future revision.

In 8.5.3 (and until 8.7.3), "success" will be defined by three job ad attributes: CheckpointExitBySignal, CheckpointExitSignal, and CheckpointExitCode. If the first is true, then "success" occurs when the process exits on signal CheckpointExitSignal. Otherwise, "success" occurs when the process exits with code CheckpointExitCode. (As this is an experimental feature, you'll need to prepend a '+' to each of the preceding attributes when you use them in a submit file.)

In 8.7.3 (and later), "success" will be defined in the same way, but the names of the attributes change to SuccessCheckpointExitBySignal, SuccessCheckpointExitSignal, and SuccessCheckpointExitCode, respectively. You will still need to prepend a '+'.
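The rule above can be sketched as a small Python predicate. This is an illustration of the description only, not HTCondor's code: the job ad is modeled as a plain dict, the function name and parameters are hypothetical, and the fallback to exit code 0 when no attribute is set is an assumption.

```python
def is_checkpoint_success(job_ad, exited_by_signal, value, prefix=""):
    """Return True if the exit counts as a successful checkpoint.

    job_ad: dict standing in for the job ad (illustrative only).
    exited_by_signal: True if the process died on a signal.
    value: the signal number or the exit code, respectively.
    prefix: "" for the 8.5.3 names, "Success" for the 8.7.3 names.
    """
    by_signal = job_ad.get(prefix + "CheckpointExitBySignal", False)
    if by_signal:
        return exited_by_signal and value == job_ad.get(prefix + "CheckpointExitSignal")
    # Assumption: with no attribute set, fall back to exit code 0.
    return (not exited_by_signal) and value == job_ad.get(prefix + "CheckpointExitCode", 0)
```

For example, is_checkpoint_success({"SuccessCheckpointExitCode": 77}, False, 77, prefix="Success") evaluates to True, while the same call with an exit code of 0 evaluates to False.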

In 8.5.3, you may also set '+WantFTOnCheckpoint = TRUE'. If, during the course of normal execution, the job exits with "success" as defined above, file transfer will occur as if the job were being evicted.

Presently, intermediate file transfer is defined to be identical to file transfer on eviction. This means that specifying a transfer_output_files list that doesn't include the checkpoint files will break things: the job will restart from scratch if rescheduled on another machine. It also implies that you'll only want to use this feature on a small scale for now. The checkpoints (which may include the entire sandbox) are uploaded to the schedd's spool directory, which is probably unwise in terms of both space and time.

Configuration

Configure the startd for periodic checkpoints as you normally would for the standard universe.

In the job ad, set '+WantCheckpointSignal = TRUE'. You may also set '+CheckpointSig' to the signal you want sent. The signal will otherwise default to the soft kill signal (which defaults to SIGTERM).

If you also want to checkpoint on eviction, set 'when_to_transfer_output = ON_EXIT_OR_EVICT' in the job ad. You may also want to set 'kill_sig' to match '+CheckpointSig', so that when it's evicted, your job receives its checkpoint signal rather than its soft-kill signal (usually SIGTERM).
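To show how these settings fit together, here is a small Python sketch that assembles the corresponding submit-file lines. The function and parameter names are hypothetical, the signal SIGUSR1 and the file name checkpoint.json are only examples, and the quoting of the '+CheckpointSig' value is an assumption that you should check against your version.

```python
def checkpoint_submit_lines(checkpoint_sig=None, on_evict=False, checkpoint_files=()):
    """Return submit-file lines for the settings described in this section."""
    lines = ["+WantCheckpointSignal = TRUE"]
    if checkpoint_sig:
        # Assumption: the attribute takes a quoted signal name.
        lines.append('+CheckpointSig = "%s"' % checkpoint_sig)
    if on_evict:
        lines.append("when_to_transfer_output = ON_EXIT_OR_EVICT")
        if checkpoint_sig:
            # Match kill_sig so eviction delivers the checkpoint signal.
            lines.append("kill_sig = %s" % checkpoint_sig)
    if checkpoint_files:
        # Any transfer_output_files list must include the checkpoint files.
        lines.append("transfer_output_files = " + ", ".join(checkpoint_files))
    return lines


print("\n".join(checkpoint_submit_lines("SIGUSR1", on_evict=True,
                                        checkpoint_files=["checkpoint.json"])))
```

Called with no arguments, the function returns only the '+WantCheckpointSignal = TRUE' line, which leaves the signal at its default.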

Random Thoughts

The code currently doesn't update the job ad on a successful checkpoint; it also does not correctly tell the shadow (and therefore the schedd or job log) if the job successfully checkpointed (it will always say it didn't).

The remote usage accounting is probably entirely wrong.