
4 Summary of Checkpoint and Restart

Now that we've seen some of the details, we can look at the "big picture" of how checkpoint/restart is accomplished. When our original process is born, Condor code installs a handler for the checkpointing signal, initializes its data structures, and then calls the user's main() routine. At some arbitrary point during execution of the user code, the checkpoint signal is delivered, invoking the checkpoint() routine. This routine records information about open files and our current location on the stack in the data area, then writes the data, shared library, and stack areas into a checkpoint file. The process then either exits immediately or returns from the signal handler and continues processing, depending on whether the checkpoint signal was a vacate signal or a periodic checkpoint signal.

At restart time, Condor executes the same program with a special set of arguments, which cause the restart() routine to be called instead of the user's main(). The restart() routine overwrites its own data segment with the one stored in the checkpoint file. It now has the list of open files, signal handlers, etc. in its own data space, and restores those parts of the state. Next, it switches its stack to a temporary location in the data space and overwrites the process's own stack with the one saved in the checkpoint file. The restart() routine then returns to the stack location that was current at the time of the checkpoint; that is, restart() returns into the checkpoint() routine. Now the checkpoint() routine returns, but recall that this routine is a signal handler, so it returns to the user code that was interrupted by the checkpoint signal. The user code resumes from where it left off, and is none the wiser.

condor-admin@cs.wisc.edu