1 Introduction

Condor is a distributed batch processing system for UNIX developed at the University of Wisconsin. It schedules jobs on idle workstations in a network, making use of machine time that would otherwise go unused. A primary goal in Condor is to ensure that owners pay no penalty for adding their workstations to the Condor pool. A job must therefore be able to vacate a workstation immediately when its owner begins to use it, and then either migrate to another idle workstation or wait in the queue until one becomes idle.

To allow migrating jobs to make progress, Condor must be able to restart a vacated job from where it left off. Condor does this by writing a checkpoint of the process's state before vacating. A checkpoint file contains the process's data and stack segments, as well as information about open files, pending signals, and CPU state. Condor gives a program the ability to checkpoint itself by providing a checkpointing library. Programs submitted to run under Condor are re-linked (but not re-compiled) to include this library.
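
To make the checkpoint contents more concrete, the following is a minimal Python sketch of how a process might record two of the items listed above: the state of its open files and its pending signals. It is not Condor's checkpointing library or file format. The function names, the dictionary layout, and the `open_table` bookkeeping structure (a mapping from file descriptor to the path and flags used to open it) are all assumptions made for illustration. The data segment, stack segment, and CPU state are left out because they cannot be captured portably from Python.

```python
import os
import signal


def snapshot_open_files(open_table):
    """Record, for each tracked descriptor, what is needed to reopen it.

    open_table is a hypothetical bookkeeping dict {fd: (path, flags)} that
    the program maintains itself; it is an assumption of this sketch.
    """
    records = []
    for fd, (path, flags) in open_table.items():
        # lseek with offset 0 relative to SEEK_CUR reports the current
        # file position without moving it.
        offset = os.lseek(fd, 0, os.SEEK_CUR)
        records.append({"path": path, "flags": flags, "offset": offset})
    return records


def snapshot_pending_signals():
    """Return the numbers of signals that were raised while blocked."""
    return sorted(int(sig) for sig in signal.sigpending())


def take_checkpoint(open_table):
    """Gather the open-file and pending-signal portion of a checkpoint.

    Data segment, stack segment, and CPU state are deliberately omitted.
    """
    return {
        "open_files": snapshot_open_files(open_table),
        "pending_signals": snapshot_pending_signals(),
    }


def restore_open_files(records):
    """Reopen each recorded file and seek back to the saved offset."""
    restored = []
    for rec in records:
        fd = os.open(rec["path"], rec["flags"])
        os.lseek(fd, rec["offset"], os.SEEK_SET)
        restored.append(fd)
    return restored
```

A process that has read part of a file would call `take_checkpoint` with its table of open descriptors and later pass the saved `open_files` list to `restore_open_files` to continue reading from the same offset. The reopened descriptors may not have the same numbers as the originals, and the sketch only reports pending signals without re-raising them on restart; a fuller version would need to handle both.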