Dmtcp Condor Dev

  • User documentation: DmtcpCondor
  • Tickets about DMTCP
  • Releases are now in /p/condor/public/binaries/contrib, named dmtcp_condor_integration-*-Any-Any.tar.gz where * is the version number.
  • The Git repository is in /p/condor/repository/dmtcp_condor.git/ and can be cloned with
    git clone /p/condor/repository/dmtcp_condor.git/
  • When making a new release, you need to change the version number inside of shim_dmtcp.
  • DMTCP has a build option for HTCondor. The normal behavior is to try to checkpoint sockets. When built for HTCondor, DMTCP behaves like HTCondor's checkpointing support: attempts to checkpoint are delayed until all sockets are closed.
  • Some changes to shim_dmtcp come from Michael Hanke, the Debian maintainer for HTCondor (among other packages)

Open tasks:

  • How can we identify platforms that a checkpoint is compatible with? Machine ClassAd attributes of possible interest include CheckpointPlatform and OSKernelRelease.
  • Should the signal trapped by shim_dmtcp be overridable via a command-line option? Are there any circumstances in which we care?
  • The "normal" standard output and error should come from the real job, and output from shim_dmtcp and the related tools should go into a user-specified file. At the moment the situation is backward. Note that if shim_dmtcp utterly fails, it should report something to the main stderr.
  • At the moment the script just waits 3 seconds for dmtcp_command to finish whatever it was doing, be it checkpointing or terminating the running processes. (This is the checkpoint function in shim_dmtcp, look for calls to delay.) dmtcp_command returns immediately, but the actual work may take longer. psilord feels 3 seconds is extremely reliable, but it's not guaranteed. Ideal solution: see if dmtcp_command has a --block-until-work-is-really-done option. If not, ask for one upstream.
  • shim_dmtcp line 217 "ckptsig=`${CONDOR_PATH}/bin/condor_status -l $host 2>&1"..., don't bother with egrep; just use -format!
  • shim_dmtcp line 280 "/bin/sleep 2" change to "delay 2"
  • The dmtcp_restart_script.sh symlink renaming code around line 252 in shim_dmtcp exists to work around perceived limitations in HTCondor. dmtcp_restart_script.sh is a symlink, and psilord thinks HTCondor does something terrible, like returning it as a 0-length file. (A copy would be fine!) Verify claim. If true, report bug to HTCondor. If not true and hasn't been true for "a while", remove the code.
  • shim_dmtcp line 328, the "Close the gcb" section, is probably moot in the new GCB-free world and can be deleted.
  • shim_dmtcp line 355 (logit "could not start dmtcp_coordinator"): should it "exit 1"? Should we continue? What will happen?
  • shim_dmtcp line 350 (port=`${dmtcp_coordinator} --port 0 --exit-on-last --interval ${CKPTINT} --background 2>&1 | grep "Port:" | /bin/sed -e 's/Port://g' -e 's/[ \t]//g'`) - We just threw away potentially useful output from the dmtcp_coordinator! Write output and error to a temporary file, grep the port out of that, dump contents of temporary file to stdout so we have it for debugging. Also: ask upstream to add a --port-file option that will write the port and nothing else to a file, to simplify extracting it.
  • shim_dmtcp line 377 (${dmtcp_checkpoint} --port $port -j "$@" <$STDIN 1>$STDOUT 2>$STDERR &) - Why background this? psilord believes that it is necessary for our signal trap to work. Verify. If true, add a comment noting that.
  • shim_dmtcp line 383 (if [ $ret -gt 128 ] ; then) - The purpose of this code block is not clear. It seems to handle the user executable exiting on a signal. Needs investigation.
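
One possible explanation for the backgrounding and the "$ret -gt 128" check in the last two items, offered as a sketch rather than a verified reading of shim_dmtcp: a POSIX shell defers a trap handler until the current foreground command finishes, but a wait on a background job returns as soon as a trapped signal arrives, with a status above 128. All names below (CKPT_SIGNAL, on_checkpoint_signal, run_job) are hypothetical and do not come from the script.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, not taken from shim_dmtcp.

# Signal to trap; INT matches "kill_sig = 2" in the example submit file.
CKPT_SIGNAL=${CKPT_SIGNAL:-INT}

got_signal=0
on_checkpoint_signal() {
    # A real handler would ask the coordinator to checkpoint here.
    got_signal=1
}

# Run "$@" in the background and wait for it.
# Sets job_pid to the child's PID and ret to the status of wait.
run_job() {
    trap on_checkpoint_signal "$CKPT_SIGNAL"
    "$@" &
    job_pid=$!
    ret=0
    # wait returns with a status above 128 as soon as a trapped signal
    # arrives; a foreground command would delay the handler instead.
    wait "$job_pid" || ret=$?
    trap - "$CKPT_SIGNAL"
}
```

A status above 128 is ambiguous: it can mean that wait was interrupted by the trapped signal, or that the job itself was killed by a signal. A caller would need to tell the two apart, for example by checking whether job_pid is still alive with kill -0.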

  • Improve output messages in shim_dmtcp:
    • line 109 "echo "Terminating..." >&2". Should clarify that it's an internal error, and that getopt failed
    • line 128 "*) echo "Internal error! ($1)"; exit 1;;" - Not an internal error; invalid usage/bad arguments.
    • line 133 "printf "Need at least one argument.\n\n"" - append ", the program to run."
  • Comment typos
    • shim_dmtcp line 195 "# Try to idenitify"...: "idenitify" should be "identify".
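
The suggestion in the task list for the dmtcp_coordinator call on line 350 (capture the output in a temporary file, extract the port from it, then replay the output for debugging) could look roughly like this. The helper name start_and_get_port is hypothetical, and the pattern assumes a "Port:" line in the output, as the existing grep implies.

```shell
#!/bin/sh
# Sketch only: start_and_get_port is a hypothetical helper.
# Runs the given command, keeps all of its output for debugging,
# and sets $port to the value found on the first "Port:" line.
# Returns 0 if a port was found, non-zero otherwise.
start_and_get_port() {
    coord_log=$(mktemp) || return 1
    # Capture stdout and stderr together; a failed start is detected
    # below by the missing port rather than by the exit status.
    "$@" >"$coord_log" 2>&1 || true
    # Keep what follows "Port:" on the first matching line, minus whitespace.
    port=$(grep 'Port:' "$coord_log" | head -n 1 |
           sed -e 's/.*Port://' -e 's/[[:space:]]//g') || port=""
    # Replay the captured output so it is not thrown away.
    cat "$coord_log"
    rm -f "$coord_log"
    [ -n "$port" ]
}
```

In the shim this might be called as start_and_get_port "${dmtcp_coordinator}" --port 0 --exit-on-last --interval "${CKPTINT}" --background, with the existing "could not start dmtcp_coordinator" message logged when it returns non-zero.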

General system:

  • The user submits a job. Notable changes to their submit file are:
    # Submit shim_dmtcp instead of the normal job.
    executable=shim_dmtcp
    
    # These are IN ADDITION to the user's "real" binary and input files!
    transfer_input_files = dmtcp_checkpoint,dmtcp_coordinator,\
        dmtcp_command,dmtcp_restart,dmtcphijack.so,libmtcp.so,\
        mtcp_restart
    
    # Argument  Meaning
    # --log     log file name for actions in shim_dmtcp script,
    #           if n/a use /dev/null
    # --stdin   stdin file, if n/a use /dev/null
    # --stdout  stdout file, if n/a use /dev/null
    # --stderr  stderr file, if n/a use /dev/null
    # --ckptint checkpointing interval in seconds
    # 1         the executable name you should have transferred in
    # 2+        arguments to the executable
    #
    # Note that stdout/stderr files are for output from the "real"
    # binaries.  The normal output/error will exclusively be
    # messages from shim_dmtcp and the DMTCP tools.
    arguments = --log shim_dmtcp.$(CLUSTER).$(PROCESS).log \
        --stdin foo.py \
        --stdout job.$(CLUSTER).$(PROCESS).out \
        --stderr job.$(CLUSTER).$(PROCESS).err \
        --ckptint 1800 \
        ./REAL_BINARY example-argument-one example-argument-two
    
    
    # These are all required by DMTCP. JALIB is an internal DMTCP
    # library ("Jason's library").  If your job needs more
    # environment options set, just append them.
    environment=DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null;\
        DMTCP_PREFIX_ID=$(CLUSTER)_$(PROCESS)
    
    # On kill, tell our shim to checkpoint. You can change this, but will
    # need to change shim_dmtcp as well
    kill_sig = 2
    
    # If your pool isn't homogeneous (nearly identical distributions
    # and updates), your checkpoints may not be portable.  The exact
    # options needed aren't yet known, but these may work.  Note that
    # you'll need to identify the exact values yourself; these won't
    # work for you!
    Requirements = \
        (CheckpointPlatform == "LINUX INTEL 2.6.x normal 0x40000000"\
        && OSKernelRelease == "2.6.18-128.el5")
    

  • The shim starts up the dmtcp_coordinator. It is given instructions to checkpoint at regular intervals.
  • The shim starts up the user job, sneaking dmtcphijack.so into the runtime. (Probably using LD_PRELOAD?)
  • When restarting from a checkpoint, shim_dmtcp will notice the dmtcp_restart_script.sh created by DMTCP and will run that instead.
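
The start-or-restart decision in the last step can be sketched as follows. This is an illustration under assumptions, not the actual logic in shim_dmtcp: the function name choose_launch_mode and its directory argument are hypothetical, and only the file name dmtcp_restart_script.sh comes from the notes above.

```shell
#!/bin/sh
# Sketch only: choose_launch_mode is a hypothetical helper.
# Prints "restart" if DMTCP left a restart script in the given
# directory (default: current directory), otherwise "fresh".
choose_launch_mode() {
    work_dir=${1:-.}
    # -e follows symlinks, so a dangling symlink counts as missing.
    if [ -e "$work_dir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo fresh
    fi
}
```

The shim would then either run the restart script or start dmtcp_checkpoint with the user's executable, depending on the printed value.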

Contents of the DMTCP HTCondor integration repository (and tarballs at the moment):

  • shim_dmtcp - The heart of our system. Submit this instead of your job, passing suitable options, to get checkpointing behavior.
  • job.sub - Example submit file using shim_dmtcp; used as part of manual testing.
  • Makefile - Really just for development use. Collects DMTCP files and builds/submits a little test program. Note that the path to mtcp_restart varies in different distributions.
  • testing and development tools
    • foo.py - Python script used for testing Python under DMTCP
    • foo.c - C program used for testing under DMTCP