How To Rendezvous In Parallel Universe

How to Rendezvous in the HTCondor Parallel Universe

Introduction

When running parallel universe jobs, often one node of a job would like to send some small amount of data to the other nodes in the job. This can be an IP address and port, or a filename, or some other object to synchronize on. HTCondor chirp command makes this straightforward. The basic idea is that the sending process writes the information into the job ad with condor_chirp sett_job_attr, and the receivers poll this information from the job classad with condor_chirp get_job_attr.

Example submit file

Assuming that the dedicated scheduler has been configured, create a submit file like this:

universe = parallel

executable = run.sh

should_transfer_files = yes
when_to_transfer_output = on_exit
getenv = true

output = out.$(Node)
error  = err.$(Node)
log    = log

machine_count = 2
+WantIOProxy = true
queue

The run.sh script

The run.sh script first checks to see if it is the client or server. Arbitrarily, we pick node 0 for the server, and 1 for the client. Node 0 picks a string, inserts it into the job ad attribute called JobServerAddress , and sleeps. Node 1 polls for this value, prints it out when it is available, then exits. Obviously, this value could be the address of a port listening on, or anything else the two processes would like to synchronize on.

#!/bin/sh

CONDOR_CHIRP=`condor_config_val LIBEXEC`/condor_chirp

get_job_attr_blocking() {
    while /bin/true
    do
        JobServerAddress=`$CONDOR_CHIRP get_job_attr $1`
        if [ $? -ne 0 ]; then
            echo "Chirp is broken!" 1>&2
            exit 1
        fi
        if [ "$JobServerAddress" != "UNDEFINED" ]; then
            echo "$JobServerAddress"
            exit 0
        fi
        sleep 2
    done
}

if [[ $_CONDOR_PROCNO = "0" ]]
then
    $CONDOR_CHIRP set_job_attr JobServerAddress '"This is my address"'

    # Do other server things here...
    sleep 60
    exit 0
fi


if [[ $_CONDOR_PROCNO = "1" ]]
then
    JobServerAddress=`get_job_attr_blocking JobServerAddress`
    if [ $? -ne 0 ]; then
        echo "Chirp is broken"
        exit 1
    fi

    # I've got the address, do client things here like connect to the server..
    echo "JobServerAddress is $JobServerAddress"
    exit 0
fi

exit 0

Copying a file from one node to another

Here is a more sophisticated example, using the nc program (netcat) to copy a file from one machine to another. Note that there is no security at all in this file copy, any process on the network can connect and send data to the listening process, so you'd only want to run this on a trusted, protected network.

The submit file is the same as above, but the run.sh is a bit more involved:

#!/bin/sh

CONDOR_CHIRP=`condor_config_val LIBEXEC`/condor_chirp

get_job_attr_blocking() {
    while /bin/true
    do
        JobServerAddress=`$CONDOR_CHIRP get_job_attr $1`
        if [ $? -ne 0 ]; then
            echo "Chirp is broken!" 1>&2
            exit 1
        fi
        if [ "$JobServerAddress" != "UNDEFINED" ]; then
            echo "$JobServerAddress"
            exit 0
        fi
        sleep 2
    done
}

HOSTNAME=`hostname`

echo "I am $HOSTNAME, node $_CONDOR_PROCNO"

if [[ $_CONDOR_PROCNO = "0" ]]
then
    echo "I'm the server"

    # Start netcat listening on an ephemeral port (0 means kernel picks port)
    #    It will wait for a connection, then write that data to output_file
    # -d is needed or else it will give up too early.
    nc -d -l 0 > output_file &

    # pid of the nc running in the background
    NCPID=$!

    # Sleep a bit to ensure nc is running
    sleep 2

    # parse the actual port selected from netstat output
    NCPORT=`
        netstat -t -a -p 2>/dev/null |
        grep " $NCPID/nc" |
        awk -F: '{print $2}' | awk '{print $1}'
    `


    echo "Listening on $HOSTNAME $NCPORT"
    $CONDOR_CHIRP set_job_attr JobServerAddress \"${HOSTNAME}\ ${NCPORT}\"

    # Do other server things here...
    sleep 60
    echo "Here's the output:"
    cat output_file
    exit 0
fi


if [[ $_CONDOR_PROCNO = "1" ]]
then
    echo "I'm the client"

    JobServerAddress=`get_job_attr_blocking JobServerAddress`
    if [ $? -ne 0 ]; then
        echo "Chirp is broken"
        exit 1
    fi
    echo "JobServerAddress: $JobServerAddress";

    host=`echo $JobServerAddress | tr -d '"' | awk '{print $1}'`
    port=`echo $JobServerAddress | tr -d '"' | awk '{print $2}'`

    echo "Sending to $host $port"
    nc $host $port < /etc/hosts
    echo "Sent $?"

    # Don't do any other client stuff, just exit
    exit 0
fi

exit 0