Transfer Daemon Overview

The "TransferDaemon" feature of HTCondor exists in four major parts:

1. A standalone daemon called the TransferD which wraps the FileTransfer object.
2. A "transfer daemon manager" which is an object in the schedd, but could
   be (and probably should be) converted to another standalone daemon. It is
   responsible for starting, stopping, and maintaining the status of each
   transfer daemon associated with a single user.
3. A daemon client object which is used to talk to the TransferDaemon.
4. A TransferRequest object which describes a request to transfer a sandbox.

The Purpose
-----------

The purpose of this feature was three fold:

1. To be able to transfer files to and from HTCondor controlled endpoints
  (which might not be the actual submit machine) asynchronously and under
  priviledge separation scenarios.

2. To decentralize the storage of job sandboxes, spool directories, etc.

3. To actually figure out and be able to schedule network I/O, disk bandwidth,
  etc so we don't crush machines when moving files around.

Overall Design
--------------

Job Submission:

[This functionality is implemented, but turned off suibmit.cpp by the
default value of the STMethod variable.]

When condor_submit connects to the schedd, the schedd uses the TDMan
oject to create/locate a TransferDaemon for that user. The address of the
TD is given back to condor_submit. condor_submit will first up a DaemonClient
to the TD and use the interface to move the sandbox over to it.

Job Execution:

[This is not implemented, but had worked in demo mode when I did it
"by hand". The pieces are present and it would be about 1 month of
coding/integration/testing work before it would function in a production
environment.]

When the schedd starts a job, it starts/locates a TD and tells the shadow
of its location. The shadow contacts the startd as normal and a starter
is created.  The shadow gives the starter the location of the TD instead
of (currently) itself as to where to find the job's files. The starter
then starts its own TD (as the uid/gid of the job itself) and tells it the
location of the TD on the submit side and that it should download the sandbox
to the current working directory.

Additional Facts:

All communication between the transferd and anything else happens with
message packets which are exactly classads.

When TransferRequests are sent to the transferd, either via stdin or via
the control connection, they are enclosed in an envelope called the
"encapsulation method". The envelope is a single ASCII newline delimited
line that is converted to an encapsulation enumeration. Currently, there is
only one, which represents the format of old classads. The additional data
on the fd is in whatever form the encapsulation method dictated it to be.

The call/response ordering of the network communications allows
notification of success/failure at the correct points to the client
for all operations. Reasons for success or failure of a portion of the
protcol are embedded in the return ads.  This method of using classads
for all communication and having correct call/response locations will make
protocol versioning and backwards compatibility much easier to perform.

The TD can be configured to exit if no tranfer requests have been asked
of it.

The TD doesn't have to be long living. It only has to be alive while there
are TransferRequests that need to be handled.

TransferRequest Object (treq)
----------------------

This object, whose defintion and implementation are found in:

condor_includes/condor_transfer_request.h
condor_utils/transfer_request.cpp

is a fundamental object in the design that represents a concrete
desire for transfering files to/from a location. It represents a unit
of work. The treq is used by different parts of the system and so often
only a subset of the API to the object is generally used at any one time.

Notably, the treq contains a SimpleList of classads that are all
associated with the transfer desired. Depending upon the file transfer
protocol specified, they could be jobads or some other kind of ad
describing what to do to some backend of the TransferD. Since the code
only supports the "HTCondor File Transfer Protocol" as defined by the
FileTransfer object, these ads are just a list of jobads.

The class' constructor can take a classad (whose schema is enforced
in TransferRequest::check_schema()) which will initialize all of the
internal variables. This is useful when the treq comes in over the network.

A feature of the treq object that is heavily used by the schedd are
callbacks which can be associated with a TransferRequest. The schedd,
while it is processing the treq, can interpose behavioral changes to the
treq via these callbacks:

pre_push
post_push
update
reaper


These mean, respectively:

pre_push
  Just before giving the treq to a transferd, yoiu have the ability to
  just in time modify the information in the treq. An example is modification
  of the jobads to mark the start time of the transfer.

post_push
  Just after the treq is pushed, call this callback. An example is
  when the schedd tells the client concrete information about the treq
  that the TransferDaemon provided to the schedd after accepting it.

update
  When a transferd finished transferring some files, it sends an update to
  the schedd, this is the callback which gets called in that scenario.
  An example is removing the job off of hold on a successful transfer.

reaper
  When the transferd exits, this will be the associated callback for the
  pending treq.

The behavioral changes that callbacks can decide in the processing of
the treq can be:

TREQ_ACTION_CONTINUE
 Keep processing the treq to the next stage.

TREQ_ACTION_FORGET
 Stop processing the treq, remove it from all structures, but don't free it.
 This assumes the callback took control of the memory.

TREQ_ACTION_TERMINATE
 Remove the treq from all consideration and then free it.