The "TransferDaemon" feature of HTCondor exists in four major parts: 1. A standalone daemon called the TransferD which wraps the FileTransfer object. 2. A "transfer daemon manager" which is an object in the schedd, but could be (and probably should be) converted to another standalone daemon. It is responsible for starting, stopping, and maintaining the status of each transfer daemon associated with a single user. 3. A daemon client object which is used to talk to the TransferDaemon. 4. A TransferRequest object which describes a request to transfer a sandbox. The Purpose ----------- The purpose of this feature was three fold: 1. To be able to transfer files to and from HTCondor controlled endpoints (which might not be the actual submit machine) asynchronously and under priviledge separation scenarios. 2. To decentralize the storage of job sandboxes, spool directories, etc. 3. To actually figure out and be able to schedule network I/O, disk bandwidth, etc so we don't crush machines when moving files around. Overall Design -------------- Job Submission: [This functionality is implemented, but turned off suibmit.cpp by the default value of the STMethod variable.] When condor_submit connects to the schedd, the schedd uses the TDMan oject to create/locate a TransferDaemon for that user. The address of the TD is given back to condor_submit. condor_submit will first up a DaemonClient to the TD and use the interface to move the sandbox over to it. Job Execution: [This is not implemented, but had worked in demo mode when I did it "by hand". The pieces are present and it would be about 1 month of coding/integration/testing work before it would function in a production environment.] When the schedd starts a job, it starts/locates a TD and tells the shadow of its location. The shadow contacts the startd as normal and a starter is created. The shadow gives the starter the location of the TD instead of (currently) itself as to where to find the job's files. The starter then starts its own TD (as the uid/gid of the job itself) and tells it the location of the TD on the submit side and that it should download the sandbox to the current working directory. Additional Facts: All communication between the transferd and anything else happens with message packets which are exactly classads. When TransferRequests are sent to the transferd, either via stdin or via the control connection, they are enclosed in an envelope called the "encapsulation method". The envelope is a single ASCII newline delimited line that is converted to an encapsulation enumeration. Currently, there is only one, which represents the format of old classads. The additional data on the fd is in whatever form the encapsulation method dictated it to be. The call/response ordering of the network communications allows notification of success/failure at the correct points to the client for all operations. Reasons for success or failure of a portion of the protcol are embedded in the return ads. This method of using classads for all communication and having correct call/response locations will make protocol versioning and backwards compatibility much easier to perform. The TD can be configured to exit if no tranfer requests have been asked of it. The TD doesn't have to be long living. It only has to be alive while there are TransferRequests that need to be handled. TransferRequest Object (treq) ---------------------- This object, whose defintion and implementation are found in: condor_includes/condor_transfer_request.h condor_utils/transfer_request.cpp is a fundamental object in the design that represents a concrete desire for transfering files to/from a location. It represents a unit of work. The treq is used by different parts of the system and so often only a subset of the API to the object is generally used at any one time. Notably, the treq contains a SimpleList of classads that are all associated with the transfer desired. Depending upon the file transfer protocol specified, they could be jobads or some other kind of ad describing what to do to some backend of the TransferD. Since the code only supports the "HTCondor File Transfer Protocol" as defined by the FileTransfer object, these ads are just a list of jobads. The class' constructor can take a classad (whose schema is enforced in TransferRequest::check_schema()) which will initialize all of the internal variables. This is useful when the treq comes in over the network. A feature of the treq object that is heavily used by the schedd are callbacks which can be associated with a TransferRequest. The schedd, while it is processing the treq, can interpose behavioral changes to the treq via these callbacks: pre_push post_push update reaper These mean, respectively: pre_push Just before giving the treq to a transferd, yoiu have the ability to just in time modify the information in the treq. An example is modification of the jobads to mark the start time of the transfer. post_push Just after the treq is pushed, call this callback. An example is when the schedd tells the client concrete information about the treq that the TransferDaemon provided to the schedd after accepting it. update When a transferd finished transferring some files, it sends an update to the schedd, this is the callback which gets called in that scenario. An example is removing the job off of hold on a successful transfer. reaper When the transferd exits, this will be the associated callback for the pending treq. The behavioral changes that callbacks can decide in the processing of the treq can be: TREQ_ACTION_CONTINUE Keep processing the treq to the next stage. TREQ_ACTION_FORGET Stop processing the treq, remove it from all structures, but don't free it. This assumes the callback took control of the memory. TREQ_ACTION_TERMINATE Remove the treq from all consideration and then free it.