Proposal Shared Executables Ickpts In The Spool

The ticket #31 tracks work on this.

The problem:

We have actual users using HTCondor-C submit to submit hundreds or thousands of jobs with the same executable. For HTCondor-C copy_to_spool must be true. Furthermore, for HTCondor-C, we can't use clusters to share an executable; each job becomes a new cluster. You end up with one copy of the executable per job. Given a moderately large executable, you can end up with gigabytes of duplicate executables. This is a waste of disk space and bandwidth.

The goal:

When copy_to_spool=true, avoid multiple copies of the same executable in the SPOOL.

(The more general goal would be to avoid multiple copies of any input file in the SPOOL, but that's a harder problem as the input files can potentially be modified.)

The plan:

  1. condor_submit: Do we skip copying to the spool entirely? If yes, none of this matters.

    • is copy_to_spool false?

    • does the cmd include macros, eg $(OpSys) or $(Arch)?

      • Such binaries can eventually end up in the spool. They won't get this benefit.

  2. condor_submit: Calculate a hash of the executable. Insert it into the job's ClassAd . It might looks like CmdHashMD5="d41d8cd98f00b204e9800998ecf8427e".

    • We calculate on the submit side because:

      • It allows the schedd to potentially never look at the file, saving network and schedd processor time for remote submits.

      • submit can be aware that it's submitting the same binary 50 times in different clusters, the schedd can't. This means we can do the hash once on the client instead of 50 times on the server.

    • Arguably there is a risk of calculating it on the client side: the client can lie. Later steps will reveal that we don't link between jobs of different Owners, so if the client lies about the hash, it only hurts themself.

    • The exact name and possibly format of the attribute will be determined later. For the sake of discussion we'll assume it's called CmdHashMD5 for now.

    • We'll be reusing the work done by Ian for signed classads.

  3. condor_schedd: Take a list of things to include; for v1 it will be "CmdHashMD5". Encode the attribute/value list of those things into a string that is a valid filename. See 'Escaping algorithm' below. Store this as CmdHash . eg CmdHash="CmdHashMD5-d41d8cd98f00b204e9800998ecf8427e". If one or more of the "things" is missing (say, it's an old condor_submit), then CmdHash="invalid".

    • Why store a subset of the classad? Extensibility. We can change or add hash algorithms in the future. We can let the administrator or user define their own hash algorithms if they want, say, something fast and risky, or something slow and more secure.

  4. condor_submit to condor_schedd: "May I send you the executable?"

  5. condor_schedd: Escape $(Owner) per the algorithm below. The hash path is now "$SPOOL/exe-$EscapedOwner-$CmdHash". It will look something like "/home/condor/spool/exe-adesmet-CmdHashMD5-d41d8cd98f00b204e9800998ecf8427e". If you don't have a CmdHash at all (Say, an old job from an old schedd), or it's the special value "invalid" the resulting hash path should be a special value of "invalid." If shared executables was disabled, explicitly set the hash path to "invalid."

      • We're prepending the Owner for security reasons. The hash we choose may be broken, allowing an attacker to provide a different binary with the same hash.

        • If we didn't prepend the Owner, we could have a situation like this: The good user might occasionally submit goodjob, which hashes to 0010. The evil user might create eviljob, which also hashes to 0010. The evil user now waits for good user to not have any jobs submitted, then submits eviljob on hold. Now they wait. Next time the good user submits goodjob, they'll actually find and use eviljob!

      • We're prepending 'exe-' so they sort together, not scattered around before and after other files and directories.

      • Handily, a sysadmin unsure which jobs are using which files can ask: condor_q -constraint 'CmdHash=="CmdHashMD5-d41d8cd98f00b204e9800998ecf8427e"'

  6. condor_schedd: Does the hash path find an existing file?

    • No: (The path doesn't exist, or it's invalid) Tell condor_submit: "Send it along." Write the incoming file to the ickpt. Is hash path valid, just not currently existing?

      • No, the hash path is the special value "invalid": Skip to step 7.
      • Yes: Hard link the ickpt to the hash path. Failure to link is okay; we may end up with duplicate copies in the SPOOL, but everything will work fine. Log a warning and move on.
    • Yes: Great, we'll reuse that. Tell condor_submit: "No thanks." Hard link from hash path to the ickpt ("$SPOOL/cluster7524.ickpt.subproc0").

  7. At this point the job is queued and all is well. We'll continue when it's time to remove the ickpt file because the job is leaving the queue for whatever reason:

  8. condor_schedd: Time to remove the job from the queue. Check the hard link count for the ickpt file.
    • If it's 1, this is a job that didn't have a hash.
    • Is it 2? We're the last user, unlink the hash path.
    • If it's 3 or above, there are other users of this file. Leave it be.

  9. condor_schedd: unlink the ickpt file, same as before.

Escaping algorithm:

The CmdHash is a ClassAd , or at least a partial ClassAd serialized into a string. The resulting string must be a valid ClassAd string, and a valid file name on every operating system HTCondor supports (forbidden characters include, but are not limited to: ":/\"). We'll also need to reserve one or more characters to indicate escaped characters and to delimit between fields and records. Furthermore, the resulting CmdHash should be reasonably short, to avoid path limits and to keep sysadmins happy. It would be nice if it could also not look too much like line noise. No two different ClassAds should ever generate the same CmdHash , however, two identical ClassAds must generate the same CmdHash . Reversability of the CmdHash would be nice, but isn't required. We might want to avoid spaces for the sake of naive scripts.

To meet these goals, we'll err on the side of caution and use a whitelist; anything not in the whitelist will be escaped.

The following is just a very rough proposal.

The Class will be written in the form "KEY1-VALUE1-KEY2-VALUE2-KEY3-VALUE3-..." Each KEY and VALUE will be escaped. The order matches the order passed in.

To escape a portion, take each part in turn. For each part, for each character not matched by the regex [A-Za-z0-9_.] will be escaped. To escape a portion, replace it with an "%AA" where "AA" is the hexidecimal ASCII value of the character replaced. So + will map to %2B. This is very similar to URL escaping.

This is woefully non-Unicode ready, but neither is the rest of HTCondor.

Open issues:

  • Can we add more characters to the whitelist? A bigger whitelist means less annoying escaped characters adding noise for the poor admin.
  • Is % a good escape character? Is this a safe character on all of the filesystems we care about? Other possibilities: "(AA)" " _AA_ "
  • Is - a good field seperator character? It should be safe, but it seems likely to appear in keys or values and thus we'll need to escape it. For example, an Owner of "de-smet" is plausible. We might go whole hog and use URL encoding with K1=V1&K2=V2 or K1=V1;K2=V2. Unfortunately both & and ; have special meaning in major Unix shells.

Security implications:

This is insecure in the face of CLAIMTOBE, ANONYMOUS, or any other situation in which multiple users map to the same Owner. Don't do that. Even without this behavior, that's very insecure and users can mess with each other using condor_qedit, condor_rm and more.

There is the specific case where a users use a front end, maybe a web interface, to HTCondor, and behind the scenes their jobs all map to the same user. These users don't have direct access to the schedd, so they can't mess with each others jobs in the queue. Even without this behavior, these jobs can mess with each other; two jobs scheduled on the same machine can mess with each other. (Dedicated run accounts can protect against this, at the cost that jobs don't run as the Owner.) Two jobs scheduled one after the other on the same machine can mess with each other by the earlier process sneaking a rogue process out. (Again, dedicated run accounts solve this.) If this is the situation, shared executable functionality should be disabled. This seems likely to be a very unusual situation.

The default hash will be at least as secure as MD5 or SHA-1. This means the chance of an accidental collision is 2^128. That's basically never. If it's good enough for digital signatures, it's plenty good enough for HTCondor.

That said, there are known attacks on MD5, and similar attacks likely soon against SHA-1. This is (partially) why we're segregating files by user. A user can only attack his own jobs, and if he wants to do that there are simplier ways. We are not expecting to try and defend a user against his own malice.

General notes:

If someone downgrades, everything will still work fine, but lots of exe-username-MD5-asdfasdfasdf files will be left behind. The admin can safely remove all of them without harm! Of course, no one would ever downgrade HTCondor.

If an exe-username-MD5-asdfasdfasdf file is erroneously deleted, the system keeps working. Duplicate copies of the file may later be made, but existing jobs will never notice. (Removing the ickpt file for a given job will break it, but that's the existing situation.)

We're not going to tackle the issue of sharing sandboxes (input and output files) in the spool. There are many complications there: sandbox files can change, necessitating some sort of copy on write system; sandbox ownership changes over time based on the state of jobs. This might be useful future work, but it's a massive task.

By "link" we mean a hard link. On most Unix like systems, we're talking about link(). On Windows, we're talking about CreateHardLink() .

Future work:

Don't try link() if it never works

The schedd could track two things: has link() ever succeeded, and how many times has it failed? If it has failed 10 times (number pulled from the air) and never succeeded, disable this functionality with a warning in the log. This will keep us from wasting time trying to link(), and from spamming the log with warnings.

Eliminate ickpt files

A future goal could be to eliminate the Cluster*.ickpt.* files entirely, leaving only the exe-* files. Reference counting could be maintained entirely within the schedd. Essentially when the query 'CmdHash=="MD5-d41d8cd98f00b204e9800998ecf8427e" && Owner=="bob"' matches no jobs, delete the hash file. The benefits: fewer files in the directory (less likely to hit filessystem limits, easier for an admin to look through), simple directory size tools will work as expected, and it works fine on systems without link() or with a malfunctioning link().

Configuration CmdHash contents

The directions above specify that there is a list of "things" that end up in the CmdHash . For V1 this is a hard coded list, probably of CmdHashMD5. However, we can put the list into the hands of the user or administrator. The user might set CmdHashComponents in his job, or the admin might set CMD_HASH_COMPONENTS in his configuration. The admin's setting would obviously trump the users if present. Why let users and admins override things? Because they may want something more secure. MD5 and SHA-1 may be untrustworthy, so they use a strong encryption algorithm like AES as a hash. They can hand insert that hash into their job and use it. Similarly, they may be willing to live a bit more on the edge; they are submitting huge numbers of jobs in seperate clusters, so they want something faster than a proper secure hash. They might want to use the file's path, size, and timestamp. It's risky, but they're willing to live with it.

Protocol changes

Currently the spool file is sent over using the message CONDOR_SendSpoolFile (10017). To optimize compatibility, we'll add a new message, CONDOR_SendSpoolFile2 (number TDB). If condor_submit detects a new schedd, it wil use the new version, otherwise the old. This ensures backward and forward compatibility.

The new version does not have a filename argument, unlike the old version. (The filename was deprecated in the old version anyway. You had to send it, but it wasn't used.)

TO CONSIDER: The new version will take a mini-classad of arguments for future expansion. It will include ProtocolVersion=1 so future schedds can identify it.

The new version has new return codes:

  • 0 - Failure. Something has gone wrong.
  • 1 - Success. Send the file now.
  • 2 - Success. Don't send the file, continue as though it was transferred successfully.

The return code will never be less than 0 or more than 2.

Open questions:

How does this interact with sandbox chowning?

It shouldn't. The chowning shouldn't change the ownership of the executable.

TODO: Double check

What happens if the job changes the executable and it comes back on exit or evict with other output files?

That works? Still, it shouldn't be a problem. If a modified executable comes back, it should end up back in the sandbox, not in the top level SPOOL.

TODO: Double check

Do we need or want additional garbage collection?

condor_preen uses a simple time based rule to eliminate files (TODO double check). Is that enough? We can easily query all exe- hash files in use by the schedd. On the down side, if the garbage collection is done in a second process (like condor_preen), we add race conditions.

Should CmdHash be protected?

Pros:

  • Stops a user from shooting himself in the foot. Changing your CmdHash is a recipe for madness. You might (depending on when you do it), either change your executable or eliminate it.

  • Helps the admin identify what's in his SPOOL. If the CmdHash has changed, there isn't anything in the schedd indentifying which job claims which hash file. The admin is reduced to checking hard links, which is slow and a nuisance.

  • We can hard code to prepend "exe-$EscapedOwner" to the list of "things". This means the filename is an exact match for the CmdHash value, simplifying queries mapping jobs to files and vice versa. This might make life a bit easier for admins. It will also simplify garbage collection of unused files.

  • We may someday stop having ickpt files (see "Future work" above). If the user can change his own CmdHash , we'll lose track of his ickpt file. This isn't serious; we'll eventually reclaim it on a garbage collection pass, but it does at least temporarily waste disk space.

Cons:

  • Protected attributes add a bit more security complexity that must be maintained basically forever.