Using Cooley Annex on CHTC

Introduction

We are developing a way to run targeted CHTC jobs on the Cooley cluster at ALCF (Argonne Leadership Computing Facility). This is done via "hobble-in", a human-assisted form of glide-in that submits HTCondor startds as user jobs to Cooley's batch scheduler, Cobalt. The setup mimics how condor_annex allocates resources from AWS to run CHTC jobs (see UsingCondorAnnexOnChtc).

Each machine in Cooley has the following resources:

  • Two 2.4 GHz Intel Haswell E5-2620 v3 processors (6 cores per CPU, 12 cores total)
  • One NVIDIA Tesla K80 (with two GPUs)
  • 384 GB RAM, 24 GB GPU RAM (12 GB per GPU)
  • FDR Infiniband interconnect
  • 345 GB local scratch space
  • Red Hat Enterprise Linux Server release 6.8 (Santiago)

User Steps

As with other resources outside of CHTC, jobs do not run on Cooley by default. They must meet certain conditions regarding where and how they are submitted.

Required

Jobs that should be able to run on Cooley must be submitted to a schedd configured to flock to the CHTC Condor Annex collector. Currently, that means submitting from one of these machines:

  • submit-1.chtc.wisc.edu
  • submit-4.chtc.wisc.edu

The jobs' resource requests must not exceed the available resources of a Cooley machine (detailed above).
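
As a rough illustration of that limit check, the sketch below compares a job's requests against the nominal per-machine figures listed in the introduction. The dictionary keys (for example request_memory_gb) and the function name are hypothetical and chosen for readability; they are not HTCondor submit commands as written. The condor_status output later on this page reports 387137 MB per slot, so the usable memory may be somewhat below the nominal 384 GB.

```python
# Nominal per-machine limits, copied from the hardware list on this page.
# Key names are illustrative only, not real submit-file commands.
COOLEY_LIMITS = {
    "request_cpus": 12,        # two CPUs x 6 cores
    "request_gpus": 2,         # one Tesla K80 exposes two GPUs
    "request_memory_gb": 384,  # nominal RAM; usable memory may be lower
    "request_disk_gb": 345,    # local scratch space
}


def exceeded_limits(request, limits=COOLEY_LIMITS):
    """Return the sorted names of requests that exceed a Cooley machine."""
    return sorted(
        name
        for name, value in request.items()
        if name in limits and value > limits[name]
    )


ok_job = {"request_cpus": 4, "request_gpus": 1,
          "request_memory_gb": 32, "request_disk_gb": 50}
big_job = {"request_cpus": 16, "request_gpus": 1,
           "request_memory_gb": 512, "request_disk_gb": 50}

print(exceeded_limits(ok_job))   # []
print(exceeded_limits(big_job))  # ['request_cpus', 'request_memory_gb']
```

A job for which this list is non-empty could never match a Cooley slot, so it is worth checking before adding the Cooley attributes to a submit file.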

The jobs must indicate that they want to run on resources outside of CHTC and that they're willing to run on Cooley specifically. To do so, the user must add the following lines to their submit file:

+WantFlocking = True
+MayUseCooley = True

Optional

If the user wants the jobs to run only on Cooley, they can add a clause to the Requirements expression:

requirements = Facility =?= "Cooley"

If the user wants the jobs to run only on a specific Cooley annex, they can add a different clause to the Requirements expression:

requirements = AnnexName =?= "Test1"

These two additions to the Requirements expression can be used individually or together.
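
To show how the required attributes and the optional clauses fit together, here is a minimal sketch that assembles the Cooley-specific lines of a submit file. The function name and its parameters are hypothetical. It assumes the two clauses are combined with && and that the submit file has no other requirements line; if one already exists, the clauses would need to be merged into it by hand.

```python
def cooley_submit_lines(only_cooley=False, annex_name=None):
    """Build the Cooley-specific lines of an HTCondor submit file."""
    # These two attributes are required for any job that may run on Cooley.
    lines = ["+WantFlocking = True", "+MayUseCooley = True"]

    # Optional restrictions, each a separate clause of the requirements.
    clauses = []
    if only_cooley:
        clauses.append('(Facility =?= "Cooley")')
    if annex_name is not None:
        clauses.append('(AnnexName =?= "{}")'.format(annex_name))
    if clauses:
        lines.append("requirements = " + " && ".join(clauses))
    return lines


print("\n".join(cooley_submit_lines(only_cooley=True, annex_name="Test1")))
```

Calling the function with no arguments yields only the two required lines, which leaves the job free to run on Cooley or elsewhere.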

Staff Steps

Submitting jobs to Cooley is done from the login machines (hostname cooley.alcf.anl.gov) using the Cobalt batch scheduler. These machines are accessed via ssh with two-factor authentication. Thus, submission and management of Cooley jobs must be done by a human who has a login account.

Before submitting Cooley jobs, a CHTC staff member must configure the SoftEnv system so that the proper CUDA libraries can be found by HTCondor and user jobs. To do so, they must create the file ~/.soft.cooley with the following contents:

+mvapich2
+cuda-9.0.176
@default

Once logged into a Cooley login machine, a CHTC staff member can submit a new hobble-in job with a command like this:

% /home/jfrey/hobblein/bin/hobblein_submit 2 24:00:00 Test1 jfrey
1337326
%

This will submit a hobble-in job that uses 2 machines, runs for 24 hours, has an annex name of Test1, and will only run jobs owned by user jfrey. If the username argument is omitted, then the hobble-in will run jobs from any user.
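
The argument order can be summarized with a small sketch that builds the command line as a list. The helper name hobblein_args and its validation rules are assumptions made for illustration; only the script path and the order of the four arguments come from the example above.

```python
import re


def hobblein_args(nodes, walltime, annex_name, owner=None,
                  script="/home/jfrey/hobblein/bin/hobblein_submit"):
    """Build the hobblein_submit command line as an argument list."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Walltime is written as HH:MM:SS, as in the example above.
    if not re.fullmatch(r"\d+:[0-5]\d:[0-5]\d", walltime):
        raise ValueError("walltime must look like HH:MM:SS")

    args = [script, str(nodes), walltime, annex_name]
    # Without an owner, the hobble-in runs jobs from any user.
    if owner:
        args.append(owner)
    return args


print(" ".join(hobblein_args(2, "24:00:00", "Test1", "jfrey")))
print(" ".join(hobblein_args(2, "24:00:00", "Test1")))
```

On a Cooley login machine the resulting list could be passed to subprocess.run; the sketch only prints it, so it can be tried anywhere.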

The command will print out the Cobalt job id (1337326 in this case). The status of the job can be checked via Cobalt with this command:

% qstat
JobID    User   WallTime  Nodes  State    Location
=======================================================
1337326  jfrey  00:10:00  2      running  cc024,cc124
%

You can check the status of all jobs you've submitted as well:

% qstat -u jfrey
JobID    User   WallTime  Nodes  State     Location
========================================================
1337314  jfrey  00:10:00  2      exiting   cc028,cc031
1337326  jfrey  00:10:00  2      starting  cc024,cc124
%
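
If the job states need to be tracked in a script, the qstat listing can be split into fields. The sketch below parses the sample output shown above; it assumes the whitespace-separated column layout of that sample, which may differ with other qstat options or Cobalt versions. The function name is hypothetical.

```python
# Sample text copied from the qstat transcript above.
QSTAT_SAMPLE = """\
JobID    User   WallTime  Nodes  State     Location
========================================================
1337314  jfrey  00:10:00  2      exiting   cc028,cc031
1337326  jfrey  00:10:00  2      starting  cc024,cc124
"""


def parse_qstat(text):
    """Turn qstat's table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    jobs = []
    for ln in lines[1:]:
        if set(ln.strip()) == {"="}:
            continue  # skip the ===== separator row
        jobs.append(dict(zip(header, ln.split())))
    return jobs


jobs = parse_qstat(QSTAT_SAMPLE)
states = {job["JobID"]: job["State"] for job in jobs}
print(states)
```

The Location field holds the comma-separated node names, so job["Location"].split(",") gives the machines a hobble-in job occupies.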

You can also delete a job from Cobalt:

% qdel 1337326
      Deleted Jobs
JobID    User
================
1337326  jfrey
%

While the hobble-in job is running at Cooley, machine ads will appear in the CHTC annex collector. They can be queried using the annex name specified to hobblein_submit.

% condor_status -pool annex-cm.chtc.wisc.edu -annex Test1
Name                         OpSys      Arch   State     Activity LoadAv Mem

slot1@cc075.fst.alcf.anl.gov LINUX      X86_64 Unclaimed Idle      0.000 387137
slot1@cc081.fst.alcf.anl.gov LINUX      X86_64 Unclaimed Idle      0.000 387137

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX        2     0       0         2       0          0      0

         Total        2     0       0         2       0          0      0
%
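
A quick way to confirm that every requested machine has reported in is to count the distinct hosts in that listing. The sketch below works on the sample output above and assumes the default column layout shown there; the function name is hypothetical.

```python
# Slot lines copied from the condor_status transcript above.
STATUS_SAMPLE = """\
Name                         OpSys      Arch   State     Activity LoadAv Mem

slot1@cc075.fst.alcf.anl.gov LINUX      X86_64 Unclaimed Idle      0.000 387137
slot1@cc081.fst.alcf.anl.gov LINUX      X86_64 Unclaimed Idle      0.000 387137
"""


def annex_slots(text):
    """Extract name, state, and activity for each slot line."""
    slots = []
    for ln in text.splitlines():
        fields = ln.split()
        # Slot lines start with slotN@hostname; header and totals do not.
        if len(fields) >= 5 and "@" in fields[0]:
            slots.append({"name": fields[0],
                          "state": fields[3],
                          "activity": fields[4]})
    return slots


slots = annex_slots(STATUS_SAMPLE)
hosts = sorted({slot["name"].split("@", 1)[1] for slot in slots})
print(len(hosts), hosts)
```

If the number of hosts is lower than the node count given to hobblein_submit, some startds may not have started yet or may have failed to reach the collector.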

The startds can be shut down in the same way as for a regular annex. Due to the current security configuration, this must be done as the root user on annex-cm.chtc.wisc.edu.

# condor_off -annex Test1
Sent "Kill-Daemon" command for "master" to master jfrey@cc081.fst.alcf.anl.gov
Sent "Kill-Daemon" command for "master" to master jfrey@cc075.fst.alcf.anl.gov
#