Creating Condor Pools At Amazon

One approach to using cloud resources to run HTCondor jobs is to use the condor_annex tool to expand an existing pool onto the cloud (see HowToUseCondorAnnexWithOnDemandInstances). Another approach, documented here, is to create a new HTCondor pool entirely in the cloud. Although this approach doesn't allow jobs already in a queue to run on cloud resources, it has the advantage that all file transfer between the queue and the execute nodes occurs within the cloud, which could lead to substantial performance improvements and cost reductions.

The HTCondor team maintains an AWS Marketplace entry to help simplify the process of creating an HTCondor pool at Amazon.

To follow these instructions you must already have an AWS account.

Log into AWS

The first step is to log in to AWS.

Overview

The general approach will be to use the Marketplace entry to start a head node, which will be the brains of the new HTCondor pool as well as the machine you log in to and submit jobs from. Once the head node is up and running, you'll use condor_annex to add cloud resources to your new pool. Then you can start running jobs, and when they're done, shut everything down.

  1. Launch HTCondor in the Cloud
    1. Create a Key Pair
    2. Start a Head Node
  2. Add Cloud Resources to Your New Pool
    1. Log into your Head Node
    2. Obtain an Access Key
    3. Prepare Account
    4. Add Cloud Resources
  3. Run Jobs
  4. Clean Up
    1. The Cloud Resources
    2. The Head Node

1 Launch HTCondor in the Cloud

In this section, you'll launch HTCondor in the Cloud by starting a head node. A head node needs an address and, for security, a lock. AWS will automatically provide the address, but you need to do a little work to make sure the lock is one for which you have a key. For technical reasons, AWS refers to these lock/key pairs as just "key pairs". If you already have one (if, for instance, you're following these instructions a second time), you can skip to section 1.2, but creating another key pair won't cause problems.

1.1 Create a Key Pair

  1. Go to the EC2 key pair console.
  2. Click the blue "Create Key Pair" button in the upper left.
  3. Enter a name; "HTCondorKeyPair" would be fine. Click the blue "Create" button.
  4. Your browser will probably bring up a dialog box asking you what to do with "HTCondorKeyPair.pem". (It may just start saving it for you.) Save it some place you won't accidentally delete it and make a note of the location.

When you later connect to your head node, you'll need to know the location of "HTCondorKeyPair.pem" so you can specify that file.

1.2 Start a Head Node

  1. Open HTCondor's Marketplace entry in another tab.
  2. Click the orange 'Continue to Subscribe' button.
  3. Click the orange 'Accept Terms' button to actually subscribe.
  4. You'll have to wait a while before you can click on the orange 'Continue to Configuration' button.
  5. Click the orange 'Continue to Launch' button to accept the default configuration.
  6. Scroll down to 'Security Group Settings' and click on the 'Create New Based On Seller Settings' button. You'll have to name and describe the security group; "HTCondor Security Group" would be a fine name. "Allows SSH and HTCondor from anywhere" is a true description. (It is more secure to change the two drop-down boxes from 'Anywhere' to 'My Ip', but that doesn't always work, especially if you're on a laptop.) Finally, click 'Save'.
  7. Scroll down to 'Key Pair Settings' and make sure the key pair you created in section 1.1 is selected.
  8. You will start spending money when you click on the orange 'Launch' button.
  9. Click on the 'Your Software' link in the green box. Leave that tab open; you'll need it later.

2 Add Cloud Resources to Your New Pool

Your head node (by default) starts with two CPUs and 8 GiB of RAM. Immediately after you log in -- see section 2.1 -- you can start submitting and running jobs (see section 3). However, with only two CPUs, you'll only be able to run two jobs at a time. To add cloud resources to your pool, you'll use condor_annex. To do that, you'll need to obtain an access key for condor_annex, so that it can interact with AWS on your behalf. You'll only need to do that once. Likewise, condor_annex has to do some one-time set-up for each account. Once that's done, you can run condor_annex to add cloud resources as often as you like.

2.1 Log into your Head Node

  1. Find the HTCondor entry (in the tab from section 1.2) and click the 'View Instances' button.
  2. There should only be one instance; click on the "Manage in AWS Console" link. This will bring up the EC2 console with your head node selected.
  3. Right-click on the selected instance and select 'Connect'. Follow the instructions, except that the username is 'ec2-user', not 'root'.
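
The 'Connect' instructions boil down to a single ssh command. The following is a minimal sketch, assuming an OpenSSH client on your laptop or desktop; the function name, the key location, and the host name in the example line are placeholders rather than values from this page. OpenSSH refuses to use a private key file that other users can read, so the sketch tightens the file's permissions first.

```shell
# Print the ssh command for logging into the head node as 'ec2-user'.
#   $1: path to the .pem file you saved in section 1.1
#   $2: the head node's public DNS name or IP address (from the EC2 console)
head_node_ssh_command() {
    key_file=$1
    host=$2
    # OpenSSH ignores private keys that other users are able to read.
    chmod 400 "$key_file" || return 1
    # Quote the path yourself when running the command if it contains spaces.
    printf 'ssh -i %s ec2-user@%s\n' "$key_file" "$host"
}

# Example (hypothetical path and host name):
#   head_node_ssh_command ~/Downloads/HTCondorKeyPair.pem ec2-203-0-113-10.compute-1.amazonaws.com
```

Run the printed command from the machine where you saved "HTCondorKeyPair.pem".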

2.2 Obtain an Access Key

Just being able to log into an EC2 instance doesn't give you the privilege to start additional EC2 instances. In order to add cloud resources to your new pool, HTCondor needs a pair of security tokens (like a user name and password). Like a user name, the "access key" is (more or less) public information; the corresponding "secret key" is like a password and must be kept a secret. To help keep both halves secret, you never tell HTCondor these keys directly; instead, you tell HTCondor which file to look in to find each one.

Create those two files now; we'll tell you how to fill them in shortly. By convention, these files exist in your ~/.condor directory. In this document, shaded boxes indicate typing in a terminal; you should copy the lines beginning with $, but don't include the $. The other lines shown in the boxes are the responses you should expect to see.

$ cd ~/.condor
$ touch publicKeyFile privateKeyFile
$ chmod 600 publicKeyFile privateKeyFile

The last command ensures that only you can read or write to those files.
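
If you want to double-check the result, the following sketch inspects the mode string printed by ls. It is only an illustration: the function name is made up, and it relies on nothing beyond standard ls and cut.

```shell
# Succeed only if none of the named files grants any permission to
# group or other users (characters 5-10 of the 'ls -l' mode string).
only_owner_can_access() {
    for f in "$@"; do
        mode=$(ls -ld -- "$f") || return 1
        if [ "$(printf '%s' "$mode" | cut -c5-10)" != "------" ]; then
            echo "$f is accessible to other users" >&2
            return 1
        fi
    done
}

# Example:
#   only_owner_can_access ~/.condor/publicKeyFile ~/.condor/privateKeyFile && echo "private"
```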

If you saved your security tokens from the last time you used condor_annex, copy them into the files you just created and skip to section 2.3.

To download a new pair of security tokens for condor_annex to use, go to the IAM console.

2.2.1 Privileged (Administrator) Accounts

The following instructions assume you are logged in as a user with the privilege to create new users. (The 'root' user for any account has this privilege; other accounts may as well.) If your account has more limited privileges, skip to section 2.2.2.

  1. Click the "Add User" button.
  2. Enter a name in the User name box; "annex-user" is a fine choice.
  3. Click the check box labelled "Programmatic access".
  4. Click the button labelled "Next: Permissions".
  5. Select "Attach existing policies directly".
  6. Type "AdministratorAccess" in the box labelled "Filter".
  7. Click the check box on the single line that will appear below (labelled "AdministratorAccess").
  8. Click the "Next: review" button (you may need to scroll down).
  9. Click the "Create user" button.
  10. From the line labelled "annex-user", copy the value in the column labelled "Access key ID" to publicKeyFile. Also copy this value to your laptop or desktop computer; you'll want to have the value if you use condor_annex again.
  11. On the line labelled "annex-user", click the "Show" link in the column labelled "Secret access key"; copy the revealed value to privateKeyFile. Also copy this value to your laptop or desktop computer; you'll want to have the value if you use condor_annex again.
  12. Hit the "Close" button.

The 'annex-user' now has full privileges to your account. We're working on creating a CloudFormation template that will create a user with only the privileges condor_annex actually needs. Skip to section 2.3.

2.2.2 Non-Privileged (Non-Administrator) Accounts

If you're using an account with limited privileges, your administrator may have already given you the credentials. If not, you may be able to create credentials for yourself at the IAM console.

  1. Click on your user name.
  2. Click on the "security credentials" tab.
  3. Click the "Create access key" button.
  4. Copy the value in the column labelled "Access key ID" to publicKeyFile. Also copy this value to your laptop or desktop computer; you'll want to have the value if you use condor_annex again.
  5. Click the "Show" link in the column labelled "Secret access key"; copy the revealed value to privateKeyFile. Also copy this value to your laptop or desktop computer; you'll want to have the value if you use condor_annex again.
  6. Hit the "Close" button.
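
Whichever procedure you followed, one way to get a copied value into its file without opening an editor is sketched below. The function name is hypothetical. It reads the value from standard input, so the token never appears in your shell history, and it writes the file with a trailing newline, just as most text editors would.

```shell
# Read one line from standard input and store it in the named file.
# If the file does not exist yet, it is created readable and writable
# by its owner only.
save_token() {
    target=$1
    (
        umask 077
        # Accept a final line with or without a trailing newline,
        # but refuse empty input.
        IFS= read -r token || [ -n "$token" ] || exit 1
        printf '%s\n' "$token" > "$target"
    )
}

# Example: run each command, paste the value, and press Enter.
#   save_token ~/.condor/publicKeyFile
#   save_token ~/.condor/privateKeyFile
```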

2.3 Prepare your Account

The following command will prepare your AWS account for condor_annex. It will create a number of persistent components, none of which will cost you anything to keep around. These components can take quite some time to create; condor_annex checks each for completion every ten seconds and prints an additional dot (past the first three) each time, to let you know that everything's still working.

$ condor_annex -setup
Creating configuration bucket (this takes less than a minute)....... complete.
Creating Lambda functions (this takes about a minute)........ complete.
Creating instance profile (this takes about two minutes)................... complete.
Creating security group (this takes less than a minute)..... complete.
Setup successful.

2.3.1 Verify Account Preparation

You can verify at this point (or any later time) that the account-preparation procedure completed successfully by running the following command.

$ condor_annex -check-setup
Checking for configuration bucket... OK.
Checking for Lambda functions... OK.
Checking for instance profile... OK.
Checking for security group... OK.
Your setup looks OK.
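
If you'd like to perform this check from a script, one option is to look for the final line of the output shown above. This is only a sketch: the function name is made up, and it assumes the wording of that line is exactly as shown here; it makes no assumption about the exit status of condor_annex.

```shell
# Read the output of 'condor_annex -check-setup' on standard input and
# succeed only if it contains the success line shown above.
setup_looks_ok() {
    grep -q -F 'Your setup looks OK.'
}

# Example:
#   condor_annex -check-setup | setup_looks_ok && echo "ready to add resources"
```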

2.4 Add Cloud Resources

Run the following command; if you type 'yes', condor_annex will add ten instances to the pool for no more than 24 hours:

$ condor_annex -count 10 -duration 24 -annex-name MyFirstAnnex
Will request 10 m4.large on-demand instance for 24 hours.  Each instance will terminate after being idle for 0.25 hours.
Is that OK?  (Type 'yes' or 'no'): yes
Starting annex...
Annex started.  Its identity with the cloud provider is
'MyFirstAnnex_f2923fd1-3cad-47f3-8e19-fff9988ddacf'.  It will take about three minutes for the new machines to join the pool.
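
Before typing 'yes', it can help to work out the most this request could cost. The sketch below multiplies the instance count by the duration to get an upper bound in instance-hours, and optionally by an hourly price. The function names are hypothetical, and the price is something you would have to look up yourself for your region and instance type; it is not given on this page. The bound covers only the annex instances, not the head node or its storage.

```shell
# Upper bound on instance-hours for one annex request: count * duration.
#   $1: value passed to -count, $2: value passed to -duration (hours)
max_instance_hours() {
    echo $(( $1 * $2 ))
}

# Upper bound on cost in cents.
#   $3: price per instance-hour in cents (you must supply this yourself)
max_cost_cents() {
    echo $(( $(max_instance_hours "$1" "$2") * $3 ))
}

# For the request above:
#   max_instance_hours 10 24    # prints 240
```

Because each instance terminates after being idle for 0.25 hours, the actual usage is often lower than this bound.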

Read the complete introduction for more information; skip the "Using HTCondor Annex for the First Time" section, since you already have.

Complete documentation is also available.

3 Run Jobs

If you're new to HTCondor, you should consult the quick start guide .

You may also wish to read the slides from our user tutorial.

  • To run on your new resources, a job's submit file must contain the following line:

+MayUseAWS = TRUE

  • The new resources do not share a file system with the head node, so you'll need to use file transfer:

should_transfer_files = TRUE
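
Putting the two lines above into context, the sketch below writes a small test job into a directory of your choosing. The file names, the script, and the function name are all hypothetical; the only lines taken from this page are the two required ones, which are copied verbatim.

```shell
# Write a small submit file, and the script it runs, into directory $1.
write_example_job() {
    dir=$1
    # The job itself: report which machine it ran on.
    cat > "$dir/hello.sh" <<'EOF'
#!/bin/sh
echo "Hello from $(hostname)"
EOF
    chmod +x "$dir/hello.sh"
    # The submit file; paths are relative to the submit directory.
    cat > "$dir/hello.sub" <<'EOF'
executable = hello.sh
output = hello.out
error = hello.err
log = hello.log

# The two lines this section requires, copied verbatim.
+MayUseAWS = TRUE
should_transfer_files = TRUE
when_to_transfer_output = ON_EXIT

queue
EOF
}

# Example, run on the head node:
#   write_example_job . && condor_submit hello.sub
```

When the job finishes, hello.out should name the machine that ran it, which tells you whether it landed on the head node or on one of the new instances.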

4 Clean Up

4.1 The Cloud Resources

One of the benefits of using condor_annex is that the cloud resources it acquires will automatically terminate after a certain amount of time (24 hours in the example above). This happens even if the instance is running a job at the time, to make sure that misbehaving jobs don't cause you to spend more than you intended. Additionally, if at any time it's been too long (15 minutes by default) since an instance ran a job, the instance will shut itself down to save you money. However, if you'd like to shut down the instances early, you can do so using the condor_off command, replacing MyFirstAnnex with the name of the annex you'd like to shut down:

$ condor_off -annex MyFirstAnnex

4.2 The Head Node

Unlike the cloud resources added by condor_annex, the head node does not shut itself down; you'll need to clean it up yourself. If you don't want to keep any of your changes, then you should "terminate" the head node to avoid paying for storage. If you just want to save money and pick up where you left off a bit later, you should instead "stop" the head node; you'll still pay to keep its disk around until you start the head node again. Both options are under "Instance State" if you right-click on the instance in the EC2 console.