Fermilab Condor-G Hands-On

fnpcsrv1.fnal.gov has been configured for these hands-on exercises. You should be able to log in with your normal Fermilab Kerberos credentials. We'll need your Kerberos credential on fnpcsrv1.fnal.gov, so be sure to pass "-f" to kinit to create a forwardable credential.

Notation

Example text	What you see on your screen.
condor_q	Something to type in.
[Ctrl+D]	Press Ctrl and D together instead of typing these characters.
universe = grid	Marks a change from something seen before for comparison.
adesmet	Replace this as documented above the example.

Generating an x509 proxy

Before you can do anything, you need an "x509 proxy" to identify yourself to grid resources. Some organizations, like the DOE, have you register with them, and their Certificate Authority then issues you an x509 certificate. Fermilab instead maintains its own system, which issues x509 certificates based on your Kerberos credentials.

Summary

kx509
kxlist -p
source /usr/local/vdt/setup.sh
voms-proxy-init -noregen -voms fermilab:/fermilab

What is going on

First, we'll set up the VDT (Virtual Data Toolkit) tools. This gives us access to the voms-proxy-* tools that we'll use later:

$ source /usr/local/vdt/setup.sh

Note that if you're using csh, tcsh, or another csh-derived shell, use this command instead:

$ source /usr/local/vdt/setup.csh
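
After sourcing the setup script, it can be worth confirming that the tools used in the rest of this section are actually on your PATH. The following is a minimal sketch for sh-style shells; the function name require_tools is made up for this example and is not part of the VDT.

```shell
# Report any of the named commands that are not on PATH.
# Returns 0 if all were found, 1 otherwise.
# (require_tools is a hypothetical helper, not a VDT command.)
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The tools used in the rest of this section.
require_tools kx509 kxlist voms-proxy-init voms-proxy-info \
    || echo "Some tools were not found; did you source the VDT setup script?" >&2
```

If anything is reported missing, source the setup script again before continuing.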

Now that we're set up, let's see what Kerberos credentials we have.

$ klist
Ticket cache: /tmp/krb5cc_10389_e15383
Default principal: adesmet@FNAL.GOV

Valid starting     Expires            Service principal
07/23/10 22:19:03  07/25/10 00:08:40  krbtgt/FNAL.GOV@FNAL.GOV
        renew until 07/30/10 22:08:40
07/23/10 22:19:04  07/25/10 00:08:40  afs@FNAL.GOV
        renew until 07/30/10 22:08:40

Now we generate an x509 certificate:

$ kx509

And let's check to see what changed:

$ klist
Ticket cache: /tmp/krb5cc_10389_e15383
Default principal: adesmet@FNAL.GOV

Valid starting     Expires            Service principal
07/23/10 22:19:03  07/25/10 00:08:40  krbtgt/FNAL.GOV@FNAL.GOV
        renew until 07/30/10 22:08:40
07/23/10 22:19:04  07/25/10 00:08:40  afs@FNAL.GOV
        renew until 07/30/10 22:08:40
07/23/10 22:19:19  07/25/10 00:08:40  kca_service/winserver.fnal.gov@FNAL.GOV
        renew until 07/30/10 22:08:40
07/22/10 22:19:19  07/30/10 22:08:40  kx509/certificate@FNAL.GOV

We can look at the x509 certificate that was generated.

$ kxlist
Service kx509/certificate
 issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
 subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
 serial=01C05555
 hash=e7635e83

voms-proxy-info gives us information about our x509 proxy. At the moment we don't have anything:

$ voms-proxy-info

Couldn't find a valid proxy.

We generated an x509 certificate, but it's in your Kerberos cache, which isn't useful to us. Fortunately, kxlist can generate a proxy for us:

$ kxlist -p
Service kx509/certificate
 issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
 subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
 serial=01C05555
 hash=e7635e83

We can now look at it:

$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: VOMS extension not found!
subject   : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
issuer    : /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
identity  : /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
type      : unknown
strength  : 1024 bits
path      : /tmp/x509up_u10389
timeleft  : 167:48:40

"VOMS extension not found!" That's no good. It doesn't have the VOMS information associated with it. Fortunately we can use this proxy to create a new one with the VOMS information. "-noregen" means "use my existing proxy to make the new one."

$ voms-proxy-init -noregen -voms fermilab:/fermilab
Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
Contacting  voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done
Creating proxy ............................................................... Done
Your proxy is valid until Sat Jul 24 10:20:13 2010

We now have a proxy with the VOMS information attached!

$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot verify AC signature!
subject   : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet/CN=proxy
issuer    : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
identity  : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
type      : proxy
strength  : 1024 bits
path      : /tmp/x509up_u10389
timeleft  : 11:59:54
=== VO fermilab extension information ===
VO        : fermilab
subject   : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
issuer    : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/nees/Role=NULL/Capability=NULL
timeleft  : 11:59:53
uri       : voms.fnal.gov:15001

Testing Globus

We can use globusrun to test if a Globus gatekeeper is responding and willing to accept our proxy:

$ globusrun -a -r fgitbgkc2.fnal.gov/jobmanager-condor

GRAM Authentication test successful

We can use globus-job-run to run executables on the remote server. Note that the executable we want to run must already be on the remote server. globus-job-run will block until the executable finishes, returning the output.

$ globus-job-run fgitbgkc2.fnal.gov/jobmanager-fork /bin/date
Fri Jul 23 22:47:09 CDT 2010
$ globus-job-run fgitbgkc2.fnal.gov/jobmanager-fork /bin/hostname
fgitbgkc2.fnal.gov

Running Jobs By Hand

Feel free to try this later, but right now it's not recommended. The final step can take a long time to finish.

First, submit our job. The output is an identifier for our job.

$ globus-job-submit fgitbgkc2.fnal.gov/jobmanager-condor /bin/date 
https://fgitbgkc2.fnal.gov:47738/29075/1279943334/

Now call globus-job-status every once in a while until it is "DONE". Replace the URL below with the URL returned by globus-job-submit.

$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
PENDING
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
PENDING
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
ACTIVE
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
DONE
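
Rather than rerunning the status command by hand, the polling can be sketched as a small loop. The names poll_until_done, POLL_INTERVAL, and MAX_POLLS are invented for this example. Treating FAILED as a final state is an assumption, since it does not appear in the transcript above.

```shell
# Run a status command repeatedly until it prints DONE.
# Returns 0 on DONE, 1 on FAILED (assumed terminal state), 2 if we give up.
# POLL_INTERVAL and MAX_POLLS are hypothetical knobs for this sketch.
poll_until_done() {
    tries=0
    while [ "$tries" -lt "${MAX_POLLS:-120}" ]; do
        status=$("$@")
        echo "$status"
        case "$status" in
            DONE)   return 0 ;;
            FAILED) return 1 ;;
        esac
        tries=$((tries + 1))
        sleep "${POLL_INTERVAL:-30}"
    done
    return 2
}

# Usage (replace the URL with the one returned by globus-job-submit):
#   poll_until_done globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
```

The attempt limit is there so that a typo in the URL or a missing command does not leave the loop running forever.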

When it's done, we can fetch the output back:

$ globus-job-get-output https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
Fri Jul 23 22:48:55 CDT 2010

Finally, we should free up the resources we have allocated on the remote server. Note that this step can take a long time!

$ globus-job-clean https://fgitbgkc2.fnal.gov:47738/29075/1279943334/

    WARNING: Cleaning a job means:
        - Kill the job if it still running, and
        - Remove the cached output on the remote resource

    Are you sure you want to cleanup the job now (Y/N) ?
y
Cleanup successful.

This example doesn't send your executable or any data files over, and doesn't retrieve any output files. Doing so adds even more complexity. Condor-G can worry about those details for you!

Running a "vanilla" Condor job

Make a scratch directory for this work:

$ mkdir condor-test
$ cd condor-test

Create a test executable. Note that where it says "[Ctrl+D]", you need to press Ctrl+D, not type those words.

$ cat > sample-job
#! /bin/sh
/bin/date
/bin/hostname
/bin/echo Sleeping for $1 seconds
/bin/sleep $1
/bin/cat sample-input > sample-output
/bin/echo Done
[Ctrl+D]
$ chmod a+x sample-job
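
If you would rather not rely on Ctrl+D, a shell here-document should produce the same file. This is a sketch of an alternative, not a required step. Quoting the delimiter stops the shell from expanding $1 while the file is being written.

```shell
# Write the same script with a here-document. The quoted delimiter ('EOF')
# keeps $1 literal, so it is expanded when the job runs, not now.
cat > sample-job <<'EOF'
#! /bin/sh
/bin/date
/bin/hostname
/bin/echo Sleeping for $1 seconds
/bin/sleep $1
/bin/cat sample-input > sample-output
/bin/echo Done
EOF
chmod a+x sample-job
```

The same approach works for the submit files later in this hands-on.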

Create a sample input file.

$ echo This is a test file > sample-input

Verify that it works:

$ ./sample-job 1
Fri Jul 23 22:57:31 CDT 2010
fnpcsrv1.fnal.gov
Sleeping for 1 seconds
Done
$ cat sample-output
This is a test file

And clean up:

$ rm sample-output

Describe our job for Condor. The spaces lining everything up are optional.

$ cat > sample-submit
universe     = scheduler
notification = never
executable   = sample-job
arguments    = 60
output       = sample-out
error        = sample-err
log          = sample-log
transfer_input_files = sample-input
when_to_transfer_output = ON_EXIT
queue
[Ctrl+D]

What's going on in that submit file?

Command	Description
universe = scheduler	We specified "universe=scheduler". Normally you will want "universe=vanilla"; vanilla jobs run as normal on your Condor cluster. Depending on how busy your cluster is, that may take a while. Furthermore, for this hands-on we're not actually connected to a full cluster, so vanilla jobs won't ever run. So we use "scheduler", a simplified universe that runs the job on the local computer immediately. Do not normally use the scheduler universe! If not specified, you get the "standard" universe, which has very special requirements.
notification = never	Don't send us email about how our job is doing. If not specified, Condor will try to send you email if your job completes or has a problem.
executable = sample-job	Required. The name of the executable we want to run.
arguments = 60	Command line arguments to pass to the executable. If not specified, no arguments will be passed in.
output = sample-out	Anything written to standard output will be placed into this file. If not specified, output will be discarded.
error = sample-err	Anything written to standard error will be placed into this file. If not specified, error messages will be discarded.
log = sample-log	Write notes about how your job is doing to this file. If not specified, you don't get these log messages. This is highly recommended, as the log file can help diagnose problems.
transfer_input_files = sample-input	These files will be copied to the execute computer and made available to your program. If not specified, no files are copied over.
when_to_transfer_output = ON_EXIT	Required if transfer_input_files is specified. Tells Condor to transfer the job's output files back when the job is done.
queue	Required. Tells Condor you're done.

So, tell Condor about our job!

$ condor_submit sample-submit
Submitting job(s).
1 job(s) submitted to cluster 3256325.
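
The cluster number in that last line is what you pass to commands such as condor_history later on. If you want to capture it in a script, one possible approach is to pull it out of the condor_submit output with sed. This sketch assumes the message format shown above; cluster_id is a made-up helper name.

```shell
# Read condor_submit output on stdin and print the cluster number.
# Assumes the "N job(s) submitted to cluster M." line shown above.
# (cluster_id is a hypothetical helper name.)
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Example using the line from the transcript above.
echo "1 job(s) submitted to cluster 3256325." | cluster_id    # prints 3256325
```

In practice you might write something like: id=$(condor_submit sample-submit | cluster_id)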

You can check the status of your jobs with condor_q. Be sure to replace adesmet with your own username. If you omit the username, you'll see everyone's jobs. This might take a few seconds; the computer you are using is a busy submit node.

$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256325.0   adesmet         7/23 23:15   0+00:00:08 R  0   0.0  sample-job        

1 jobs; 1 idle, 0 running, 0 held

The "ST" or "State" of "R" indicates your job is running. Other states include:

Short	Long	Explanation
I	Idle	The job is waiting for an available computer to run on
R	Running	The job is running
C	Completed	The job is done
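
If you post-process condor_q or condor_history output in a script, the one-letter codes can be mapped back to their long names. The following is a small sketch; state_name is a made-up function, and the H (held) and X (removed) codes are taken from the examples later in this hands-on rather than from the table above.

```shell
# Map a one-letter job state to its long name.
# I, R, C come from the table above; H and X appear later in this hands-on.
# (state_name is a hypothetical helper name.)
state_name() {
    case "$1" in
        I) echo "Idle" ;;
        R) echo "Running" ;;
        C) echo "Completed" ;;
        H) echo "Held" ;;
        X) echo "Removed" ;;
        *) echo "Unknown ($1)" ;;
    esac
}

state_name R    # prints Running
```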

When your job finishes, it will disappear from the list:

$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

It's not gone, it's just moved to the history. The "cluster" or "ID" you saw above listed for your job can be used to ask about the old job:

$ condor_history 3256325


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
3256325.0   adesmet         7/24 20:18   0+00:01:00 C   7/24 20:19 /home/adesmet/c

As you can see, your job is "C" (completed), and it ran for 1 minute.

By the way, the 3256325? That indicates that this computer has had 3,256,325 Condor jobs submitted since Condor was installed. This computer has managed a lot of work!

You can examine sample-output and sample-out to see the output. You can examine sample-log to see the progress your job made.

$ cat sample-out
Sat Jul 24 20:18:18 CDT 2010
fnpcsrv1.fnal.gov
Sleeping for 60 seconds
Done
$ cat sample-output
This is a test file
$ cat sample-log
000 (3256539.000.000) 07/24 20:18:18 Job submitted from host: <131.225.167.44:65528>
...
001 (3256539.000.000) 07/24 20:18:18 Job executing on host: <131.225.167.44:65528>
...
005 (3256539.000.000) 07/24 20:19:18 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
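
The log is also a convenient place to find out how the job exited. As a rough sketch, assuming the "Normal termination (return value N)" wording shown in the 005 event above, the return value can be extracted like this. The helper name job_return_value is made up for the example.

```shell
# Print the return value recorded in a Condor user log, taken from the last
# "Normal termination (return value N)" line. Prints nothing if the job has
# not terminated normally. (job_return_value is a hypothetical helper name.)
job_return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1" \
        | tail -n 1
}

# Usage: job_return_value sample-log
```

A non-zero value here usually means your executable reported an error, which is worth checking before you trust the output files.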

Clean up our output files before going on:

$ rm sample-out sample-output sample-log sample-err

Moving your job to the Grid

We can submit the same job to the grid with a few simple changes to the submit file. Create a new submit file; it differs from sample-submit in the universe, grid_resource, and transfer_output_files lines:

$ cat > sample-submit-grid
universe     = grid
grid_resource = gt2 fgitbgkc2.fnal.gov/jobmanager-fork
notification = never
executable   = sample-job
arguments    = 60
output       = sample-out
error        = sample-err
log          = sample-log
transfer_input_files = sample-input
when_to_transfer_output = ON_EXIT
transfer_output_files = sample-output
queue
[Ctrl+D]

What did we change?

Command	Description
universe = grid	This tells Condor we want to submit to another batch system, probably on a different computer. If specified, grid_resource becomes mandatory.
grid_resource = gt2 fgitbgkc2.fnal.gov/jobmanager-fork	Required if universe=grid. Specifies that we want to hand our job off to a Globus Toolkit version 2 gatekeeper (gt2). The gatekeeper is known as "fgitbgkc2.fnal.gov/jobmanager-fork". We are using jobmanager-fork to speed things up for this hands-on. You would typically use a different name assigned by the Globus site's maintainer. For example, "fgitbgkc2.fnal.gov/jobmanager-condor" is where jobs would normally go on fgitbgkc2. Do not use jobmanager-fork for real work!
transfer_output_files = sample-output	Normally Condor will identify any files created by your job and will bring them back. For the Grid universe, this is no longer true. You must explicitly list any files you want back when your job finishes. Be warned, if you specify a file and it's not there, your job will fail!
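
Because the grid submit file differs from the earlier one in only these three places, it could also be derived mechanically instead of being retyped. The following is just a sketch using awk; make_grid_submit is a made-up name, and the gatekeeper and output file names are the ones used in this hands-on.

```shell
# Read a submit description on stdin and write the grid version on stdout:
# replace the universe line, add grid_resource after it, and add
# transfer_output_files just before queue.
# (make_grid_submit is a hypothetical helper name.)
make_grid_submit() {
    awk '
        /^universe[ \t]*=/ {
            print "universe     = grid"
            print "grid_resource = gt2 fgitbgkc2.fnal.gov/jobmanager-fork"
            next
        }
        /^queue/ { print "transfer_output_files = sample-output" }
        { print }
    '
}

# Usage: make_grid_submit < sample-submit > sample-submit-grid
```

Comparing the result with diff sample-submit sample-submit-grid is a quick way to confirm that only the intended lines changed.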

Submit it, same as before.

$ condor_submit sample-submit-grid
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3256541.

Watch it until it finishes.

$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256541.0   adesmet         7/24 20:46   0+00:00:48 R  0   0.0  sample-job 60     

1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256541.0   adesmet         7/24 20:46   0+00:01:16 R  0   0.0  sample-job 60     

1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
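
Watching by hand means rerunning condor_q until the summary line reads "0 jobs". In a script, the job count could be read from that summary line, as in this sketch. It assumes the summary format shown above; jobs_in_queue is a made-up helper name.

```shell
# Read condor_q output on stdin and print the job count from the summary
# line ("N jobs; ... idle, ... running, ... held").
# (jobs_in_queue is a hypothetical helper name.)
jobs_in_queue() {
    sed -n 's/^\([0-9][0-9]*\) jobs; .*/\1/p'
}

# Possible use, replacing adesmet with your own username:
#   while [ "$(condor_q adesmet | jobs_in_queue)" != "0" ]; do sleep 30; done
```

Keep the sleep interval generous; this submit node is busy, and frequent condor_q calls add to its load.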

And check that you got your output:

$ cat sample-out
Sat Jul 24 20:46:57 CDT 2010
fgitbgkc2.fnal.gov
Sleeping for 60 seconds
Done
$ cat sample-output
This is a test file

That's it!

Other Useful Commands

Submit another job to experiment on with the commands below. You might want to edit sample-submit first and increase the argument to 600 so you have plenty of time to experiment before the job finishes.

$ condor_submit sample-submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3256545.

We can pause the job with condor_hold.

$ condor_hold 3256545
Cluster 3256545 held.
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256545.0   adesmet         7/24 20:56   0+00:00:17 R  0   0.0  sample-job 60     

1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256545.0   adesmet         7/24 20:56   0+00:00:10 H  0   0.0  sample-job 60     

1 jobs; 0 idle, 0 running, 1 held

Normally condor_hold is nearly instantaneous. However, this submit host is very busy and uses a Condor extension called "Quill" to help manage the load. A side effect of Quill is that updates can take a little while to show up in condor_q.

Once a job is on hold, you can release it with condor_release.

$ condor_release 3256545
Cluster 3256545 released.
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256545.0   adesmet         7/24 20:56   0+00:00:10 H  0   0.0  sample-job 60     

1 jobs; 0 idle, 0 running, 1 held
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3256545.0   adesmet         7/24 20:56   0+00:00:30 R  0   0.0  sample-job 60     

1 jobs; 0 idle, 1 running, 0 held

Finally, you can cancel a job entirely with condor_rm.

$ condor_rm 3256545
Cluster 3256545 has been marked for removal.
$ condor_q adesmet


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone : 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
$ condor_history 3256545


-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
3256545.0   adesmet         7/24 20:56   0+00:00:36 X   ???        /home/adesmet/c

The "X" indicates that the job was removed.