fnpcsrv1.fnal.gov has been configured for these hands-on exercises. You should be able to log in with your normal Fermilab Kerberos credentials. We'll need your Kerberos credential on fnpcsrv1.fnal.gov, so be sure to pass "-f" to kinit to create a forwardable credential.
Example | Meaning |
---|---|
Example text | What you see on your screen. |
condor_q | Something to type in. |
[Ctrl+D] | Do not type these characters; press Ctrl and D together. |
universe = grid | Marks a change from something seen before, for comparison. |
adesmet | Replace this as documented above the example. |
Before you can do anything, you need an "x509 proxy" to identify yourself to grid resources. Some organizations, like the DOE, have you register with them, and their Certificate Authority then issues you an x509 certificate. Fermilab, however, maintains its own system that can issue x509 certificates based on your Kerberos credentials.
kx509
kxlist -p
source /usr/local/vdt/setup.sh
voms-proxy-init -noregen -voms fermilab:/fermilab
First, we'll set up the VDT (Virtual Data Toolkit) tools. This gives us access to the voms-proxy-* tools that we'll use later:
$ source /usr/local/vdt/setup.sh
Note that if you're using csh, tcsh, or another csh-derived shell, you will instead use this command:
$ source /usr/local/vdt/setup.csh
Now that we're set up, let's see what Kerberos credentials we have.
$ klist
Ticket cache: /tmp/krb5cc_10389_e15383
Default principal: adesmet@FNAL.GOV
Valid starting Expires Service principal
07/23/10 22:19:03 07/25/10 00:08:40 krbtgt/FNAL.GOV@FNAL.GOV
    renew until 07/30/10 22:08:40
07/23/10 22:19:04 07/25/10 00:08:40 afs@FNAL.GOV
    renew until 07/30/10 22:08:40
Now we generate an x509 certificate:
$ kx509
And let's check to see what changed:
$ klist
Ticket cache: /tmp/krb5cc_10389_e15383
Default principal: adesmet@FNAL.GOV
Valid starting Expires Service principal
07/23/10 22:19:03 07/25/10 00:08:40 krbtgt/FNAL.GOV@FNAL.GOV
    renew until 07/30/10 22:08:40
07/23/10 22:19:04 07/25/10 00:08:40 afs@FNAL.GOV
    renew until 07/30/10 22:08:40
07/23/10 22:19:19 07/25/10 00:08:40 kca_service/winserver.fnal.gov@FNAL.GOV
    renew until 07/30/10 22:08:40
07/22/10 22:19:19 07/30/10 22:08:40 kx509/certificate@FNAL.GOV
We can look at the x509 certificate that was generated.
$ kxlist
Service kx509/certificate
issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
serial=01C05555
hash=e7635e83
voms-proxy-info gives us information about our x509 proxy. At the moment we don't have anything:
$ voms-proxy-info
Couldn't find a valid proxy.
We generated an x509 certificate, but it's in your Kerberos cache, which isn't useful to us. Fortunately, kxlist can generate a proxy for us:
$ kxlist -p
Service kx509/certificate
issuer= /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
subject= /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
serial=01C05555
hash=e7635e83
We can now look at it:
$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: VOMS extension not found!
subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
issuer : /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
identity : /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM
type : unknown
strength : 1024 bits
path : /tmp/x509up_u10389
timeleft : 167:48:40
"VOMS extension not found!" That's no good. It doesn't have the VOMS information associated with it. Fortunately we can use this proxy to create a new one with the VOMS information. "-noregen" means "use my existing proxy to make the new one."
$ voms-proxy-init -noregen -voms fermilab:/fermilab
Your identity: /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
Contacting voms.fnal.gov:15001 [/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov] "fermilab" Done
Creating proxy ............................................................... Done
Your proxy is valid until Sat Jul 24 10:20:13 2010
We now have a proxy with the VOMS information attached!
$ voms-proxy-info -all
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot verify AC signature!
subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet/CN=proxy
issuer : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
identity : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
type : proxy
strength : 1024 bits
path : /tmp/x509up_u10389
timeleft : 11:59:54
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Alan A. De smet/CN=UID:adesmet
issuer : /DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/nees/Role=NULL/Capability=NULL
timeleft : 11:59:53
uri : voms.fnal.gov:15001
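A script that submits grid jobs may want to know how much lifetime the proxy has left before it starts. As a minimal sketch, the function below converts the first "timeleft" line, in the format shown above, into seconds. The name proxy_seconds_left is a hypothetical helper, not part of VDT or VOMS, and the demo input is copied from the output above rather than produced by a live voms-proxy-info call.

```shell
#!/bin/sh
# proxy_seconds_left: hypothetical helper (not part of VDT or VOMS).
# Reads "voms-proxy-info -all" style text on stdin and prints the first
# "timeleft" value converted from H:M:S to seconds.
proxy_seconds_left() {
    awk '/^timeleft[ \t]*:/ {
        sub(/^timeleft[ \t]*:[ \t]*/, "")
        split($0, t, ":")
        print t[1] * 3600 + t[2] * 60 + t[3]
        exit
    }'
}

# Demo on text copied from the example output above; prints 43194.
sample='type : proxy
timeleft : 11:59:54
VO : fermilab
timeleft : 11:59:53'
printf '%s\n' "$sample" | proxy_seconds_left
```

On a machine with the tools set up, the same function could presumably be fed directly: voms-proxy-info -all | proxy_seconds_left.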
We can use globusrun to test if a Globus gatekeeper is responding and willing to accept our proxy:
$ globusrun -a -r fgitbgkc2.fnal.gov/jobmanager-condor
GRAM Authentication test successful
We can use globus-job-run to run executables on the remote server. Note that the executable we want to run must already be on the remote server. globus-job-run will block until the executable finishes, returning the output.
$ globus-job-run fgitbgkc2.fnal.gov/jobmanager-fork /bin/date
Fri Jul 23 22:47:09 CDT 2010
$ globus-job-run fgitbgkc2.fnal.gov/jobmanager-fork /bin/hostname
fgitbgkc2.fnal.gov
The following globus-job-submit walkthrough is worth trying later, but it's not recommended right now because its final step can take a long time to finish.
First, submit our job. The output is an identifier for our job.
$ globus-job-submit fgitbgkc2.fnal.gov/jobmanager-condor /bin/date
https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
Now call globus-job-status every once in a while until it is "DONE". Replace the URL below with the URL returned by globus-job-submit.
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
PENDING
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
PENDING
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
ACTIVE
$ globus-job-status https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
DONE
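Rather than re-running the status command by hand, the polling can be scripted. The sketch below is one hedged way to do it: wait_for_done, POLL_INTERVAL, and MAX_POLLS are hypothetical names introduced here, and treating a FAILED status as an error is an assumption, since only PENDING, ACTIVE, and DONE appear in the example above.

```shell
#!/bin/sh
# wait_for_done CMD [ARGS...]: hypothetical helper.
# Re-runs the given status command until it prints DONE.
# POLL_INTERVAL (seconds) and MAX_POLLS are made-up knobs for this sketch.
# Returns 0 on DONE, 1 on FAILED (assumed status), 2 if the status command
# itself fails, 3 if MAX_POLLS is reached first.
wait_for_done() {
    interval=${POLL_INTERVAL:-30}
    max_polls=${MAX_POLLS:-120}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        status=$("$@") || return 2
        case "$status" in
            DONE) return 0 ;;
            FAILED) return 1 ;;
        esac
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 3
}

# Intended use, with the URL returned by globus-job-submit in $job_url:
#   wait_for_done globus-job-status "$job_url" && globus-job-get-output "$job_url"
```

Passing the status command as arguments keeps the loop independent of Globus, so it can be tried out with any command that prints a status word.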
When it's done, we can fetch the output back:
$ globus-job-get-output https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
Fri Jul 23 22:48:55 CDT 2010
Finally, we should free up the resources we have allocated on the remote server. Note that this step can take a long time!
$ globus-job-clean https://fgitbgkc2.fnal.gov:47738/29075/1279943334/
WARNING: Cleaning a job means:
    - Kill the job if it still running, and
    - Remove the cached output on the remote resource
Are you sure you want to cleanup the job now (Y/N) ?
y
Cleanup successful.
This example doesn't send your executable or any data files over, and doesn't retrieve any output files. Doing so adds even more complexity. Condor-G can worry about those details for you!
Make a scratch directory for this work:
$ mkdir condor-test
$ cd condor-test
Create a test executable. Note that where it says "[Ctrl+D]", you need to press Ctrl+D, not type those characters.
$ cat > sample-job
#! /bin/sh
/bin/date
/bin/hostname
/bin/echo Sleeping for $1 seconds
/bin/sleep $1
/bin/cat sample-input > sample-output
/bin/echo Done
[Ctrl+D]
$ chmod a+x sample-job
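If you would rather create the file from a script than type it interactively, a here-document does the same job without the Ctrl+D step. This is a sketch of an alternative, not something the exercise requires; the file contents are the same as above.

```shell
#!/bin/sh
# Write sample-job without interactive typing. Quoting the delimiter ('EOF')
# stops the shell from expanding $1 while the file is being written, so the
# script receives a literal $1.
cat > sample-job <<'EOF'
#! /bin/sh
/bin/date
/bin/hostname
/bin/echo Sleeping for $1 seconds
/bin/sleep $1
/bin/cat sample-input > sample-output
/bin/echo Done
EOF
chmod a+x sample-job
```

With an unquoted delimiter (<<EOF), the shell would substitute its own $1 while writing the file, which is usually empty, and the job would lose its sleep argument.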
Create a sample input file.
$ echo This is a test file > sample-input
Verify that it works:
$ ./sample-job 1
Fri Jul 23 22:57:31 CDT 2010
fnpcsrv1.fnal.gov
Done
$ cat sample-output
This is a test file
And clean up:
$ rm sample-output
Describe our job for Condor. The spaces lining everything up are optional.
$ cat > sample-submit
universe                = scheduler
notification            = never
executable              = sample-job
arguments               = 60
output                  = sample-out
error                   = sample-err
log                     = sample-log
transfer_input_files    = sample-input
when_to_transfer_output = ON_EXIT
queue
[Ctrl+D]
What's going on in that submit file?
Command | Value | Description |
---|---|---|
universe | = scheduler | We specified "universe=scheduler". Normally you will want "universe=vanilla". Vanilla jobs run as normal on your Condor cluster. Depending on how busy your cluster is, that may take a while. Furthermore, for this hands-on, we're not actually connected to a full cluster, so vanilla jobs won't ever run. So we'll use "scheduler", which is a simplified version that runs on the local computer immediately. Do not normally use the scheduler universe! If not specified, you get the "standard" universe, which has very special requirements. |
notification | = never | Don't send us email about how our job is doing. If not specified, Condor will try to send you email if your job completes or has a problem. |
executable | = sample-job | Required. The name of the executable we want to run. |
arguments | = 60 | Command line arguments to pass to the executable. If not specified, no arguments will be passed in. |
output | = sample-out | Anything written to standard output will be placed into this file. If not specified, output will be discarded. |
error | = sample-err | Anything written to standard error will be placed into this file. If not specified, error messages will be discarded. |
log | = sample-log | Write notes about how your job is doing to this file. If not specified, you don't get these log messages. This is highly recommended, as the log file can help diagnose problems. |
transfer_input_files | = sample-input | These files will be copied to the execute computer and made available to your program. If not specified, no files are copied over. |
when_to_transfer_output | = ON_EXIT | Required if transfer_input_files is specified. Tells Condor to transfer the job's output files back when the job exits. |
queue | | Required. Tells Condor you're done describing the job, and queues it. |
So, tell Condor about our job!
$ condor_submit sample-submit
Submitting job(s).
1 job(s) submitted to cluster 3256325.
You can check the status of your jobs with condor_q. Be sure to replace adesmet with your own username. If you omit the username, you'll see everyone's jobs. This might take a few seconds; the computer you are using is a busy submit node.
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256325.0 adesmet 7/23 23:15 0+00:00:08 R 0 0.0 sample-job
1 jobs; 1 idle, 0 running, 0 held
An "ST" (state) of "R" indicates your job is running. The states you are most likely to see are:
Short | Long | Explanation |
---|---|---|
I | Idle | The job is waiting for an available computer to run on |
R | Running | The job is running |
C | Completed | The job is done |
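When scripting around condor_q, it can help to pull the state out of the listing and translate it. The following is a rough sketch that assumes the column layout shown above, where ST is the sixth whitespace-separated field. The names state_name and job_state are hypothetical helpers, and the H and X codes are the ones that show up later in this tutorial.

```shell
#!/bin/sh
# state_name CODE: hypothetical helper mapping the one-letter ST code to a name.
state_name() {
    case "$1" in
        I) echo "Idle" ;;
        R) echo "Running" ;;
        C) echo "Completed" ;;
        H) echo "Held" ;;       # seen later in this tutorial (condor_hold)
        X) echo "Removed" ;;    # seen later in this tutorial (condor_rm)
        *) echo "Unknown ($1)" ;;
    esac
}

# job_state ID: reads condor_q style output on stdin and prints the ST
# column (6th field) of the first row whose ID matches exactly.
job_state() {
    awk -v id="$1" '($1 "") == (id "") { print $6; exit }'
}

# Demo on a row copied from the condor_q output above; prints "Running".
row='3256325.0 adesmet 7/23 23:15 0+00:00:08 R 0 0.0 sample-job'
state_name "$(printf '%s\n' "$row" | job_state 3256325.0)"
```

The string concatenation in the awk comparison forces a textual match on the ID, so "3256325.0" is not confused with a numerically equal value.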
When your job finishes, it will disappear from the list:
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
It's not gone, it's just moved to the history. The "cluster" or "ID" you saw above listed for your job can be used to ask about the old job:
$ condor_history 3256325
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
3256325.0 adesmet 7/24 20:18 0+00:01:00 C 7/24 20:19 /home/adesmet/c
As you can see, your job is "C" (completed), and it ran for 1 minute.
By the way, the 3256325? That indicates that this computer has had 3,256,325 Condor jobs submitted since Condor was installed. This computer has managed a lot of work!
You can examine sample-output and sample-out to see the output. You can examine sample-log to see the progress your job made.
$ cat sample-out
Sat Jul 24 20:18:18 CDT 2010
fnpcsrv1.fnal.gov
Sleeping for 60 seconds
Done
$ cat sample-output
This is a test file
$ cat sample-log
000 (3256539.000.000) 07/24 20:18:18 Job submitted from host: <131.225.167.44:65528>
...
001 (3256539.000.000) 07/24 20:18:18 Job executing on host: <131.225.167.44:65528>
...
005 (3256539.000.000) 07/24 20:19:18 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
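The log is also convenient for scripts, because the termination event records the job's exit status. As a sketch, assuming the "Normal termination (return value N)" wording shown above, the hypothetical helper job_return_value extracts that number; the demo input is copied from the log above.

```shell
#!/bin/sh
# job_return_value: hypothetical helper. Reads a Condor user log on stdin and
# prints the return value from the first "Normal termination" line, if any.
# Prints nothing if the job has not terminated normally yet.
job_return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Demo on lines copied from the sample-log shown above; prints 0.
printf '%s\n' \
    '005 (3256539.000.000) 07/24 20:19:18 Job terminated.' \
    '    (1) Normal termination (return value 0)' |
    job_return_value
```

Against the real file this would be run as job_return_value < sample-log.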
Clean up our output files before going on:
$ rm sample-out sample-output sample-log sample-err
We can submit the same job to the grid with a few simple changes to the submit file. The new file below differs from sample-submit in three lines: universe, grid_resource, and transfer_output_files.
$ cat > sample-submit-grid
universe                = grid
grid_resource           = gt2 fgitbgkc2.fnal.gov/jobmanager-fork
notification            = never
executable              = sample-job
arguments               = 60
output                  = sample-out
error                   = sample-err
log                     = sample-log
transfer_input_files    = sample-input
when_to_transfer_output = ON_EXIT
transfer_output_files   = sample-output
queue
[Ctrl+D]
What did we change?
Command | Value | Description |
---|---|---|
universe | = grid | This tells Condor we want to submit to another batch system, probably on a different computer. If specified, grid_resource becomes mandatory. |
grid_resource | = gt2 fgitbgkc2.fnal.gov/jobmanager-fork | Required if universe=grid. Specifies that we want to hand our job off to a Globus Toolkit version 2 gatekeeper (gt2). The gatekeeper is known as "fgitbgkc2.fnal.gov/jobmanager-fork". We are using jobmanager-fork to speed things up for this hands-on. You would typically use a different name assigned by the Globus site's maintainer. For example, "fgitbgkc2.fnal.gov/jobmanager-condor" is where jobs would normally go on fgitbgkc2. Do not use jobmanager-fork for real work! |
transfer_output_files | = sample-output | Normally Condor will identify any files created by your job and will bring them back. For the grid universe, this is no longer true. You must explicitly list any files you want back when your job finishes. Be warned: if you specify a file and it's not there, your job will fail! |
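To see exactly which lines differ, you can derive the grid submit file from the local one and diff the two. The sketch below recreates sample-submit in a scratch directory (an assumption made so that the example is self-contained and leaves your working files alone) and uses awk to apply the three changes described in the table.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The local submit file from earlier in this tutorial.
cat > sample-submit <<'EOF'
universe = scheduler
notification = never
executable = sample-job
arguments = 60
output = sample-out
error = sample-err
log = sample-log
transfer_input_files = sample-input
when_to_transfer_output = ON_EXIT
queue
EOF

# Swap the universe line, add grid_resource right after it, and add
# transfer_output_files just before "queue".
awk '
    /^universe[ \t]*=/ {
        print "universe = grid"
        print "grid_resource = gt2 fgitbgkc2.fnal.gov/jobmanager-fork"
        next
    }
    /^queue/ { print "transfer_output_files = sample-output" }
    { print }
' sample-submit > sample-submit-grid

# diff exits with status 1 when the files differ, which is expected here.
diff sample-submit sample-submit-grid || true
```

The diff output lists one changed line and two added lines, matching the three rows of the table.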
Submit it, same as before.
$ condor_submit sample-submit-grid
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3256541.
Watch it until it finishes.
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256541.0 adesmet 7/24 20:46 0+00:00:48 R 0 0.0 sample-job 60
1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256541.0 adesmet 7/24 20:46 0+00:01:16 R 0 0.0 sample-job 60
1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
And check that you got your output:
$ cat sample-out
Sat Jul 24 20:46:57 CDT 2010
fgitbgkc2.fnal.gov
Sleeping for 60 seconds
Done
$ cat sample-output
This is a test file
That's it!
Submit another job to experiment on with the commands below. You might want to edit sample-submit first and increase the argument to 600, so you have plenty of time to experiment before the job finishes.
$ condor_submit sample-submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 3256545.
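The commands that follow all take the cluster number, so a script may want to capture it once instead of retyping it. As a sketch, assuming condor_submit keeps the "submitted to cluster N." wording shown above, the hypothetical helper cluster_id_from_submit pulls the number out; the demo input is copied from the output above.

```shell
#!/bin/sh
# cluster_id_from_submit: hypothetical helper. Reads condor_submit output on
# stdin and prints the cluster number from the "submitted to cluster N." line.
cluster_id_from_submit() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Demo on the output shown above; prints 3256545.
cluster=$(printf '%s\n' \
    'Submitting job(s).' \
    'Logging submit event(s).' \
    '1 job(s) submitted to cluster 3256545.' | cluster_id_from_submit)
echo "$cluster"
```

In a real session this might be used as cluster=$(condor_submit sample-submit | cluster_id_from_submit), after which the later commands can refer to "$cluster".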
We can pause the job with condor_hold.
$ condor_hold 3256545
Cluster 3256545 held.
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256545.0 adesmet 7/24 20:56 0+00:00:17 R 0 0.0 sample-job 60
1 jobs; 0 idle, 1 running, 0 held
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256545.0 adesmet 7/24 20:56 0+00:00:10 H 0 0.0 sample-job 60
1 jobs; 0 idle, 0 running, 1 held
Normally condor_hold is nearly instantaneous. However, this submit host is very busy and is using a Condor extension called "Quill" to help manage the load. A side effect of Quill is that updates can take a little while to show up in condor_q.
Once a job is on hold, you can release it with condor_release.
$ condor_release 3256545
Cluster 3256545 released.
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256545.0 adesmet 7/24 20:56 0+00:00:10 H 0 0.0 sample-job 60
1 jobs; 0 idle, 0 running, 1 held
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
3256545.0 adesmet 7/24 20:56 0+00:00:30 R 0 0.0 sample-job 60
1 jobs; 0 idle, 1 running, 0 held
Finally, you can cancel a job entirely with condor_rm.
$ condor_rm 3256545
Cluster 3256545 has been marked for removal.
$ condor_q adesmet
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone :
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$ condor_history 3256545
-- Quill: quillone@fnpcsrv1.fnal.gov : <131.225.167.44:5432> : quillone
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
3256545.0 adesmet 7/24 20:56 0+00:00:36 X ??? /home/adesmet/c
The "X" indicates that the job was removed.