Debugging Tools¶
HTCondor-CE has its own separate set of the HTCondor tools with ce in the name (e.g., condor_ce_submit vs condor_submit). Some of the commands are specific to the CE (e.g., condor_ce_run and condor_ce_trace), but many of them are simply HTCondor commands configured to interact with the CE (e.g., condor_ce_q, condor_ce_status). It is important to differentiate the two: condor_ce_config_val will provide configuration values for your HTCondor-CE, while condor_config_val will provide configuration values for your HTCondor batch system. If you are not running an HTCondor batch system, the non-CE commands will return errors.
condor_ce_trace¶
Usage¶
condor_ce_trace is a useful tool for testing end-to-end job submission. It contacts both the CE’s Schedd and Collector daemons to verify your permission to submit to the CE, displays the submit script that it submits to the CE, and tracks the resultant job.
Note
You must have generated a proxy (e.g., voms-proxy-init) and your DN must be added to your chosen authentication method.
user@host $ condor_ce_trace condorce.example.com
Replace condorce.example.com with the hostname of the CE. If you are familiar with the output of HTCondor commands, note that the command also takes a --debug option that displays verbose HTCondor output.
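For example, to run the same trace with verbose output:
user@host $ condor_ce_trace --debug condorce.example.com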
Troubleshooting¶
- If the command fails with “Failed ping…”: Make sure that the HTCondor-CE daemons are running on the CE (a quick check is shown below)
- If you see “gsi@unmapped” in the “Remote Mapping” line: Either your credentials are not mapped on the CE or authentication is not set up at all. To set up authentication, refer to our configuration document.
- If the job submits but does not complete: Look at the status of the job and perform the relevant troubleshooting steps.
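For the first item above, a quick way to confirm that the CE daemons are up is to run the status command (described later in this document) on the CE itself:
user@host $ condor_ce_status -any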
condor_ce_host_network_check¶
Usage¶
condor_ce_host_network_check is a tool for testing an HTCondor-CE's networking configuration:
root@host # condor_ce_host_network_check
Starting analysis of host networking for HTCondor-CE
System hostname: fermicloud360.fnal.gov
FQDN matches hostname
Forward resolution of hostname fermicloud360.fnal.gov is 131.225.155.96.
Backward resolution of IPv4 131.225.155.96 is fermicloud360.fnal.gov.
Forward and backward resolution match!
HTCondor is considering all network interfaces and addresses.
HTCondor would pick address of 131.225.155.96 as primary address.
HTCondor primary address 131.225.155.96 matches system preferred address.
Host network configuration should work with HTCondor-CE
Troubleshooting¶
If the tool reports that “Host network configuration not expected to work with HTCondor-CE”, ensure that forward and reverse DNS resolution return the public IP and hostname.
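To check this by hand, standard DNS tools work; for example, using the hostname and address from the sample output above (host is a common DNS lookup utility and may need to be installed separately):
user@host $ host fermicloud360.fnal.gov
user@host $ host 131.225.155.96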
condor_ce_run¶
Usage¶
condor_ce_run submits a simple job to your CE, so it is useful for quickly verifying that jobs can be submitted through your CE. To submit a job to the CE and run the env command on the remote batch system:
Note
You must have generated a proxy (e.g., voms-proxy-init) and your DN must be added to your chosen authentication method.
user@host $ condor_ce_run -r condorce.example.com:9619 /bin/env
Replace condorce.example.com with the hostname of the CE.
If you are troubleshooting an HTCondor-CE that you do not have a login for and the CE accepts local universe jobs, you can run commands locally on the CE with condor_ce_run using the -l option. The following example outputs the JobRouterLog of the CE in question:
user@host $ condor_ce_run -lr condorce.example.com:9619 cat /var/log/condor-ce/JobRouterLog
Replace condorce.example.com with the hostname of the CE. To disable this feature on your CE, consult this section of the install documentation.
Troubleshooting¶
- If you do not see any results: condor_ce_run does not display results until the job completes on the CE, which may take several minutes or longer if the CE is busy. In the meantime, you can use condor_ce_q in a separate terminal to track the job on the CE. If you never see any results, use condor_ce_trace to pinpoint errors.
- If you see an error message that begins with “Failed to…”: Check connectivity to the CE with condor_ce_trace or condor_ce_ping
condor_ce_submit¶
See this documentation for details.
condor_ce_ping¶
Usage¶
Use the following condor_ce_ping command to test your ability to submit jobs to an HTCondor-CE, replacing condorce.example.com with the hostname of your CE:
user@host $ condor_ce_ping -verbose -name condorce.example.com -pool condorce.example.com:9619 WRITE
The following shows successful output where I am able to submit jobs (Authorized: TRUE) as the glow user (Remote Mapping: glow@users.opensciencegrid.org):
Remote Version: $CondorVersion: 8.0.7 Sep 24 2014 $
Local Version: $CondorVersion: 8.0.7 Sep 24 2014 $
Session ID: condorce:27407:1412286981:3
Instruction: WRITE
Command: 60021
Encryption: none
Integrity: MD5
Authenticated using: GSI
All authentication methods: GSI
Remote Mapping: glow@users.opensciencegrid.org
Authorized: TRUE
Note
If you run the condor_ce_ping command on the CE that you are testing, omit the -name and -pool options. condor_ce_ping takes the same arguments as condor_ping, which is documented in the HTCondor manual.
Troubleshooting¶
- If you see “ERROR: couldn’t locate (null)”, the HTCondor-CE schedd (the daemon that schedules jobs) cannot be reached. To track down the issue, increase the debugging levels on the CE (a sketch of applying these settings follows this list):
MASTER_DEBUG = D_ALWAYS:2 D_CAT
SCHEDD_DEBUG = D_ALWAYS:2 D_CAT
- If you see “gsi@unmapped” in the “Remote Mapping” line, either your credentials are not mapped on the CE or authentication is not set up at all. To set up authentication, refer to our configuration document.
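A minimal sketch of applying the debug settings above: drop them into a local CE configuration file and reconfigure. The /etc/condor-ce/config.d/ path and the 99-debug.conf file name are assumptions based on a standard packaged install; adjust for your system.
root@host # cat > /etc/condor-ce/config.d/99-debug.conf <<EOF
MASTER_DEBUG = D_ALWAYS:2 D_CAT
SCHEDD_DEBUG = D_ALWAYS:2 D_CAT
EOF
root@host # condor_ce_reconfig
Then check the MasterLog and SchedLog under /var/log/condor-ce/ for errors.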
condor_ce_q¶
Usage¶
condor_ce_q can display job status or specific job attributes for jobs that are still in the CE’s queue.
To list jobs that are queued on a CE:
user@host $ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619
To inspect the full ClassAd for a specific job, specify the -l flag and the job ID:
user@host $ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619 -l <JOB-ID>
Note
If you run the condor_ce_q command on the CE that you are testing, omit the -name and -pool options. condor_ce_q takes the same arguments as condor_q, which is documented in the HTCondor manual.
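As a sketch, condor_q's autoformat option can be used through condor_ce_q to print just a few attributes instead of the full ClassAd; the attribute names here are only illustrative:
user@host $ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619 <JOB-ID> -af JobStatus HoldReason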
Troubleshooting¶
If the jobs that you are submitting to a CE are not completing, condor_ce_q can tell you the status of your jobs.
- If the schedd is not running: You will see a lengthy message about being unable to contact the schedd. To track down the issue, increase the debugging levels on the CE with:
MASTER_DEBUG = D_ALWAYS:2 D_CAT
SCHEDD_DEBUG = D_ALWAYS:2 D_CAT
To apply these changes, reconfigure HTCondor-CE:
root@host # condor_ce_reconfig
Then look in the MasterLog and SchedLog on the CE for any errors.
- If there are issues with contacting the collector: You will see the following message:
user@host $ condor_ce_q -pool ce1.accre.vanderbilt.edu -name ce1.accre.vanderbilt.edu
-- Failed to fetch ads from: <129.59.197.223:9620?sock=33630_8b33_4> : ce1.accre.vanderbilt.edu
This may be due to network issues or bad HTCondor daemon permissions. To fix the latter issue, ensure that the ALLOW_READ configuration value is not set:
user@host $ condor_ce_config_val -v ALLOW_READ
Not defined: ALLOW_READ
If it is defined, remove it from the file that is returned in the output.
- If a job is held: There should be an accompanying HoldReason that will tell you why it is being held. The HoldReason is in the job’s ClassAd, so you can use the long form of condor_ce_q to extract its value:
user@host $ condor_ce_q -name condorce.example.com -pool condorce.example.com:9619 -l <Job ID> | grep HoldReason
- If a job is idle: The most common cause is that it is not matching any routes in the CE’s job router. To find out whether this is the case, use condor_ce_job_router_info.
condor_ce_history¶
Usage¶
condor_ce_history can display job status or specific job attributes for jobs that have left the CE’s queue.
To list jobs that have run on the CE:
user@host $ condor_ce_history -name condorce.example.com -pool condorce.example.com:9619
To inspect the full ClassAd for a specific job, specify the -l flag and the job ID:
user@host $ condor_ce_history -name condorce.example.com -pool condorce.example.com:9619 -l <Job ID>
Note
If you run the condor_ce_history command on the CE that you are testing, omit the -name and -pool options. condor_ce_history takes the same arguments as condor_history, which is documented in the HTCondor manual.
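As a sketch, condor_history-style options can narrow the output; for example, -constraint to select completed jobs and -limit to cap the number of results (option availability depends on your HTCondor version):
user@host $ condor_ce_history -constraint 'JobStatus == 4' -limit 10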
condor_ce_job_router_info¶
Usage¶
Use the condor_ce_job_router_info command to help troubleshoot your routes and how jobs will match to them.
To see all of your routes:
root@host # condor_ce_job_router_info -config
To see how the job router is handling a job that is currently in the CE’s queue, analyze the output of condor_ce_q (replace <JOB-ID> with the job ID that you are interested in):
root@host # condor_ce_q -l <JOB-ID> | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
To inspect a job that has already left the queue, use condor_ce_history instead of condor_ce_q:
root@host # condor_ce_history -l <JOB-ID> | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
Note
If the proxy for the job has expired, the job will not match any routes. To work around this constraint:
root@host # condor_ce_history -l <JOB-ID> | sed "s/^\(x509UserProxyExpiration\) = .*/\1 = `date +%s --date '+1 sec'`/" | condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads -
Alternatively, you can provide a file containing a job’s ClassAd as the input and edit attributes within that file:
root@host # condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads <JOBAD-FILE>
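A minimal sketch of that workflow using only the commands shown above; the job.ad file name is arbitrary:
root@host # condor_ce_q -l <JOB-ID> > job.ad
root@host # $EDITOR job.ad   # adjust attributes such as x509UserProxyExpiration
root@host # condor_ce_job_router_info -match-jobs -ignore-prior-routing -jobads job.ad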
Troubleshooting¶
- If the job does not match any route: You can identify this case when you see 0 candidate jobs found in the condor_ce_job_router_info output. This message means that, when compared to your job’s ClassAd, the Umbrella constraint does not evaluate to true. When troubleshooting, look at all of the expressions prior to the target.ProcId >= 0 expression, because it and everything following it is logic that the job router added so that routed jobs do not get routed again.
- If your job matches more than one route: The tool will tell you by showing all matching routes after the job ID:
Checking Job src=162,0 against all routes
Route Matches: Local_PBS
Route Matches: Condor_Test
To troubleshoot why this is occurring, look at the combined Requirements expression for all routes and compare it to the job’s ClassAd. The combined Requirements expression is shown below:
Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && (target.x509UserProxyExpiration =!= UNDEFINED) && (time() < target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && ( (target.osgTestPBS is true) || (true) ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "Extenal" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
Both routes evaluate to true for the job’s ClassAd because it contained osgTestPBS = true. Make sure your routes are mutually exclusive, otherwise you may have jobs routed incorrectly! See the job route configuration page for more details.
- If it is unclear why jobs are matching a route: Wrap the route's requirements expression in debug() and check the JobRouterLog for more information (see the sketch below).
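As a hypothetical illustration (not taken from the original documentation), this is what wrapping a route's Requirements in debug() might look like in the older ClassAd-style JOB_ROUTER_ENTRIES syntax; the route name and the osgTestPBS attribute are only examples:
JOB_ROUTER_ENTRIES @=jre
  [
    name = "Condor_Test";
    TargetUniverse = 5;
    Requirements = debug(target.osgTestPBS is true);
  ]
@jre
After a condor_ce_reconfig, the evaluation details should appear in the JobRouterLog.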
condor_ce_router_q¶
Usage¶
If you have multiple job routes and many jobs, condor_ce_router_q is a useful tool to see how jobs are being routed and their statuses:
user@host $ condor_ce_router_q
condor_ce_router_q takes the same options as condor_router_q, which is documented in the HTCondor manual.
condor_ce_test_token¶
Usage¶
Use the condor_ce_test_token command to test SciTokens authentication in the CE. It will create a token with an issuer and subject that you specify and configure the CE daemons to accept that token as if it had been generated by the given issuer (for one hour). The token is printed to stdout; use it with condor_ce_submit to test that SciTokens authentication and user mapping operate correctly.
To create a temporary SciToken that appears to be issued by the SciTokens demo issuer:
root@host # condor_ce_test_token --issuer https://demo.scitokens.org --audience ANY --scope condor:/WRITE --subject alice@foo.edu
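A sketch of one way to exercise the token from the CE itself (not taken from the original documentation): save it to a file and let the client tools pick it up. BEARER_TOKEN_FILE (WLCG bearer-token discovery) and the _condor_ environment override are assumptions about your HTCondor version; condor_ce_submit can be used similarly once the token is in place.
root@host # condor_ce_test_token --issuer https://demo.scitokens.org --audience ANY --scope condor:/WRITE --subject alice@foo.edu > /tmp/ce_test_token
root@host # BEARER_TOKEN_FILE=/tmp/ce_test_token _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS condor_ce_ping -verbose WRITE   # check the Remote Mapping line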
Note
You must run condor_ce_test_token on the CE that you are testing as the root user. condor_ce_test_token takes the same arguments as condor_test_token, which is documented in the HTCondor manual.
condor_ce_status¶
Usage¶
To see the daemons running on a CE, run the following command:
user@host $ condor_ce_status -any
condor_ce_status takes the same arguments as condor_status, which is documented in the HTCondor manual.
"Missing" Worker Nodes
An HTCondor-CE will not show any worker nodes (e.g. Machine
entries in the condor_ce_status -any
output) if
it does not have any running GlideinWMS pilot jobs.
This is expected since HTCondor-CE only forwards incoming pilot jobs to your batch system and does not match jobs to
worker nodes.
Troubleshooting¶
If the output of condor_ce_status -any does not show at least the following daemons:
- Collector
- Scheduler
- DaemonMaster
- Job_Router
Increase the debug level and consult the HTCondor-CE logs for errors.
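On a standard packaged install the CE logs live under /var/log/condor-ce/; the exact file names below are typical but may differ on your system:
root@host # ls /var/log/condor-ce/
root@host # grep -i error /var/log/condor-ce/MasterLog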
condor_ce_config_val¶
Usage¶
To see the value of configuration variables and where they are set, use condor_ce_config_val. This tool is primarily used with the other troubleshooting tools to make sure your configuration is set properly.
To see the value of a single variable and where it is set:
user@host $ condor_ce_config_val -v <CONFIGURATION-VARIABLE>
To see a list of all configuration variables and their values:
user@host $ condor_ce_config_val -dump
To see a list of all the files that are used to create your configuration and the order in which they are parsed, use the following command:
user@host $ condor_ce_config_val -config
condor_ce_config_val takes the same arguments as condor_config_val, which is documented in the HTCondor manual.
condor_ce_reconfig¶
Usage¶
To ensure that your configuration changes have taken effect, run condor_ce_reconfig:
user@host $ condor_ce_reconfig
condor_ce_{on,off,restart}¶
Usage¶
To turn on/off/restart HTCondor-CE daemons, use the following commands:
root@host # condor_ce_on
root@host # condor_ce_off
root@host # condor_ce_restart
The HTCondor-CE service uses the previous commands with default values. Using these commands directly gives you finer-grained control over how HTCondor-CE is turned on, off, or restarted:
- If you have installed a new version of HTCondor-CE and want to restart the CE under the new version, run the following command:
root@host # condor_ce_restart -fast
This will cause HTCondor-CE to restart and quickly reconnect to all running jobs.
- If you need to stop running new jobs, run the following:
root@host # condor_ce_off -peaceful
This will cause HTCondor-CE to accept new jobs without starting them and will wait for currently running jobs to complete before shutting down.
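For routine start/stop/restart with the default behavior, the packaged service can be used instead; this assumes a systemd host where the unit is named condor-ce:
root@host # systemctl restart condor-ce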
Getting Help¶
If you have any questions or issues about troubleshooting remote HTCondor-CEs, please contact us for assistance.