Skip to content

Troubleshooting Remote HTCondor-CEs

Since HTCondor-CE is built on top of HTCondor, it's possible to perform quite a bit of troubleshooting from a remote client with access to the HTCondor command-line tools. For testing end-to-end resource request submission or remote interactive-like access, a grid operator will need access to a host with the htcondor-ce-client installed. This document outlines the steps that a grid operator can perform in order to troubleshoot a remote HTCondor-CE.

Verifying Network Connectivity

Before performing any troubleshooting of the remote HTCondor-CE service, it's important to verify that the HTCondor-CE can be contacted on its HTCondor-CE port (default: 9619) at the specified fully qualified domain name (FQDN).

Verifying DNS

As noted in the HTCondor-CE installation document, an HTCondor-CE must have forward and reverse DNS records. To verify DNS, use a tool like nslookup:

$ nslookup htcondor-ce.chtc.wisc.edu
Server:     144.92.254.254
Address:    144.92.254.254#53

Non-authoritative answer:
Name:   htcondor-ce.chtc.wisc.edu
Address: 128.104.100.65
Name:   htcondor-ce.chtc.wisc.edu
Address: 2607:f388:107c:501:216:3eff:fe89:aa3

$ nslookup 128.104.100.65
65.100.104.128.in-addr.arpa name = htcondor-ce.chtc.wisc.edu.

Authoritative answers can be found from:
104.128.in-addr.arpa    nameserver = dns2.itd.umich.edu.
104.128.in-addr.arpa    nameserver = adns1.doit.wisc.edu.
104.128.in-addr.arpa    nameserver = adns3.doit.wisc.edu.
104.128.in-addr.arpa    nameserver = adns2.doit.wisc.edu.
adns2.doit.wisc.edu internet address = 144.92.20.99
dns2.itd.umich.edu  internet address = 192.12.80.222
adns3.doit.wisc.edu internet address = 144.92.104.21
adns1.doit.wisc.edu internet address = 144.92.9.21
adns2.doit.wisc.edu has AAAA address 2607:f388:d:2::1006
adns2.doit.wisc.edu has AAAA address 2607:f388::a53:2
adns3.doit.wisc.edu has AAAA address 2607:f388:2:2001::100b
adns3.doit.wisc.edu has AAAA address 2607:f388::a53:3
adns1.doit.wisc.edu has AAAA address 2607:f388:2:2001::100a
adns1.doit.wisc.edu has AAAA address 2607:f388::a53:1

If not, the HTCondor-CE administrator will have to register the appropriate DNS records.

Verifying service connectivity

After verifying DNS, check to see if the remote HTCondor-CE is listening on the appropriate port:

$ telnet htcondor-ce.chtc.wisc.edu 9619
Trying 128.104.100.65...
Connected to htcondor-ce.chtc.wisc.edu.
Escape character is '^]'.

If not, the HTCondor-CE administrator will have to ensure that the service is running and/or open up their firewall.

Verifying Configuration

Once you've verified network connectivity, you can start verifying the HTCondor-CE daemons.

Inspecting daemons

To inspect the running daemons of a remote HTCondor-CE, use condor_status:

$ condor_status -any -pool htcondor-ce.chtc.wisc.edu:9619
MyType             TargetType         Name

Collector          None               My Pool - htcondor-ce.chtc.wisc.edu@htcondor-ce.c
Job_Router         None               htcondor-ce@htcondor-ce.chtc.wisc.edu
Scheduler          None               htcondor-ce.chtc.wisc.edu
DaemonMaster       None               htcondor-ce.chtc.wisc.edu
Submitter          None               nu_lhcb@users.htcondor.org

If you don't see the appropriate daemons, ask the administrator to following these troubleshooting steps.

Submitter ad

When querying daemons for an HTCondor-CE, you may see Submitter ads for each user with jobs in the queue. These ads are used to collect per-user stats that are available to the HTCondor-CE administrator.

You can inspect the details of a specific daemon with per-daemon and -long options:

$ condor_status -pool htcondor-ce.chtc.wisc.edu:9619 -collector -long
ActiveQueryWorkers = 0
ActiveQueryWorkersPeak = 1
AddressV1 = "{[ p=\"primary\"; a=\"128.104.100.65\"; port=9619; n=\"Internet\"; alias=\"htcondor-ce.chtc.wisc.edu\"; spid=\"collector\"; noUDP=true; ], [ p=\"IPv4\"; a=\"128.104.100.65\"; port=9619; n=\"Internet\"; alias=\"htcondor-ce.chtc.wisc.edu\"; spid=\"collector\"; noUDP=true; ], [ p=\"IPv6\"; a=\"2607:f388:107c:501:216:3eff:fe89:aa3\"; port=9619; n=\"Internet\"; alias=\"htcondor-ce.chtc.wisc.edu\"; spid=\"collector\"; noUDP=true; ]}"
CollectorIpAddr = "<128.104.100.65:9619?addrs=128.104.100.65-9619+[2607-f388-107c-501-216-3eff-fe89-aa3]-9619&alias=htcondor-ce.chtc.wisc.edu&noUDP&sock=collector>"
CondorAdmin = "root@htcondor-ce.chtc.wisc.edu"
CondorPlatform = "$CondorPlatform: x86_64_CentOS7 $"
CondorVersion = "$CondorVersion: 8.9.8 Jun 29 2020 BuildID: 508520 PackageID: 8.9.8-0.508520 $"
CurrentJobsRunningAll = 0
CurrentJobsRunningGrid = 0
CurrentJobsRunningJava = 0

Inspecting resource requests

To inspect resource requests submitted to a remote HTCondor-CE, use condor_q:

$ condor_q -all -name htcondor-ce.chtc.wisc.edu -pool htcondor-ce.chtc.wisc.edu:9619

-- Schedd: htcondor-ce.chtc.wisc.edu : <128.104.100.65:9619?... @ 10/29/20 15:31:45
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
nu_lhcb ID: 24631   10/18 12:06     33      _      _      _     35 24631.5-15
nu_lhcb ID: 24632   10/18 12:23      7      _      _      _      9 24632.2-4
nu_lhcb ID: 24635   10/18 14:23      3      _      _      _      5 24635.0-1
nu_lhcb ID: 24636   10/18 14:40      5      _      _      _      9 24636.0-6
nu_lhcb ID: 24637   10/18 14:58      7      _      _      _      8 24637.2
nu_lhcb ID: 24638   10/18 15:15      7      _      _      _      8 24638.1

You can inspect the details of a specific resource request with the -long option:

$ condor_q -all -name htcondor-ce.chtc.wisc.edu -pool htcondor-ce.chtc.wisc.edu:9619 -long 24631.5
Arguments = ""
BufferBlockSize = 32768
BufferSize = 524288
BytesRecvd = 58669.0
BytesSent = 184201.0
ClusterId = 24631
Cmd = "DIRAC_Ullq7V_pilotwrapper.py"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0

Retrieving HTCondor-CE configuration

To verify a remote HTCondor-CE configuration, use condor_config_val:

$ condor_config_val -name htcondor-ce.chtc.wisc.edu -pool htcondor-ce.chtc.wisc.edu:9619 -dump
# Configuration from master on htcondor-ce.chtc.wisc.edu <128.104.100.65:9619?addrs=128.104.100.65-9619+[2607-f388-107c-501-216-3eff-fe89-aa3]-9619&alias=htcondor-ce.chtc.wisc.edu&noUDP&sock=master_1744571_b0a0>
ABORT_ON_EXCEPTION = false
ACCOUNTANT_HOST = 
ACCOUNTANT_LOCAL_DOMAIN = 
ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (time() - JobStart), 0)
ActivityTimer = (time() - EnteredCurrentActivity)
ADD_WINDOWS_FIREWALL_EXCEPTION = $(CondorIsAdmin)
ADVERTISE_IPV4_FIRST = $(PREFER_IPV4)
ALL_DEBUG = D:CAT D_ALWAYS:2
ALLOW_ADMIN_COMMANDS = true

If you know the name of the configuration variable, you can query for it directly:

$ condor_config_val -name htcondor-ce.chtc.wisc.edu -pool htcondor-ce.chtc.wisc.edu:9619 -verbose JOB_ROUTER_SCHEDD2_NAME
JOB_ROUTER_SCHEDD2_NAME = htcondor-ce.chtc.wisc.edu
 # at: /etc/condor-ce/config.d/02-ce-condor.conf, line 20
 # raw: JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME)

Verifying Resource Request Submission

After verifying that all the remote HTCondor-CE daemons are up, you can start submitting resource requests!

Verifying authentication

Before submitting a successful resource request, you will want to verify that you have submit privileges. For this, you will need a credential such as a grid proxy:

$ voms-proxy-info
subject   : /DC=org/DC=cilogon/C=US/O=University of Wisconsin-Madison/CN=Brian Lin A2266246/CN=41319870
issuer    : /DC=org/DC=cilogon/C=US/O=University of Wisconsin-Madison/CN=Brian Lin A2266246
identity  : /DC=org/DC=cilogon/C=US/O=University of Wisconsin-Madison/CN=Brian Lin A2266246
type      : RFC compliant proxy
strength  : 1024 bits
path      : /tmp/x509up_u1000
timeleft  : 3:55:22

After you have retrieved your credential, verify that you have the ability to submit requests to the remote HTCondor-CE (i.e., WRITE access) with condor_ping:

$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS,GSI
$ export _condor_SEC_TOOL_DEBUG=D_SECURITY:2  # Extremely verbose debugging for troubleshooting authentication issues
$ condor_ping -name htcondor-ce.chtc.wisc.edu \
              -pool htcondor-ce.chtc.wisc.edu:9619 \
              -verbose \
              -debug \
               WRITE
[...]
Remote Version:              $CondorVersion: 8.9.8 Jun 29 2020 BuildID: 508520 PackageID: 8.9.8-0.508520 $
Local  Version:              $CondorVersion: 8.9.9 Aug 26 2020 BuildID: 515894 PackageID: 8.9.9-0.515894 PRE-RELEASE-UWCS $
Session ID:                  htcondor-ce:2980451:1604006441:0
Instruction:                 WRITE
Command:                     60021
Encryption:                  none
Integrity:                   MD5
Authenticated using:         GSI
All authentication methods:  FS,GSI
Remote Mapping:              blin@users.htcondor.org
Authorized:                  TRUE

If condor_ping fails, ask the administrator to follow this troubleshooting section, set SCHEDD_DEBUG = $(SCHEDD_DEBUG) D_SECURITY:2, and cross-check the HTCondor-CE SchedLog for authentication issues.

Submitting a trace request

HTCondor-CE client

This section requires an installation of the htcondor-ce-client.

The easiest way to troubleshoot end-to-end resource request submission is condor_ce_trace, available in the htcondor-ce-client. Follow this documentation for detailed instructions for installing and using condor_ce_trace.

Advanced Troubleshooting

HTCondor-CE client

This section requires an installation of the htcondor-ce-client.

If the issue at hand is complicated or the communication turnaround time with the administrator is too long, it is often more expedient to grant the operator direct access to the HTCondor-CE host. Instead of direct login access, HTCondor-CE has the ability to allow a remote operator to run commands on the host as an unprivileged user. This requires permission to submit resource requests as well as an HTCondor-CE that is configured to run local universe jobs:

$ condor_config_val -name htcondor-ce.chtc.wisc.edu -pool htcondor-ce.chtc.wisc.edu:9619 -verbose START_LOCAL_UNIVERSE
START_LOCAL_UNIVERSE = TotalLocalJobsRunning + TotalSchedulerJobsRunning < 20
 # at: /etc/condor-ce/config.d/03-managed-fork.conf, line 11
 # raw: START_LOCAL_UNIVERSE = TotalLocalJobsRunning + TotalSchedulerJobsRunning < 20
 # default: TotalLocalJobsRunning < 200

After verifying that you can submit resource requests and that the HTCondor-CE supports local universe, use condor_ce_run to run commands on the remote HTCondor-CE host:

$ condor_ce_run -lr htcondor-ce.chtc.wisc.edu /bin/sh -c 'condor_q -all'

-- Schedd: htcondor-ce.chtc.wisc.edu : <128.104.100.65:9618?... @ 10/29/20 17:42:27
OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
nu_lhcb ID: 530251   7/13 22:20      _      _      _      _      1 530251.0
nu_lhcb ID: 586158  10/28 05:15      _      _      _      1      1 586158.0
nu_lhcb ID: 586213  10/28 06:23      _      1      _      _      1 586213.0
nu_lhcb ID: 586228  10/28 06:23      _      1      _      _      1 586228.0
nu_lhcb ID: 586254  10/28 06:23      _      1      _      _      1 586254.0
nu_lhcb ID: 586289  10/28 07:58      _      _      _      1      1 586289.0

Getting Help

If you have any questions or issues about troubleshooting remote HTCondor-CEs, please contact us for assistance.