Daemon Debugging

All platforms

<SUBSYS>_DEBUG_WAIT

When <SUBSYS>_DEBUG_WAIT is true, the daemon pauses on startup in a loop equivalent to bool debug_wait = true; while (debug_wait) { sleep(1); }. That lets you attach to the daemon with a debugger, clear debug_wait (in gdb, for example, set var debug_wait = 0), and continue the run.

(In builds from before late June 2012, this does not work reliably because debug_wait could be optimized out. See #3064 for details and a workaround.)

On Windows you might prefer the slightly more convenient <SUBSYS>_WAIT_FOR_DEBUGGER, documented below.

GDB specific issues

clone(), gdb internal-errors

HTCondor uses clone in a way that causes GDB grief. Running an HTCondor daemon under GDB will likely fail with errors similar to:

warning: Can't attach LWP -1: No such process
../../gdb/linux-thread-db.c:389: internal-error: thread_get_info_callback: Assertion `inout->thread_info != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

The solution is to tell HTCondor not to use clone, but to fall back on the fork code path. In your HTCondor configuration file, use:

USE_CLONE_TO_CREATE_PROCESSES = false

(This page primarily exists as search bait, so that others hitting this error can quickly find the solution.)

For the adventurous (or those who are debugging the clone calls), one can apply the patch found on #2208.

Debug a core file produced by a stripped binary

Say you have a.out.stripped and a.out.with_symbols, and a core file from a.out.stripped that you want to debug but are being thwarted by the lack of symbols. The trick is to still run gdb on the file that produced the core, but load the symbols from the unstripped binary at the address of the text segment of the file you are debugging.

   % gdb a.out.stripped core

... gdb will complain about missing debug info ...

   (gdb) maint info sections

In the output from 'maint info sections', look for the .text section. The address in the first column is what you want to use in the next command. For instance, one of the output lines from 'maint info sections' will be

    0x00400390->0x00400588 at 0x00000390: .text ALLOC LOAD READONLY CODE HAS_CONTENTS

The address in the first column, 0x00400390 in this example, is used in the next gdb command:

   (gdb) add-symbol-file a.out.with_symbols 0x00400390

Of course, substitute 0x00400390 with the proper text segment address.
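
If you do this often, picking the .text address out of the 'maint info sections' output can be scripted. The following is a minimal sketch, assuming the output lines look like the example above; the function names text_section_start and add_symbol_file_command are hypothetical helpers introduced here for illustration, not part of gdb or HTCondor.

```python
import re

# Matches lines such as:
#   0x00400390->0x00400588 at 0x00000390: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
# Group 1 is the start address (first column), group 2 is the section name.
SECTION_RE = re.compile(
    r"(0x[0-9a-fA-F]+)->0x[0-9a-fA-F]+\s+at\s+0x[0-9a-fA-F]+:\s+(\S+)"
)


def text_section_start(maint_output):
    """Return the start address of the .text section, or None if not found."""
    for line in maint_output.splitlines():
        match = SECTION_RE.search(line)
        if match and match.group(2) == ".text":
            return int(match.group(1), 16)
    return None


def add_symbol_file_command(symbol_file, address):
    """Format the gdb command that loads symbols at the given address."""
    return "add-symbol-file {} {:#x}".format(symbol_file, address)


# The sample line is the one quoted in the text above.
sample = (
    "    0x00400390->0x00400588 at 0x00000390: "
    ".text ALLOC LOAD READONLY CODE HAS_CONTENTS"
)
print(add_symbol_file_command("a.out.with_symbols", text_section_start(sample)))
```

Paste the saved gdb output into the script in place of the sample line, then copy the printed command back into the gdb session.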

That will get your symbols loaded for the main executable, but you still won't have any symbols for libraries that were dynamically loaded, which these days is most of HTCondor.

So, to get those, you need to do a couple of things.

First, run gdb on the unstripped version of the .so you would like to load and find the .text segment just like above. Write down the offset (the first column); you'll need it later.

Second, go back to gdb with the stripped binary and the core. If you look at any particular stack frame, you'll see the instruction pointer in the second column:

    #7  0x00007f034f8c1834 in KeyCache::remove ()

Now comes the black-magic portion. Run "maint info sections" again and look at the first column. Find the entry that contains the above instruction pointer, The Price Is Right-style. That is, find the largest value without going over your instruction pointer. It should be marked "ALLOC READONLY CODE".

In the following snippet, if your instruction pointer is 0x00002af760aa32a5, then the address you want is 0x2af76081e000:

    0x3fd0424000->0x3fd0424000 at 0x006ce000: load81 ALLOC READONLY
    0x3fd0623000->0x3fd0625000 at 0x006ce000: load82 ALLOC LOAD HAS_CONTENTS
    0x2af76080e000->0x2af760810000 at 0x006d0000: load83 ALLOC LOAD HAS_CONTENTS
    0x2af76081e000->0x2af76081e000 at 0x006d2000: load84 ALLOC READONLY CODE
    0x2af760b76000->0x2af760b76000 at 0x006d2000: load85 ALLOC READONLY
    0x2af760d75000->0x2af760d88000 at 0x006d2000: load86 ALLOC LOAD HAS_CONTENTS
    0x2af760d88000->0x2af760d8e000 at 0x006e5000: load87 ALLOC LOAD HAS_CONTENTS
    0x2af760d8e000->0x2af760d8e000 at 0x006eb000: load88 ALLOC READONLY CODE

Now, add that address to the offset of the .text segment from the .so file you found above. You can do this right in gdb:

    (gdb) print /x 0x000d58a0 + 0x2af76081e000
    $1 = 0x2af7608f38a0
    (gdb)

Finally, you can add the symbols from the shared library with:

    (gdb) add-symbol-file unstripped.so 0xMagicAddress

You'll have to do this for each shared library (libcondor_util, libclassad, etc.) if you need the symbols for that particular stack frame.
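
Since this has to be repeated per library, the lookup and the addition can be sketched in a few lines of Python. This is only an illustration under the assumption that the 'maint info sections' output has the shape shown above; library_base and symbol_load_address are hypothetical names, and the sample data is the snippet quoted earlier.

```python
import re

# Group 1: start address, group 2: end address, group 3: name, group 4: flags.
SECTION_RE = re.compile(
    r"(0x[0-9a-fA-F]+)->(0x[0-9a-fA-F]+)\s+at\s+0x[0-9a-fA-F]+:\s+(\S+)\s*(.*)"
)


def library_base(maint_output, instruction_pointer):
    """Return (start, flags) of the entry with the largest start address
    that does not exceed instruction_pointer, or None if there is none."""
    best = None
    for line in maint_output.splitlines():
        match = SECTION_RE.search(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        if start <= instruction_pointer and (best is None or start > best[0]):
            best = (start, match.group(4).split())
    return best


def symbol_load_address(maint_output, instruction_pointer, text_offset):
    """Add the .so's .text offset to the base found for instruction_pointer."""
    found = library_base(maint_output, instruction_pointer)
    if found is None:
        raise ValueError("no section starts at or below the instruction pointer")
    start, flags = found
    # The text above says the entry should be marked ALLOC READONLY CODE.
    if "CODE" not in flags:
        raise ValueError("closest entry is not marked CODE: {}".format(flags))
    return start + text_offset


sample = """
    0x3fd0424000->0x3fd0424000 at 0x006ce000: load81 ALLOC READONLY
    0x3fd0623000->0x3fd0625000 at 0x006ce000: load82 ALLOC LOAD HAS_CONTENTS
    0x2af76080e000->0x2af760810000 at 0x006d0000: load83 ALLOC LOAD HAS_CONTENTS
    0x2af76081e000->0x2af76081e000 at 0x006d2000: load84 ALLOC READONLY CODE
    0x2af760b76000->0x2af760b76000 at 0x006d2000: load85 ALLOC READONLY
    0x2af760d75000->0x2af760d88000 at 0x006d2000: load86 ALLOC LOAD HAS_CONTENTS
"""
address = symbol_load_address(sample, 0x00002AF760AA32A5, 0x000D58A0)
print("add-symbol-file unstripped.so {:#x}".format(address))
```

The sketch refuses to produce an address when the closest entry is not marked CODE, since that usually means the instruction pointer belongs to a different mapping than the one you expected.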

Note that the HTCondor binaries are linked separately for the tarball and native packages (rpm and deb), with slightly different settings. For example, the RUNPATH is set differently. Thus, the binaries from the unstripped tarball may not work properly with a core file generated from an RPM installation.

Windows specific issues

<SUBSYS>_WAIT_FOR_DEBUGGER

In addition to <SUBSYS>_DEBUG_WAIT, on Windows (#ifdef WIN32) you have access to <SUBSYS>_WAIT_FOR_DEBUGGER. It's very similar, but is smart enough to automatically detect when a debugger is attached.

Using the Windows heap verifier

The Windows SDK includes a collection of Debugging Tools for Windows, one of which is the heap verifier. It can detect heap corruption early and provide a stack trace at or near the corrupting code. To use it, install the download at http://www.microsoft.com/en-us/download/confirmation.aspx?id=8279 to get gflags and cdb/ntsd.

The steps are:

  • Use gflags to turn on heap checking for the daemon.
  • Replace the daemon with a script that runs the daemon under the debugger.
  • Start HTCondor. When the daemon starts up, attach to the debugger and enter 'g' to let the daemon begin running.

You can replace the daemon with a batch script like this one, which replaces the condor_starter. Save this as c:\condor\bin\debug_condor_starter.bat

@echo off
@setlocal
@if NOT "%1"=="-classad" goto :job
@REM just run the starter normally
%~dp0condor_starter.exe %*
@goto :EOF
:job
@REM echo %* > c:\condor\log\starterargs
set remote=-server tcp:port=5099:6000
set log=-logo c:\condor\log\DStarterLog
::set nobreak=-g
set _NT_SYMBOL_PATH=c:\condor\bin;SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
"C:\Program Files\Debugging Tools for Windows (x64)\cdb.exe" %remote% %log% -x %nobreak% %~dp0condor_starter.exe %*

And add this to your configuration:

STARTER = $(BIN)\debug_condor_starter.bat

Start a command prompt with run-as-administrator and execute:

"C:\Program Files\Debugging Tools for Windows (x64)\gflags" /p /enable condor_starter.exe

Or use this to get full verification at every call (it's a bit slower):

"C:\Program Files\Debugging Tools for Windows (x64)\gflags" /p /enable condor_starter.exe /full

Start HTCondor. Once the daemon starts up, run this command:

"C:\Program Files\Debugging Tools for Windows (x64)\cdb" -remote tcp:port=5099:6000

This will attach to the cdb instance that is debugging the daemon. Unless you had nobreak set in the script above, the debugger will have broken in and the daemon will not yet have started. Run the following commands to run the daemon:

.lines
g

When the heap verifier finds a problem, the 'k' command will produce a stack trace.