[JDEV] jabberd crash after dnsrv read error

David Clissold cliss at austin.ibm.com
Thu Mar 8 18:39:07 CST 2001


I am seeing the jabberd server (1.4) occasionally crash.  The most likely
situation to cause it seems to be a newly created user registering, and then
exiting the client (or logging out) -- though this does not happen
consistently.  If users are logged in and sending messages, all appears to
be fine.

First, has anyone else experienced this?

I am trying to acquaint myself with the code and could use a pointer
in investigating it.  What I have found so far:

I have rebuilt libpth in debug mode (-DPTH_DEBUG), and am running jabberd
in debug mode (-D) as well.
Within the dnsrv module, the process is running in the infinite loop in
dnsrv_child_main(), around line 140, where it is repeatedly reading from
a dns_io.  The server runs fine as long as there is not a read error here.
As soon as we get a read error, this dnsrv child process exits with the
expectation that the parent process will restart it.  The parent goes through
the libpth code --- via pth_spawn(), to pth_connect_ev(), then pth_wait(),
then pth_yield(), which gives a floating point exception calling
pth_mctx_switch() at line 466 of pth_lib.c.  (This is just a macro that
calls swapcontext() -- see pth_p.h).
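
To make the exit-and-restart pattern described above concrete, here is a minimal sketch using plain fork(), pipe() and waitpid(). It is not the jabberd or libpth code: spawn_reader() and restart_when_dead() are hypothetical names, the sketch treats EOF the same as a read error, and the real parent goes through pth_spawn() rather than blocking in waitpid().

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not jabberd code: fork a child that reads from a
 * pipe until read() fails or returns EOF, then exits with status 1. */
pid_t spawn_reader(int *write_fd)
{
    int fds[2];
    pid_t pid;

    assert(write_fd != NULL);
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                 /* child */
        char buf[256];
        close(fds[1]);              /* keep only the read end */
        while (read(fds[0], buf, sizeof buf) > 0)
            ;                       /* a real child would process buf here */
        _exit(1);                   /* read failed: leave the restart to the parent */
    }
    close(fds[0]);                  /* parent keeps only the write end */
    *write_fd = fds[1];
    return pid;
}

/* Hypothetical helper: block until the old child has exited, then start
 * a replacement. Returns the new child's pid, or -1 on failure. */
pid_t restart_when_dead(pid_t pid, int *write_fd)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return spawn_reader(write_fd);
}
```

The caller is expected to close the old write end before restarting; otherwise the replacement child inherits it across fork().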

My first instinct was that I was probably seeing a libpth problem,
not a jabberd problem.  But a couple of factors are making this
seem less likely:
 1) With the libpth debug, I see that we go through this pth_yield() code
    many, many times without trouble.
 2) Using the same libpth, but with jabberd 1.2, the problem did not
    occur.  (same server: AIX 4.3.3, Linux clients running Gabber).
 3) This happens ONLY when there is a dnsrv read error, and it happens
    EVERY time there is a dnsrv read error (1-to-1 correspondence); it isn't
    super common... but I wouldn't say it is rare either.

The last debug entry from the main jabberd process is always from mio.c
"calling the connect handler for mio object..." in _mio_connect().

Now, I don't know if I should continue tackling the parent/pth problem,
or if the problem is really this: why the dnsrv read error in the first
place?  That is, is it expected as normal that the read will occasionally fail,
and restarting dnsrv is just a part of normal operation?  Or is it the case
that the dnsrv read should NOT ever fail, and the restart is just an
emergency attempt to keep things rolling?  (And if the latter, has anyone
tested the case of read failure; e.g. break out of the read loop after
a few dozen reads via a counter and see if things start up OK again)?
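
The counter-based test suggested above could be sketched roughly as follows. This is only an illustration: faulty_read() and reads_until_fault are hypothetical names, not anything in dnsrv, and the idea would be to call the wrapper in place of the real read inside the dnsrv_child_main() loop.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical fault-injection knob, not part of jabberd: the Nth call
 * to faulty_read() fails.  36 stands in for "a few dozen"; 0 disables. */
static int reads_until_fault = 36;

/* Drop-in wrapper around read() that fails exactly once, with EIO,
 * when the counter runs out.  The failing call consumes no data. */
ssize_t faulty_read(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    if (reads_until_fault > 0 && --reads_until_fault == 0) {
        errno = EIO;                /* simulated read error */
        return -1;
    }
    return read(fd, buf, len);
}
```

If the crash reproduces on every injected failure, that would point at the restart path rather than at whatever causes the real read errors.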

Anybody have any ideas on this?

David Clissold
cliss at austin.ibm.com



