[JDEV] jabberd crash in swapcontext() via _mio_raw_connect()

David Clissold cliss at austin.ibm.com
Mon Mar 12 16:40:26 CST 2001


I've looked at this some more -- enough to determine that the dnsrv
failure is not the fault of dnsrv itself.  The read in dnsrv_child_main()
does not actually fail: it returns 0 (and errno is 0), which means the
child got an EOF because the writing end, held by the parent process,
has gone away.  In fact, if the child dnsrv process sleeps a few seconds
and then prints its ppid before exiting, the output shows that it has
been orphaned.  (If the child gets a genuine read failure while the parent
is still alive, a case I have simulated, it gets restarted just fine.)
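
A minimal sketch of the distinction drawn above, assuming the child reads
from a pipe-like descriptor.  The names classify_read(),
read_after_writer_closed() and is_orphaned() are hypothetical and are not
taken from the jabberd source.  read() returning 0 means EOF because every
writer has closed its end, -1 means a real error, and a changed getppid()
suggests the original parent has died.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical classification of a read() return value. */
enum read_outcome { READ_DATA, READ_EOF, READ_ERROR };

enum read_outcome classify_read(ssize_t n)
{
    if (n > 0)
        return READ_DATA;   /* got bytes */
    if (n == 0)
        return READ_EOF;    /* all writers closed their end */
    return READ_ERROR;      /* n == -1, errno describes the failure */
}

/* Demonstrates that reading from a pipe whose only write end has been
 * closed yields 0 (EOF) rather than an error. */
ssize_t read_after_writer_closed(void)
{
    int fds[2];
    char buf[16];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    close(fds[1]);                      /* simulate the writer going away */
    n = read(fds[0], buf, sizeof buf);  /* returns 0: EOF */
    assert(n <= (ssize_t)sizeof buf);
    close(fds[0]);
    return n;
}

/* Returns non-zero if the process has been reparented since
 * original_parent was recorded (for example with getppid() at startup). */
int is_orphaned(pid_t original_parent)
{
    return getppid() != original_parent;
}
```

A child could record getppid() at startup and, on READ_EOF, call
is_orphaned() to tell "parent closed the descriptor" apart from "parent
is gone", which is the check done by hand above with a sleep and a print.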

This then leads me back to the parent jabberd process -- why does it die?
The last debug entry from the parent is:

Mon Mar 12 14:06:01 2001  mio.c:507 calling the connect handler for mio object 200F11C8

The next line calls through a function pointer to _mio_raw_connect(),
which in turn calls pth_connect_ev() (mio_raw.c:64).  I'm guessing some
data must be corrupted at this point, because the core dump happens here.

I'll try to determine more, but any pointers from jabber or pth experts
would be appreciated.
(Phil -- did you have a chance to look into the pth abort() problem you
were seeing?  Perhaps we are seeing two instances of the same problem.)

Incidentally, my jabber.xml is almost identical to the default (only
change: 'localhost' modified to my server hostname).

David Clissold
cliss at austin.ibm.com

Original problem description:
>I am seeing the jabberd server (1.4) occasionally crash.  The most likely
>situation to cause it seems to be a newly created user registering, and then
>exiting the client (or logging out) -- though this does not happen
>consistently.  If users are logged in and sending messages, all appears to
>be fine.
>
>First, has anyone else experienced this?
>
>I am trying to acquaint myself with the code and could use a pointer
>in investigating it.  What I have found so far:
>
>I have rebuilt libpth in debug mode (-DPTH_DEBUG), and am running jabberd
>in debug mode (-D) as well.
>Within the dnsrv module, the process is running in the infinite loop in
>dnsrv_child_main(), around line 140, where it is repeatedly reading from
>a dns_io.  The server runs fine as long as there is not a read error here.
>As soon as we get a read error, this dnsrv child process exits with the
>expectation that the parent process will restart it.  The parent goes through
>the libpth code --- via pth_spawn(), to pth_connect_ev(), then pth_wait(),
>then pth_yield(), which gives a floating point exception calling
>pth_mctx_switch() at line 466 of pth_lib.c.  (This is just a macro that
>calls swapcontext() -- see pth_p.h).
>
>My first instinct was that I was probably seeing a libpth problem,
>not a jabberd problem.  But a few factors are making this
>seem less likely:
> 1) With the libpth debug, I see that we go through this pth_yield() code
>    many, many times without trouble.
> 2) Using the same libpth, but with jabber 1.2, the problem did not
>    occur.  (same server: AIX 4.3.3, Linux clients running Gabber).
> 3) This happens ONLY when there is a dnsrv read error, and it happens
>    EVERY time there is a dnsrv read error (1-to-1 correspondence); it isn't
>    super common... but I wouldn't say it is rare either.
>
>The last debug entry from the main jabberd process is always from mio.c
>"calling the connect handler for mio object..." in _mio_connect().
>
>Now, I don't know if I should continue tackling the parent/pth problem,
>or if the problem is really this: why the dnsrv read error in the first
>place?  That is, is it expected as normal that the read will occasionally fail,
>and restarting dnsrv is just a part of normal operation?  Or is it the case
>that the dnsrv read should NOT ever fail, and the restart is just an
>emergency attempt to keep things rolling?  (And if the latter, has anyone
>tested the case of read failure; e.g. break out of the read loop after
>a few dozen reads via a counter and see if things start up OK again)?
>
>Anybody have any ideas on this?
>
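
The counter-based test suggested in the quoted message could be sketched
roughly as follows.  This is only an illustration under assumptions: the
struct fault_injector type and read_with_fault() are hypothetical names,
not jabberd code, and EIO is an arbitrary choice of injected error.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fault injector: after fail_after successful calls,
 * every further call reports a read error instead of reading. */
struct fault_injector {
    unsigned calls;       /* real reads performed so far */
    unsigned fail_after;  /* 0 disables injection */
};

ssize_t read_with_fault(struct fault_injector *fi, int fd,
                        void *buf, size_t len)
{
    if (fi->fail_after != 0 && fi->calls >= fi->fail_after) {
        errno = EIO;      /* simulated failure; the fd is not touched */
        return -1;
    }
    fi->calls++;
    return read(fd, buf, len);
}
```

Swapping a wrapper like this in for the read inside the loop would force
the error path after a fixed number of reads, so the restart logic can be
exercised while the parent is known to be alive.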



