[JDEV] jabberd crash after dnsrv read error

Fri Mar 9 01:58:30 CST 2001

Hmm... I'm wondering if this is related to something I've been trying
to track down. I've been seeing a similar situation while trying to
get jabberd to deal with SIGHUP properly. I didn't delve as deeply into
pth's guts, but I did narrow it down to pth waking up in the scheduler
prematurely and calling abort(), i.e. the message below:
**Pth** SCHEDULER INTERNAL ERROR: no more thread(s) available to schedule!?!?

I am usually seeing mio as the last log entry prior to the signal. I took
a break to work on some other projects.  My gut says that this is likely
caused by the fork within the threaded environment. Are we missing a
pth_atfork_push() call to clean up something in the child's environment?
I could be off base with this line of thought, though.
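
To make the "clean up something in the child's environment" idea concrete,
here is a minimal sketch of child-side fork handlers. It deliberately does
not use pth: the registry (child_handler_push(), child_handler_pop(),
fork_with_child_cleanup()) is a hand-rolled, hypothetical stand-in for what
pth_atfork_push() would provide, and close_inherited_fd() is only one
example of cleanup; what jabberd would actually need to reset in the child
is not established in this thread. Handlers run in registration order,
which is this sketch's own choice.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILD_HANDLERS 8

/* Hypothetical registry of callbacks to run in the child right after fork(). */
struct child_handler {
    void (*fn)(void *);
    void *arg;
};

static struct child_handler child_handlers[MAX_CHILD_HANDLERS];
static int child_handler_count;

/* Register a child-side handler; 0 on success, -1 if the table is full. */
static int child_handler_push(void (*fn)(void *), void *arg)
{
    if (child_handler_count >= MAX_CHILD_HANDLERS)
        return -1;
    child_handlers[child_handler_count].fn = fn;
    child_handlers[child_handler_count].arg = arg;
    child_handler_count++;
    return 0;
}

/* Drop the most recently registered handler, if any. */
static void child_handler_pop(void)
{
    if (child_handler_count > 0)
        child_handler_count--;
}

/* fork() wrapper: in the child only, run the handlers in registration order. */
static pid_t fork_with_child_cleanup(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        for (int i = 0; i < child_handler_count; i++)
            child_handlers[i].fn(child_handlers[i].arg);
    }
    return pid;
}

/* Example handler: close a descriptor the child should not inherit. */
static void close_inherited_fd(void *arg)
{
    close(*(int *)arg);
}

/* Returns 1 if the descriptor was closed in the child yet stayed open in
 * the parent, 0 otherwise, -1 on setup failure. */
static int demo_child_cleanup(void)
{
    int pfd[2], status = 0, result = -1;
    pid_t pid, w;

    if (pipe(pfd) < 0)
        return -1;
    if (child_handler_push(close_inherited_fd, &pfd[0]) < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    pid = fork_with_child_cleanup();
    if (pid == 0) {
        /* Child: report whether the handler really closed the descriptor. */
        int closed = (fcntl(pfd[0], F_GETFD) == -1 && errno == EBADF);
        _exit(closed ? 0 : 1);
    }
    if (pid > 0) {
        int parent_open = (fcntl(pfd[0], F_GETFD) != -1);
        do {
            w = waitpid(pid, &status, 0);
        } while (w < 0 && errno == EINTR);
        result = (w == pid && parent_open &&
                  WIFEXITED(status) && WEXITSTATUS(status) == 0);
    }

    child_handler_pop();
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```

Calling demo_child_cleanup() forks once and checks that the handler took
effect only in the child. The point is just that the cleanup is registered
once and then applies to every fork, rather than being repeated at each
call site.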

Hopefully the fix to one of our problems is the fix to both. I'll be looking
into this more tomorrow.

		Phil.

On Thu, Mar 08, 2001 at 06:39:07PM -0600, David Clissold wrote:
> I am seeing the jabberd server (1.4) occasionally crash.  The most likely
> situation to cause it seems to be a newly created user registering, and then
> exiting the client (or logging out) -- though this does not happen
> consistently.  If users are logged in and sending messages, all appears to
> be fine.
> 
> First, has anyone else experienced this?
> 
> I am trying to acquaint myself with the code and could use a pointer
> in investigating it.  What I have found so far:
> 
> I have rebuilt libpth in debug mode (-DPTH_DEBUG), and am running jabberd
> in debug mode (-D) as well.
> Within the dnsrv module, the process is running in the infinite loop in
> dnsrv_child_main(), around line 140, where it is repeatedly reading from
> a dns_io.  The server runs fine as long as there is not a read error here.
> As soon as we get a read error, this dnsrv child process exits with the
> expectation that the parent process will restart it.  The parent goes through
> the libpth code --- via pth_spawn(), to pth_connect_ev(), then pth_wait(),
> then pth_yield(), which gives a floating point exception calling
> pth_mctx_switch() at line 466 of pth_lib.c.  (This is just a macro that
> calls swapcontext() -- see pth_p.h).
> 
> My first instinct was that I was probably seeing a libpth problem,
> not a jabberd problem.  But a few factors are making this
> seem less likely:
>  1) With the libpth debug, I see that we go through this pth_yield() code
>     many, many times without trouble.
>  2) Using the same libpth, but with jabber 1.2, the problem did not
>     occur.  (same server: AIX 4.3.3, Linux clients running Gabber).
>  3) This happens ONLY when there is a dnsrv read error, and it happens
>     EVERY time there is a dnsrv read error (1-to-1 correspondence); it isn't
>     super common... but I wouldn't say it is rare either.
> 
> The last debug entry from the main jabberd process is always from mio.c
> "calling the connect handler for mio object..." in _mio_connect().
> 
> Now, I don't know if I should continue tackling the parent/pth problem,
> or if the problem is really this: why the dnsrv read error in the first
> place?  That is, is it expected as normal that the read will occasionally fail,
> and restarting dnsrv is just a part of normal operation?  Or is it the case
> that the dnsrv read should NOT ever fail, and the restart is just an
> emergency attempt to keep things rolling?  (And if the latter, has anyone
> tested the read-failure case, e.g. by breaking out of the read loop after
> a few dozen reads via a counter and seeing if things start up OK again?)
> 
> Anybody have any ideas on this?
> 
> David Clissold
> cliss at austin.ibm.com
> 
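
The counter experiment suggested in the quoted message can be tried outside
jabberd first. What follows is a minimal sketch under stated assumptions,
not the dnsrv code: child_read_loop(), run_child_once(), supervise() and
RECORD_SIZE are hypothetical names, a plain pipe stands in for the dns_io,
and plain fork()/waitpid() stand in for whatever the parent does under pth.
The child gives up after max_reads successful reads, and the parent checks
that it can notice the exit and start a fresh child each time.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RECORD_SIZE 16

/* Child side: read fixed-size records until EOF or error. If max_reads > 0,
 * bail out after that many successful reads to simulate a read error. */
static int child_read_loop(int fd, int max_reads)
{
    char buf[RECORD_SIZE];
    int reads = 0;

    for (;;) {
        ssize_t n;

        if (max_reads > 0 && reads >= max_reads)
            return 2;               /* injected failure */
        n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return 1;               /* genuine read error */
        }
        if (n == 0)
            return 0;               /* EOF: parent closed the pipe */
        reads++;
    }
}

/* Parent side: fork one child on a fresh pipe, send it `records` records,
 * close the pipe, and return the child's exit status (-1 on failure). */
static int run_child_once(int records, int max_reads)
{
    int pfd[2], status = 0;
    char rec[RECORD_SIZE];
    void (*old_sigpipe)(int);
    pid_t pid, w;

    if (pipe(pfd) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (pid == 0) {
        close(pfd[1]);              /* so the child can see EOF */
        _exit(child_read_loop(pfd[0], max_reads));
    }

    close(pfd[0]);
    memset(rec, 'x', sizeof rec);
    old_sigpipe = signal(SIGPIPE, SIG_IGN);   /* child may already be gone */
    for (int i = 0; i < records; i++) {
        if (write(pfd[1], rec, sizeof rec) != (ssize_t)sizeof rec)
            break;
    }
    signal(SIGPIPE, old_sigpipe);
    close(pfd[1]);

    do {
        w = waitpid(pid, &status, 0);
    } while (w < 0 && errno == EINTR);
    return (w == pid && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}

/* Run `cycles` children that each fail after max_reads (> 0) reads.
 * Returns the number of restarts, or -1 if a child exited unexpectedly. */
static int supervise(int cycles, int max_reads)
{
    int restarts = 0;

    for (int i = 0; i < cycles; i++) {
        if (run_child_once(max_reads, max_reads) != 2)
            return -1;              /* child did not fail the way we injected */
        restarts++;                 /* next iteration starts a fresh child */
    }
    return restarts;
}
```

For example, supervise(5, 24) runs five children that each quit after two
dozen reads and returns how many were restarted, while run_child_once(3, 0)
disables the injection and lets the child exit on EOF. If this pattern
behaves outside jabberd but the same counter inside dnsrv_child_main()
still takes the parent down, that would point at the restart path rather
than at the read error itself.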

-- 
Mourn the passing of the Mystic Knights.. but revel in their legacy.