Discussion:
Question about http_stub_status
Marcus Bianchy
2008-08-22 11:51:12 UTC
Permalink
Hi,


we are using the "Stub Status Module" now for a couple of days, and
there is a something strange effect with the values from this module:

For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.


What do these values stand for? What is happening there? BTW:
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).

Maybe someone can explain this behavior?



Thanks in advance,

Marcus
Maxim Dounin
2008-08-22 15:07:10 UTC
Permalink
Hello!
Post by Marcus Bianchy
we have been using the "Stub Status Module" for a couple of days now, and
For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.
JFYI: you should rotate logfiles with SIGUSR1, not with SIGHUP. With
SIGHUP nginx re-configures itself, while with SIGUSR1 it does exactly the
one thing needed to rotate logfiles: it reopens them.
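For reference, manual rotation then looks roughly like this (the paths are
only examples, adjust them to your layout):

mv /var/log/nginx/access.log /var/log/nginx/access.log.0
kill -USR1 `cat /var/run/nginx.pid`    # master and workers reopen the log files
sleep 1                                # let the workers finish writing to the old file
gzip /var/log/nginx/access.log.0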
Post by Marcus Bianchy
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).
Maybe someone can explain this behavior?
Sudden changes in stub_status numbers usually correspond to a dead or
killed nginx worker. Take a look at the logs.

Maxim Dounin
Igor Sysoev
2008-08-23 07:04:24 UTC
Permalink
Post by Marcus Bianchy
we have been using the "Stub Status Module" for a couple of days now, and
For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.
nginx has a special rotation signal: SIGUSR1.
Post by Marcus Bianchy
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).
Maybe someone can explain this behavior?
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or the workers exited abnormally. Could you run

grep alert error_log

?
--
Igor Sysoev
http://sysoev.ru/en/
Marcus Bianchy
2008-08-23 11:18:07 UTC
Permalink
Hi Igor,

thank you for the fast answer.
Post by Igor Sysoev
nginx has special rotation signal: SIGUSR1.
OK, I will fix that. I just called our /etc/init.d script (which sends
a SIGHUP) after logrotate. It's a better idea to send a USR1 from
logrotate directly to the master process. Or, even better, to change the
init script to do that.
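For the record, something like this in /etc/logrotate.d/nginx should do
(the pid file path is an assumption, ours may differ):

/var/log/nginx/*.log {
    daily
    rotate 7
    compress
    delaycompress
    sharedscripts
    postrotate
        # ask the nginx master process to reopen its log files
        kill -USR1 `cat /var/run/nginx.pid`
    endscript
}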
Post by Igor Sysoev
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or workers exited abnornamally. Could you run
grep alert error_log
Well, I can say that no one on our team sends such signals around...
We have been observing strange signal 8 (SIGFPE) errors lately.
A typical "grep/zgrep signal" over our error logs shows entries
like this:

############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
############ snip ############

The logrotate runs at 04:00 in the morning, which would explain the
SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
around 1pm; neither the access.log nor the error.log gives any hint
as to what produces this behaviour. And guess what: yesterday
at 1pm the values for active/waiting connections increased to
~30000/35000.

Maybe it's a good idea to allow core dumps, so we can see exactly what
causes these signals?
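If we go that route, it should only need something like this in the main
context of nginx.conf plus a writable dump directory (the path below is
just an example):

worker_rlimit_core  500m;             # allow worker processes to write core files
working_directory   /var/tmp/nginx;   # cores are written to the workers' working directory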


Marcus Bianchy
Igor Sysoev
2008-08-26 16:25:24 UTC
Permalink
Post by Marcus Bianchy
Post by Igor Sysoev
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or the workers exited abnormally. Could you run
grep alert error_log
Well, I can say that no one on our team sends such signals around...
A typical "grep/zgrep signal" over our error logs shows entries like this:
############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
############ snip ############
The logrotate runs at 04:00 in the morning, which would explain the
SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
around 1pm; neither the access.log nor the error.log gives any hint
as to what produces this behaviour. And guess what: yesterday
at 1pm the values for active/waiting connections increased to
~30000/35000.
Maybe it's a good idea to allow core dumps, so we can see exactly what
causes these signals?
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
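In other words, an upstream along these lines triggers it (the addresses
are just placeholders):

upstream backend {
    # max_fails=0 disables failure accounting; in the affected versions a
    # failed request then leads to a division by zero in the round-robin code
    server 10.0.0.1:8080 max_fails=0;
    server 10.0.0.2:8080 max_fails=0;
}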
--
Igor Sysoev
http://sysoev.ru/en/
Marcus Bianchy
2008-08-26 18:53:47 UTC
Permalink
Post by Igor Sysoev
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
Hi Igor,

thanks for the update.

Meanwhile I have two core dumps from the FPE. And guess what I found:

#0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
    data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
648         peer->current_weight -= peer->weight / peer->max_fails;
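With max_fails=0 that line is an integer division by zero, which raises
SIGFPE, i.e. exactly the signal 8 in our error log. I assume the fix simply
guards the division, roughly like this (my guess, not necessarily Maxim's
actual patch):

/* hypothetical guard at ngx_http_upstream_round_robin.c:648 */
if (peer->max_fails) {
    peer->current_weight -= peer->weight / peer->max_fails;
}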


I will discuss the further steps with my colleagues tomorrow. Maybe
we'll try Maxim's patch, or nginx 0.7.13.

Marcus Bianchy


Igor Sysoev
2008-08-26 19:04:18 UTC
Permalink
Post by Marcus Bianchy
Post by Igor Sysoev
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
Hi Igor,
thanks for the update.
#0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
    data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
648         peer->current_weight -= peer->weight / peer->max_fails;
I will discuss the further steps with my colleagues tomorrow. Maybe
we'll try Maxim's patch, or nginx 0.7.13.
Yes, this is the same bug.
--
Igor Sysoev
http://sysoev.ru/en/