Discussion:
Question about http_stub_status
Marcus Bianchy
2008-08-22 11:51:12 UTC
Permalink
Hi,


we are using the "Stub Status Module" now for a couple of days, and
there is a something strange effect with the values from this module:

For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.


What do these values stand for? What is happening there? BTW:
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).

Maybe someone can explain this behavior?



Thanks in advance,

Marcus
Maxim Dounin
2008-08-22 15:07:10 UTC
Permalink
Hello!
Post by Marcus Bianchy
we have been using the "Stub Status Module" for a couple of days now, and
For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.
JFYI: you should rotate logfiles with SIGUSR1, not with SIGHUP. With
SIGHUP nginx re-configures itself, while with SIGUSR1 it does exactly the
one thing needed to rotate logfiles: it reopens them.
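For reference, manual rotation then looks roughly like this (the paths are
only examples, adjust them to your layout):

mv /var/log/nginx/access.log /var/log/nginx/access.log.0
kill -USR1 `cat /var/run/nginx.pid`    # master and workers reopen the log files
sleep 1                                # let the workers finish writing to the old file
gzip /var/log/nginx/access.log.0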
Post by Marcus Bianchy
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).
Maybe someone can explain this behavior?
Sudden changes in stub_status numbers usually correspond to a dead or
killed nginx worker. Take a look at the logs.

Maxim Dounin
Igor Sysoev
2008-08-23 07:04:24 UTC
Permalink
Post by Marcus Bianchy
we have been using the "Stub Status Module" for a couple of days now, and
For the first few hours the values for "Active connections" and "Waiting"
were reasonable and low (a few hundred).
After a while, both values suddenly increased to ~5000, then a few
hours later to ~20000, then to ~30000.
It doesn't matter that we rotate our logfiles (with SIGHUP): the
values stay high, at around ~30000.
nginx has a special rotation signal: SIGUSR1.
Post by Marcus Bianchy
netstat et al. report a few hundred to a few thousand concurrent (i.e.
ESTABLISHED) connections (plus some in FIN_WAIT etc.).
Maybe someone can explain this behavior?
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or the workers exited abnormally. Could you run

grep alert error_log

?
--
Igor Sysoev
http://sysoev.ru/en/
Marcus Bianchy
2008-08-23 11:18:07 UTC
Permalink
Hi Igor,

thank you for the fast answer.
Post by Igor Sysoev
nginx has special rotation signal: SIGUSR1.
OK, I will fix that. I just called our /etc/init.d script (which sends
a SIGHUP) after logrotate. It's a better idea to send a USR1 from
logrotate directly to the master process. Or, even better, to change the
init script to do that.
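For the record, something like this in /etc/logrotate.d/nginx should do
(the pid file path is an assumption, ours may differ):

/var/log/nginx/*.log {
    daily
    rotate 7
    compress
    delaycompress
    sharedscripts
    postrotate
        # ask the nginx master process to reopen its log files
        kill -USR1 `cat /var/run/nginx.pid`
    endscript
}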
Post by Igor Sysoev
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or workers exited abnornamally. Could you run
grep alert error_log
Well, I can say that no one on our team sends such signals around...
We have been observing strange signal 8 (SIGFPE) errors lately.
A typical "grep/zgrep signal" over our error logs shows entries
like this:

############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
############ snip ############

The logrotate runs at 04:00 in the morning, which would explain the
SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
around 1pm; neither the access.log nor the error.log gives any hint
as to what produces this behaviour. And guess what: yesterday
at 1pm the values for active/waiting connections increased to
~30000/35000.

Maybe it's a good idea to allow core dumps, so we can see exactly what
causes these signals?
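If we go that route, it should only need something like this in the main
context of nginx.conf plus a writable dump directory (the path below is
just an example):

worker_rlimit_core  500m;             # allow worker processes to write core files
working_directory   /var/tmp/nginx;   # cores are written to the workers' working directory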


Marcus Bianchy
Igor Sysoev
2008-08-26 16:25:24 UTC
Permalink
Post by Marcus Bianchy
Post by Igor Sysoev
This means that either someone killed nginx workers using SIGTERM/INT/KILL
or the workers exited abnormally. Could you run
grep alert error_log
Well, I can say that no one on our team sends such signals around...
A typical "grep/zgrep signal" over our error logs shows entries like this:
############ snip ############
2008/08/22 10:09:42 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 10:09:42 [alert] 28631#0: worker process 27809 exited on signal 8
2008/08/22 10:09:42 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 27810 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:06 [alert] 28631#0: worker process 32013 exited on signal 8
2008/08/22 12:58:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:11 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:11 [alert] 28631#0: worker process 27811 exited on signal 8
2008/08/22 12:58:11 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:58:20 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:58:20 [alert] 28631#0: worker process 785 exited on signal 8
2008/08/22 12:58:20 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 12:59:36 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 12:59:36 [alert] 28631#0: worker process 1342 exited on signal 8
2008/08/22 12:59:36 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/22 13:00:06 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/22 13:00:06 [alert] 28631#0: worker process 1343 exited on signal 8
2008/08/22 13:00:06 [notice] 28631#0: signal 29 (SIGIO) received
2008/08/23 04:02:18 [notice] 28631#0: signal 17 (SIGCHLD) received
2008/08/23 04:02:18 [alert] 28631#0: worker process 1344 exited on signal 8
2008/08/23 04:02:18 [notice] 28631#0: signal 29 (SIGIO) received
############ snip ############
The logrotate runs at 04:00 in the morning, which would explain the
SIGCHLD/SIGFPE at 04:02:18. But the real problem is the signals at
around 1pm; neither the access.log nor the error.log gives any hint
as to what produces this behaviour. And guess what: yesterday
at 1pm the values for active/waiting connections increased to
~30000/35000.
Maybe it's a good idea to allow core dumps, so we can see exactly what
causes these signals?
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
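In other words, an upstream along these lines triggers it (the addresses
are just placeholders):

upstream backend {
    # max_fails=0 disables failure accounting; in the affected versions a
    # failed request then leads to a division by zero in the round-robin code
    server 10.0.0.1:8080 max_fails=0;
    server 10.0.0.2:8080 max_fails=0;
}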
--
Igor Sysoev
http://sysoev.ru/en/
Marcus Bianchy
2008-08-26 18:53:47 UTC
Permalink
Post by Igor Sysoev
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
Hi Igor,

thanks for the update.

Meanwhile I have two core dumps from the FPE. And guess what I found:

#0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
    data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
648         peer->current_weight -= peer->weight / peer->max_fails;
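With max_fails=0 that line is an integer division by zero, which raises
SIGFPE, i.e. exactly the signal 8 in our error log. I assume the fix simply
guards the division, roughly like this (my guess, not necessarily Maxim's
actual patch):

/* hypothetical guard at ngx_http_upstream_round_robin.c:648 */
if (peer->max_fails) {
    peer->current_weight -= peer->weight / peer->max_fails;
}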


I will discuss the further steps with my colleagues tomorrow. Maybe
we'll try Maxim's patch, or nginx 0.7.13.

Marcus Bianchy


Igor Sysoev
2008-08-26 19:04:18 UTC
Permalink
Post by Marcus Bianchy
Post by Igor Sysoev
It seems that you have "max_fails=0" in some upstream.
Maxim's recent patch fixes the bug, or you may try nginx-0.7.12.
Hi Igor,
thanks for the update.
#0  0x08079df3 in ngx_http_upstream_free_round_robin_peer (pc=0x9da9cd8,
    data=0x9daa3a0, state=4) at src/http/ngx_http_upstream_round_robin.c:648
648         peer->current_weight -= peer->weight / peer->max_fails;
I will discuss the further steps with my colleagues tomorrow. Maybe
we'll try Maxim's patch, or nginx 0.7.13.
Yes, this is the same bug.
--
Igor Sysoev
http://sysoev.ru/en/