Ticket #1119 (closed defect: worksforme)

Opened 2 years ago

Last modified 17 months ago

Varnish 3.0.2 froze, Pushing vcls failed:#012CLI communication error (hdr)

Reported by: campisano Owned by:
Priority: normal Milestone:
Component: build Version: 3.0.2
Severity: blocker Keywords: child died pushing vcls failed #012CLI communication error (hdr)
Cc:

Description (last modified by phk) (diff)

Hi,

I have a problem with the current version of Varnish: two (or three) times it has frozen with this log message:

Mar 30 05:16:56 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:06 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:16 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:26 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:36 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:46 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) died signal=3
Mar 30 05:17:55 macleod varnishd[4423]: Child cleanup complete
Mar 30 05:17:56 macleod varnishd[4423]: child (8349) Started
Mar 30 05:18:06 macleod varnishd[4423]: Pushing vcls failed:#012CLI communication error (hdr)
Mar 30 05:18:06 macleod varnishd[4423]: Stopping Child
Mar 30 05:18:06 macleod varnishd[4423]: Child (8349) said Child starts
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) said SMF.s0 mmap'ed 5368709120 bytes of 5368709120
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) said Child dies
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) died status=1
Mar 30 05:18:18 macleod varnishd[4423]: Child cleanup complete

The main process was still active, but it no longer accepted connections on the listening port (requests simply timed out; they were not refused).

I only have the 'strange' varnishstat -1 output in the attachment; how can I increase Varnish's log verbosity?

Thanks in advance, Riccardo

Attachments

varnishstat.log Download (14.7 KB) - added by campisano 2 years ago.
varnishstat -1 output

Change History

Changed 2 years ago by campisano

varnishstat -1 output

comment:1 Changed 2 years ago by campisano

Sorry :\

Mar 30 05:16:56 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:06 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:16 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:26 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:36 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:46 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) not responding to CLI, killing it.
Mar 30 05:17:55 macleod varnishd[4423]: Child (17381) died signal=3
Mar 30 05:17:55 macleod varnishd[4423]: Child cleanup complete
Mar 30 05:17:56 macleod varnishd[4423]: child (8349) Started
Mar 30 05:18:06 macleod varnishd[4423]: Pushing vcls failed:#012CLI communication error (hdr)
Mar 30 05:18:06 macleod varnishd[4423]: Stopping Child
Mar 30 05:18:06 macleod varnishd[4423]: Child (8349) said Child starts
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) said SMF.s0 mmap'ed 5368709120 bytes of 5368709120
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) said Child dies
Mar 30 05:18:18 macleod varnishd[4423]: Child (8349) died status=1
Mar 30 05:18:18 macleod varnishd[4423]: Child cleanup complete

comment:2 Changed 2 years ago by arn

I'm seeing a similar issue.

Varnish 3.0.2 / Centos 5. I've seen Varnish 3.0.2 freeze, not respond to requests. Also have trouble restarting it.

{{{
Apr 9 14:34:20 web4 varnishd[28609]: Child (28611) not responding to CLI, killing it.
Apr 9 14:34:26 web4 last message repeated 2 times
Apr 9 14:34:26 web4 varnishd[28609]: Child (28611) died signal=3
Apr 9 14:34:26 web4 varnishd[28609]: child (9154) Started
Apr 9 14:34:26 web4 varnishd[28609]: Child (9154) said Child starts
Apr 9 14:35:03 web4 varnishd[28609]: Child (9154) not responding to CLI, killing it.
Apr 9 14:35:33 web4 last message repeated 3 times
Apr 9 14:36:14 web4 last message repeated 5 times
Apr 9 14:36:14 web4 varnishd[28609]: Child (9154) died signal=3
Apr 9 14:36:14 web4 varnishd[28609]: child (10178) Started
Apr 9 14:36:15 web4 varnishd[28609]: Child (10178) said Child starts
Apr 9 14:37:02 web4 varnishd[28609]: Child (10178) not responding to CLI, killing it.
Apr 9 14:37:11 web4 last message repeated 2 times
Apr 9 14:37:11 web4 varnishd[28609]: Child (10178) died signal=3
Apr 9 14:37:11 web4 varnishd[28609]: child (10487) Started
Apr 9 14:37:12 web4 varnishd[28609]: Child (10487) said Child starts
Apr 9 14:37:28 web4 varnishd[28609]: Child (10487) not responding to CLI, killing it.
Apr 9 14:37:47 web4 last message repeated 3 times
Apr 9 14:37:47 web4 varnishd[28609]: Child (10487) died signal=3
Apr 9 14:37:47 web4 varnishd[28609]: child (10823) Started
Apr 9 14:37:47 web4 varnishd[28609]: Child (10823) said Child starts
}}}

@campisano: you don't happen to be using NewRelic, do you? These freezes (3 in the last month) started after we began using NewRelic on this machine; not sure if that's just a coincidence.

comment:3 Changed 2 years ago by campisano

@arn: I'm not using NewRelic.

Your log does not contain the message "Pushing vcls failed:#012CLI". The message "Child (XXXX) not responding to CLI, killing it." may be fairly common; I read somewhere that it can be a timeout issue with your backend service (e.g. Apache): if Apache's response time is greater than the maximum timeout of the Varnish worker, the Varnish controller can kill it and abort the request.

You can increase the default timeouts in the backend definition of your Varnish configuration (VCL) file, for example:

  .connect_timeout = 5s;
  .first_byte_timeout = 10s;

I hope this helps. BTW, I don't know the default values...
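
For completeness, the defaults can usually be checked on the running instance over the Varnish CLI; this is only a sketch and assumes varnishadm can reach the local management interface:

{{{
# Ask the running varnishd for the current/default parameter values:
varnishadm param.show connect_timeout
varnishadm param.show first_byte_timeout

# Or list every parameter with its default and description:
varnishadm param.show -l
}}}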

comment:4 Changed 2 years ago by phk

  • Description modified (diff)

comment:5 Changed 2 years ago by phk

The underlying cause is very likely disk-activity pileup, but the exact cause can be many different things.

You can try making the cli_timeout longer, maybe a minute, and see if that changes the picture. Making it longer than a minute is probably not sensible.
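
As a sketch of how that might be done (assuming varnishadm can reach the local management interface, and that your init script keeps the varnishd options in something like DAEMON_OPTS):

{{{
# Raise cli_timeout to 60 seconds on the running instance:
varnishadm param.set cli_timeout 60

# To make the change survive a restart, pass it as a startup option
# to varnishd, e.g. in the DAEMON_OPTS of your init/default file:
#   -p cli_timeout=60
}}}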

comment:6 Changed 2 years ago by campisano

Hi phk,

I increased the connect_timeout to 5s a few days ago. What's the default value for it?

BTW, I don't consider the message 'Child (17381) not responding to CLI, killing it.' a problem in my case; 'Pushing vcls failed:#012CLI communication error (hdr)' looks more interesting, but the real problem is that Varnish froze completely.

Did you see the log file in the attachment?

Thanks.

comment:7 follow-ups: ↓ 10 ↓ 14 Changed 2 years ago by phk

  • Status changed from new to closed
  • Resolution set to worksforme

The connect_timeout has nothing to do with starting the child process.

As I said, you can try to increase cli_timeout, if the problem is disk-i/o pileups.

comment:8 Changed 2 years ago by campisano

I had the same problem again. Varnish froze and a simple restart solved it. I'll try increasing cli_timeout to 60 seconds and report back whether that solves it.

May 20 01:59:39 macleod varnishd[22546]: Child (19229) not responding to CLI, killing it.
May 20 01:59:40 macleod varnishd[22546]: Child (19229) not responding to CLI, killing it.
May 20 01:59:40 macleod varnishd[22546]: Child (19229) died signal=3
May 20 01:59:40 macleod varnishd[22546]: Child cleanup complete
May 20 01:59:40 macleod varnishd[22546]: child (16809) Started
May 20 01:59:40 macleod varnishd[22546]: Child (16809) said Child starts
May 20 01:59:40 macleod varnishd[22546]: Child (16809) said SMF.s0 mmap'ed 536870912 bytes of 536870912
May 20 05:17:58 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:08 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:18 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:28 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:38 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:48 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:18:58 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:19:08 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:19:11 macleod varnishd[22546]: Child (16809) not responding to CLI, killing it.
May 20 05:19:11 macleod varnishd[22546]: Child (16809) died signal=3
May 20 05:19:11 macleod varnishd[22546]: Child cleanup complete
May 20 05:19:11 macleod varnishd[22546]: child (11945) Started
May 20 05:19:21 macleod varnishd[22546]: Pushing vcls failed:#012CLI communication error (hdr)
May 20 05:19:21 macleod varnishd[22546]: Stopping Child
May 20 05:19:21 macleod varnishd[22546]: Child (11945) said Child starts
May 20 05:20:14 macleod varnishd[22546]: Child (11945) said SMF.s0 mmap'ed 536870912 bytes of 536870912
May 20 05:20:14 macleod varnishd[22546]: Child (11945) said Child dies
May 20 05:20:16 macleod varnishd[22546]: Child (11945) died status=1
May 20 05:20:16 macleod varnishd[22546]: Child cleanup complete

comment:9 Changed 2 years ago by campisano

Some links:

the default value is 10s

 https://www.varnish-cache.org/docs/3.0/reference/varnishd.html?highlight=cli_timeout

someone used this parameter to resolve issues that may be similar:

comment:10 in reply to: ↑ 7 ; follow-up: ↓ 11 Changed 20 months ago by lampe

Hi.

Replying to phk:

The connect_timeout has nothing to do with starting the child process.

As I said, you can try to increase cli_timeout, if the problem is disk-i/o pileups.

While I can understand the root cause of the timeouts in master/child communication, and fixed it by moving the shm log to tmpfs, I still think there's a bug, or at least unexpected behaviour, here.

When the master fails to push the initial vcl to the child, it kills the child but does not try to restart it. The master process is left hanging useless without a child and requires a stop/start cycle.

I can reproduce the problem with the shm log on HDD and large I/O, e.g.:

# loop as long as both varnishd processes (manager and child) are alive,
# hammering the disk with synced 20 GB writes:
while [ `pidof varnishd | tr " " "\n" | wc -l` -eq 2 ]; do
    dd if=/dev/zero of=test bs=1M count=20480 conv=fdatasync
done

(varnish 3.0.3)

comment:11 in reply to: ↑ 10 ; follow-up: ↓ 12 Changed 20 months ago by campisano

Replying to lampe:

While I can understand the root cause of the timeouts in master/child communication, and fixed it by moving the shm log to tmpfs, I still think there's a bug, or at least unexpected behaviour, here.

Hi lampe,

is the root cause in the master process that handles the Varnish child workers? Can reducing the size of the shm log on disk resolve the problem?

thanks.

comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed 20 months ago by lampe

Replying to campisano:

is the root the master process that handle the 'varnish childs workers' ?

Not quite. As I understand it (haven't actually checked with the code yet), the child blocks on a write to the shared memory file when cached disk writes exceed a certain amount and the disk is busy. With large RAM, the disk sync can take several seconds. If the child blocks long enough, the VCL upload from the master process times out and the child is terminated but not restarted.

Reducing the shm log size on disk can resolve the problem ??

I don't think so. Putting the shm log file on tmpfs resolved it for me. No physical disk, thus no waiting on the I/O scheduler. The Varnish Book explicitly recommends that the log must not cause physical disk I/O.
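
As a sketch of that workaround (the mount point, size and service name below are assumptions; adjust them to the working directory your varnishd actually uses via -n):

{{{
# Stop Varnish, back its working directory (which holds the _.vsm shm log)
# with tmpfs, then start it again. /var/lib/varnish is the common default
# on Linux packages.
service varnish stop
mount -t tmpfs -o size=200m tmpfs /var/lib/varnish
service varnish start

# To make the mount permanent, add a line like this to /etc/fstab:
#   tmpfs  /var/lib/varnish  tmpfs  defaults,size=200m  0  0
}}}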

I'll also experiment with lowering /proc/sys/vm/dirty_background_bytes to <1GB and increasing dirty_ratio to 50%. See  http://www.kernel.org/doc/Documentation/sysctl/vm.txt for a description of these linux kernel parameters.
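
A sketch of those settings, assuming a 512 MB background threshold (they apply to the whole system, not only to Varnish):

{{{
# Start background writeback earlier (512 MB of dirty data, i.e. < 1 GB)
# and let processes dirty up to 50% of RAM before they block on writes:
sysctl -w vm.dirty_background_bytes=536870912
sysctl -w vm.dirty_ratio=50

# Persist across reboots via /etc/sysctl.conf:
#   vm.dirty_background_bytes = 536870912
#   vm.dirty_ratio = 50
}}}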

comment:13 in reply to: ↑ 12 Changed 20 months ago by campisano

Replying to lampe:

the child blocks on a write to the shared memory file when cached disk writes exceed a certain amount and the disk is busy. With large RAM, the disk sync can take several seconds. If the child blocks long enough, the VCL upload from the master process times out and the child is terminated but not restarted.

Putting the shm log file on tmpfs resolved it for me. No physical disk, thus no waiting on the I/O scheduler. The Varnish Book explicitly recommends that the log must not cause physical disk I/O.

I understand; so the best approach appears to be using RAM directly or via tmpfs. But I don't have enough RAM to dedicate to the Varnish log, so what happens when Varnish wants to store something and has no more space to write? I suppose it discards something old, makes the call to Apache, and everything keeps working; am I right?

However, why use tmpfs when RAM could be used directly? I suppose Varnish would incur a small, unnecessary overhead with tmpfs, or am I wrong?

I'll also experiment with lowering /proc/sys/vm/dirty_background_bytes to <1GB and increasing dirty_ratio to 50%. See  http://www.kernel.org/doc/Documentation/sysctl/vm.txt for a description of these linux kernel parameters.

Nice, but those settings affect every process on the server... Anyway, did you get good results with these settings (without using tmpfs)?

Thank you, lampe.

comment:14 in reply to: ↑ 7 Changed 17 months ago by seldaek

  • Status changed from closed to reopened
  • Resolution worksforme deleted

Replying to phk:

The connect_timeout has nothing to do with starting the child process.

As I said, you can try to increase cli_timeout, if the problem is disk-i/o pileups.

@phk - In light of #10, I believe this should be reopened. I experienced the same thing yesterday: after many months of smooth operation, Varnish killed its child, failed to restart it, and then gave up entirely. I don't think it's right that *one* failure in months results in a useless server that no longer accepts connections. If it failed repeatedly I could maybe see some value in halting the process, but not for one odd failure.

Apologies if reopening this is out of line, but if it is the intended behavior I would like to hear a short explanation why.

See this from my syslog in case it helps, but it's nothing new (and yes I now increased my cli_timeout for safety, but I still think this issue should be addressed):

Nov 24 06:48:37 web varnishd[29768]: Child (26813) not responding to CLI, killing it.
Nov 24 06:49:13 web varnishd[29768]: Child (26813) died signal=3 (core dumped)
Nov 24 06:49:13 web varnishd[29768]: Child cleanup complete
Nov 24 06:49:13 web varnishd[29768]: child (1256) Started
Nov 24 06:49:24 web varnishd[29768]: Pushing vcls failed:#012CLI communication error (hdr)
Nov 24 06:49:24 web varnishd[29768]: Stopping Child
Nov 24 06:49:24 web varnishd[29768]: Child (1256) said Child starts
Nov 24 06:49:27 web varnishd[29768]: Child (1256) said Child dies
Nov 24 06:49:31 web varnishd[29768]: Child (1256) died status=1
Nov 24 06:49:31 web varnishd[29768]: Child cleanup complete

comment:15 follow-up: ↓ 16 Changed 17 months ago by martin

  • Status changed from reopened to closed
  • Resolution set to worksforme

We have added this as a future feature item, to have some parameter to control restart handling. See  https://www.varnish-cache.org/trac/wiki/Future_Usability#Parametertocontrolrestartbehavior

Regards, Martin Blix Grydeland

comment:16 in reply to: ↑ 15 ; follow-up: ↓ 17 Changed 17 months ago by seldaek

Replying to martin:

We have added this as a future feature item, to have some parameter to control restart handling.

Being able to force it to retry N times is an improvement, but I still think it should simply exit at the point where it has no child process alive and cannot start one anymore. Otherwise Varnish keeps running and, as far as the OS and any process-based monitoring are concerned, all is well while the system is actually unusable. This is quite a serious issue IMO.

comment:17 in reply to: ↑ 16 Changed 17 months ago by martin

Replying to seldaek:

Being able to force it to retry N times is an improvement, but I still think it should simply exit at the point where it has no child process alive and cannot start one anymore. Otherwise Varnish keeps running and, as far as the OS and any process-based monitoring are concerned, all is well while the system is actually unusable. This is quite a serious issue IMO.

Added to the wiki page.

Martin
