[Varnish] #1696: Varnish leaves connections in CLOSE_WAIT
Varnish
varnish-bugs at varnish-cache.org
Thu Mar 26 13:20:46 CET 2015
#1696: Varnish leaves connections in CLOSE_WAIT
---------------------+----------------------
Reporter: rwimmer | Type: defect
Status: new | Priority: high
Milestone: | Component: build
Version: 3.0.5 | Severity: critical
Keywords: |
---------------------+----------------------
We saw this problem two weeks ago in a completely different project (in that case with Varnish 3.0.6), but it disappeared as suddenly as it came. Now it's back on another project with Varnish 3.0.5. Varnish runs for some hours, then suddenly stops accepting some requests. Further investigation with lsof and netstat showed around 100000 connections in CLOSE_WAIT state.
An example lsof output (note that Varnish listens on port 16081):
lsof -n 2>&1 | grep varnish | grep CLOSE_WAIT
COMMAND   PID    USER    FD   TYPE  DEVICE     SIZE/OFF  NODE  NAME
varnishd  12682  nobody  3u   IPv4  788349978  0t0       TCP   172.18.97.170:16081->185.15.111.18:35694 (CLOSE_WAIT)
varnishd  12682  nobody  4u   IPv4  788793838  0t0       TCP   172.18.97.170:16081->77.22.87.38:38329 (CLOSE_WAIT)
varnishd  12682  nobody  5u   IPv4  788365096  0t0       TCP   172.18.97.170:16081->46.164.62.152:23263 (CLOSE_WAIT)
varnishd  12682  nobody  6u   IPv4  789230766  0t0       TCP   172.18.97.170:16081->78.131.7.202:27775 (CLOSE_WAIT)
varnishd  12682  nobody  10u  IPv4  789129001  0t0       TCP   172.18.97.170:16081->46.164.62.152:31629 (CLOSE_WAIT)
varnishd  12682  nobody  11u  IPv4  789352359  0t0       TCP   172.18.97.170:16081->188.103.5.90:48825 (CLOSE_WAIT)
varnishd  12682  nobody  18u  IPv4  788518839  0t0       TCP   172.18.97.170:16081->92.50.98.10:59357 (CLOSE_WAIT)
varnishd  12682  nobody  19u  IPv4  788720641  0t0       TCP   172.18.97.170:16081->37.188.135.151:35951 (CLOSE_WAIT)
varnishd  12682  nobody  20u  IPv4  788581054  0t0       TCP   172.18.97.170:16081->81.182.85.246:54467 (CLOSE_WAIT)
varnishd  12682  nobody  22u  IPv4  789159713  0t0       TCP   172.18.97.170:16081->78.63.204.139:46028 (CLOSE_WAIT)
varnishd  12682  nobody  23u  IPv4  789059708  0t0       TCP   172.18.97.170:16081->213.192.25.210:24825 (CLOSE_WAIT)
....
Similar output with netstat -tonp 2>&1:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address        State       PID/Program name  Timer
...
tcp   1      0      172.18.97.172:16081  84.184.219.117:14371   CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  46.39.187.224:52043    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  83.208.43.226:36693    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  188.6.166.158:39417    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  89.233.129.36:16291    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  37.203.126.22:60019    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  37.188.135.151:57363   CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
...
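To quantify the growth we count the CLOSE_WAIT sockets on the listen port once a minute, with a loop roughly like this (a minimal sketch; adjust the port for your setup):
# count CLOSE_WAIT sockets on the Varnish listen port, once a minute
while true; do
    echo "$(date +%H:%M:%S) $(netstat -tn 2>/dev/null | grep ':16081 ' | grep -c CLOSE_WAIT)"
    sleep 60
done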
The number of CLOSE_WAIT connections from clients to Varnish grows every minute until it reaches about 100000. At that point only a restart releases the connections. With dmesg we see messages like
[4395699.833458] TCP: too many orphaned sockets
[4395699.857379] TCP: too many orphaned sockets
[4395700.035472] TCP: too many orphaned sockets
[4395700.062136] TCP: too many orphaned sockets
[4395700.212538] TCP: too many orphaned sockets
[4395700.215065] TCP: too many orphaned sockets
[4395701.105658] TCP: too many orphaned sockets
[4395702.603942] TCP: too many orphaned sockets
[4395707.474580] TCP: too many orphaned sockets
right after a Varnish restart (as far as we can tell, the kernel logs this when net.ipv4.tcp_max_orphans is exceeded, presumably while it reaps the leaked sockets). If we look in the proc filesystem at how many fds the Varnish worker process has open, we see the same number of sockets that netstat and lsof show:
ls -al /proc/26536/fd/ | more
total 0
dr-x------ 2 nobody nogroup 0 Mar 26 10:48 .
dr-xr-xr-x 9 nobody nogroup 0 Mar 26 10:47 ..
lr-x------ 1 nobody nogroup 64 Mar 26 10:48 0 -> /dev/null
l-wx------ 1 nobody nogroup 64 Mar 26 10:48 1 -> pipe:[785486723]
lrwx------ 1 nobody nogroup 64 Mar 26 10:48 10 -> socket:[785975005]
lrwx------ 1 nobody nogroup 64 Mar 26 10:48 100 -> socket:[787172993]
lrwx------ 1 nobody nogroup 64 Mar 26 10:50 1000 -> socket:[786915607]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10000 -> socket:[786585448]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10001 -> socket:[786708112]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10002 -> socket:[787832286]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10003 -> socket:[786695486]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10004 -> socket:[787427586]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10005 -> socket:[787944180]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10006 -> socket:[786718487]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10007 -> socket:[786607488]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10008 -> socket:[786620992]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10009 -> socket:[786874970]
....
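A quick cross-check (just a sketch, using the worker PID from above) confirms that the fd count matches the socket counts from netstat and lsof:
# number of fds held by the worker process
ls /proc/26536/fd | wc -l
# number of CLOSE_WAIT sockets netstat attributes to varnishd
netstat -tnp 2>/dev/null | awk '$6 == "CLOSE_WAIT" && $7 ~ /varnishd/' | wc -l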
Additionally we see a growing number of Varnish open files with "can't identify protocol" in the NAME column of the lsof output (but only about 10% as many as are in CLOSE_WAIT):
lsof -n | grep varn | grep identify | more
varnishd  12682  nobody  21u   sock  0,7  0t0  788513770  can't identify protocol
varnishd  12682  nobody  44u   sock  0,7  0t0  789511210  can't identify protocol
varnishd  12682  nobody  52u   sock  0,7  0t0  789358574  can't identify protocol
varnishd  12682  nobody  66u   sock  0,7  0t0  789329635  can't identify protocol
varnishd  12682  nobody  74u   sock  0,7  0t0  789469904  can't identify protocol
varnishd  12682  nobody  77u   sock  0,7  0t0  788574512  can't identify protocol
varnishd  12682  nobody  102u  sock  0,7  0t0  789022929  can't identify protocol
...
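To compare the two counts we use something like this (again a sketch, with the PID from the output above):
lsof -n -p 12682 2>/dev/null | grep -c CLOSE_WAIT
lsof -n -p 12682 2>/dev/null | grep -c "can't identify protocol"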
The problem starts at about 100000 connections in CLOSE_WAIT, so it almost certainly has something to do with the Varnish session_max parameter, whose default is 100000. We've now raised it to 190000 and also increased the max open files from 130000 to 250000:
cat /proc/26536/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            524288       524288       bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             641255       641255       processes
Max open files            250000       250000       files
Max locked memory         83968000     83968000     bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       641255       641255       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
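For the record, this is roughly how we raised the values (a sketch; param.set is the standard Varnish 3 CLI command, and on our Ubuntu systems the startup values live in /etc/default/varnish - adjust for your packaging):
# raise session_max on the running instance
varnishadm param.set session_max 190000
# to persist across restarts, set it in the startup options,
# e.g. in /etc/default/varnish:
#   DAEMON_OPTS="... -p session_max=190000"
#   NFILES=250000   # becomes the worker's 'Max open files' ulimit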
But why doesn't Varnish release these sockets/connections? Every source we consulted or googled told us to fix the application - which in this case is Varnish... There are no CLOSE_WAIT timeouts in the Linux kernel (we use Ubuntu 12.04 LTS with kernel 3.13, by the way); a socket stays in CLOSE_WAIT until the application that owns it calls close().
With the default limit of session_max=100000 we currently need to restart Varnish 3-4 times a day. We've tweaked a lot of parameters and tried to eliminate keep-alive everywhere (see the VCL sketch below), but nothing changed. Only at night, when requests decrease, does Varnish seem able to release more connections/sockets than it opens.
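The keep-alive elimination boiled down to forcing Connection: close on every response; in VCL 3.0 terms a minimal sketch looks like this:
sub vcl_deliver {
    # ask the client to close its connection after each response
    set resp.http.Connection = "close";
    return (deliver);
}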
We have used this Varnish configuration and version unmodified for several months now without problems, but since yesterday Varnish leaves connections in CLOSE_WAIT until we restart it (and even 3.0.6 had the problem, as mentioned above). In the changelog of 3.0.7 I can't find any bug fix related to our issue.
I've attached a Munin graph which shows how fast the number of Varnish open files grows.
Any hints as to what could cause this issue?
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1696>