[Varnish] #1696: Varnish leaves connections in CLOSE_WAIT
Varnish
varnish-bugs at varnish-cache.org
Thu Mar 26 13:20:46 CET 2015
#1696: Varnish leaves connections in CLOSE_WAIT
---------------------+----------------------
Reporter: rwimmer | Type: defect
Status: new | Priority: high
Milestone: | Component: build
Version: 3.0.5 | Severity: critical
Keywords: |
---------------------+----------------------
We saw this problem two weeks ago in a completely different project (in that case with Varnish 3.0.6), but it disappeared as suddenly as it came. Now it's back on another project with Varnish 3.0.5. Varnish runs for some hours, then suddenly stops accepting some requests. Further investigation with lsof and netstat showed around 100000 connections in CLOSE_WAIT state.
An example lsof output (note that Varnish listens on port 16081):
lsof -n 2>&1 | grep varnish | grep CLOSE_WAIT
COMMAND   PID    USER    FD   TYPE  DEVICE     SIZE/OFF  NODE  NAME
varnishd  12682  nobody  3u   IPv4  788349978  0t0       TCP   172.18.97.170:16081->185.15.111.18:35694 (CLOSE_WAIT)
varnishd  12682  nobody  4u   IPv4  788793838  0t0       TCP   172.18.97.170:16081->77.22.87.38:38329 (CLOSE_WAIT)
varnishd  12682  nobody  5u   IPv4  788365096  0t0       TCP   172.18.97.170:16081->46.164.62.152:23263 (CLOSE_WAIT)
varnishd  12682  nobody  6u   IPv4  789230766  0t0       TCP   172.18.97.170:16081->78.131.7.202:27775 (CLOSE_WAIT)
varnishd  12682  nobody  10u  IPv4  789129001  0t0       TCP   172.18.97.170:16081->46.164.62.152:31629 (CLOSE_WAIT)
varnishd  12682  nobody  11u  IPv4  789352359  0t0       TCP   172.18.97.170:16081->188.103.5.90:48825 (CLOSE_WAIT)
varnishd  12682  nobody  18u  IPv4  788518839  0t0       TCP   172.18.97.170:16081->92.50.98.10:59357 (CLOSE_WAIT)
varnishd  12682  nobody  19u  IPv4  788720641  0t0       TCP   172.18.97.170:16081->37.188.135.151:35951 (CLOSE_WAIT)
varnishd  12682  nobody  20u  IPv4  788581054  0t0       TCP   172.18.97.170:16081->81.182.85.246:54467 (CLOSE_WAIT)
varnishd  12682  nobody  22u  IPv4  789159713  0t0       TCP   172.18.97.170:16081->78.63.204.139:46028 (CLOSE_WAIT)
varnishd  12682  nobody  23u  IPv4  789059708  0t0       TCP   172.18.97.170:16081->213.192.25.210:24825 (CLOSE_WAIT)
....
Similar output with netstat -tonp 2>&1:
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address        Foreign Address        State       PID/Program name  Timer
...
tcp   1      0      172.18.97.172:16081  84.184.219.117:14371   CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  46.39.187.224:52043    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  83.208.43.226:36693    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  188.6.166.158:39417    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  89.233.129.36:16291    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  37.203.126.22:60019    CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
tcp   1      0      172.18.97.172:16081  37.188.135.151:57363   CLOSE_WAIT  26536/varnishd    off (0.00/0/0)
...
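To quantify the growth we count the CLOSE_WAIT sockets on the listen port once a minute, with a loop roughly like this (a minimal sketch; adjust the port for your setup):
# count CLOSE_WAIT sockets on the Varnish listen port, once a minute
while true; do
    echo "$(date +%H:%M:%S) $(netstat -tn 2>/dev/null | grep ':16081 ' | grep -c CLOSE_WAIT)"
    sleep 60
done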
The number of CLOSE_WAIT connections from clients to Varnish grows every minute until it reaches about 100000. At that point only a restart releases the connections. With dmesg we see messages like
[4395699.833458] TCP: too many orphaned sockets
[4395699.857379] TCP: too many orphaned sockets
[4395700.035472] TCP: too many orphaned sockets
[4395700.062136] TCP: too many orphaned sockets
[4395700.212538] TCP: too many orphaned sockets
[4395700.215065] TCP: too many orphaned sockets
[4395701.105658] TCP: too many orphaned sockets
[4395702.603942] TCP: too many orphaned sockets
[4395707.474580] TCP: too many orphaned sockets
right after a Varnish restart (as far as we can tell, the kernel logs this when net.ipv4.tcp_max_orphans is exceeded, presumably while it reaps the leaked sockets). If we look in the proc filesystem at how many fds the Varnish worker process has open, we see the same number of sockets that netstat and lsof show:
ls -al /proc/26536/fd/ | more
total 0
dr-x------ 2 nobody nogroup 0 Mar 26 10:48 .
dr-xr-xr-x 9 nobody nogroup 0 Mar 26 10:47 ..
lr-x------ 1 nobody nogroup 64 Mar 26 10:48 0 -> /dev/null
l-wx------ 1 nobody nogroup 64 Mar 26 10:48 1 -> pipe:[785486723]
lrwx------ 1 nobody nogroup 64 Mar 26 10:48 10 -> socket:[785975005]
lrwx------ 1 nobody nogroup 64 Mar 26 10:48 100 -> socket:[787172993]
lrwx------ 1 nobody nogroup 64 Mar 26 10:50 1000 -> socket:[786915607]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10000 -> socket:[786585448]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10001 -> socket:[786708112]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10002 -> socket:[787832286]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10003 -> socket:[786695486]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10004 -> socket:[787427586]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10005 -> socket:[787944180]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10006 -> socket:[786718487]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10007 -> socket:[786607488]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10008 -> socket:[786620992]
lrwx------ 1 nobody nogroup 64 Mar 26 11:30 10009 -> socket:[786874970]
....
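A quick cross-check (just a sketch, using the worker PID from above) confirms that the fd count matches the socket counts from netstat and lsof:
# number of fds held by the worker process
ls /proc/26536/fd | wc -l
# number of CLOSE_WAIT sockets netstat attributes to varnishd
netstat -tnp 2>/dev/null | awk '$6 == "CLOSE_WAIT" && $7 ~ /varnishd/' | wc -l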
Additionally we see a growing number of Varnish open files with "can't identify protocol" in the NAME column of the lsof output (but only about 10% as many as are in CLOSE_WAIT):
lsof -n | grep varn | grep identify | more
varnishd  12682  nobody  21u   sock  0,7  0t0  788513770  can't identify protocol
varnishd  12682  nobody  44u   sock  0,7  0t0  789511210  can't identify protocol
varnishd  12682  nobody  52u   sock  0,7  0t0  789358574  can't identify protocol
varnishd  12682  nobody  66u   sock  0,7  0t0  789329635  can't identify protocol
varnishd  12682  nobody  74u   sock  0,7  0t0  789469904  can't identify protocol
varnishd  12682  nobody  77u   sock  0,7  0t0  788574512  can't identify protocol
varnishd  12682  nobody  102u  sock  0,7  0t0  789022929  can't identify protocol
...
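To compare the two counts we use something like this (again a sketch, with the PID from the output above):
lsof -n -p 12682 2>/dev/null | grep -c CLOSE_WAIT
lsof -n -p 12682 2>/dev/null | grep -c "can't identify protocol"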
The problem starts at about 100000 connections in CLOSE_WAIT, so it almost certainly has something to do with the Varnish session_max parameter, whose default is 100000. We've now raised it to 190000 and also increased the max open files from 130000 to 250000:
cat /proc/26536/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            524288       524288       bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             641255       641255       processes
Max open files            250000       250000       files
Max locked memory         83968000     83968000     bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       641255       641255       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
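For the record, this is roughly how we raised the values (a sketch; param.set is the standard Varnish 3 CLI command, and on our Ubuntu systems the startup values live in /etc/default/varnish - adjust for your packaging):
# raise session_max on the running instance
varnishadm param.set session_max 190000
# to persist across restarts, set it in the startup options,
# e.g. in /etc/default/varnish:
#   DAEMON_OPTS="... -p session_max=190000"
#   NFILES=250000   # becomes the worker's 'Max open files' ulimit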
But why doesn't Varnish release these sockets/connections? Every source we consulted or googled told us to fix the application - which in this case is Varnish... There are no CLOSE_WAIT timeouts in the Linux kernel (we use Ubuntu 12.04 LTS with kernel 3.13, by the way); a socket stays in CLOSE_WAIT until the application that owns it calls close().
With the default limit of session_max=100000 we currently need to restart Varnish 3-4 times a day. We've tweaked a lot of parameters and tried to eliminate keep-alive everywhere (see the VCL sketch below), but nothing changed. Only at night, when requests decrease, does Varnish seem able to release more connections/sockets than it opens.
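The keep-alive elimination boiled down to forcing Connection: close on every response; in VCL 3.0 terms a minimal sketch looks like this:
sub vcl_deliver {
    # ask the client to close its connection after each response
    set resp.http.Connection = "close";
    return (deliver);
}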
We have used this Varnish configuration and version unmodified for several months now without problems, but since yesterday Varnish leaves connections in CLOSE_WAIT until we restart it (and even 3.0.6 had the problem, as mentioned above). In the changelog of 3.0.7 I can't find any bug fix related to our issue.
I've attached a Munin graph which shows how fast the number of Varnish open files grows.
Any hints as to what could cause this issue?
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1696>