[Varnish] #1696: Varnish leaves connections in CLOSE_WAIT
Varnish
varnish-bugs at varnish-cache.org
Fri Mar 27 11:27:37 CET 2015
#1696: Varnish leaves connections in CLOSE_WAIT
----------------------+--------------------
 Reporter:  rwimmer   |       Owner:
     Type:  defect    |      Status:  new
 Priority:  high      |   Milestone:
Component:  build     |     Version:  3.0.5
 Severity:  critical  |  Resolution:
 Keywords:            |
----------------------+--------------------
Comment (by rwimmer):
We've investigated the issue a bit further. Here you can see the
output of "ss" (a utility to dump socket statistics):
{{{
ss -t -o -p
State       Recv-Q  Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT  1       0       172.18.97.170:16081   217.160.127.250:52031  users:(("varnishd",17527,7217))
CLOSE-WAIT  1       0       172.18.97.170:16081   89.176.201.80:23299    users:(("varnishd",17527,7485))
CLOSE-WAIT  1       0       172.18.97.170:16081   178.24.33.119:14975    users:(("varnishd",17527,5532))
CLOSE-WAIT  1       0       172.18.97.170:16081   212.95.7.209:45404     users:(("varnishd",17527,11271))
CLOSE-WAIT  1       0       172.18.97.170:16081   178.24.33.119:2733     users:(("varnishd",17527,7944))
CLOSE-WAIT  1       0       172.18.97.170:16081   158.195.75.102:55999   users:(("varnishd",17527,5198))
CLOSE-WAIT  1       0       172.18.97.170:16081   188.6.166.158:61071    users:(("varnishd",17527,15792))
CLOSE-WAIT  1       0       172.18.97.170:16081   2.247.148.115:4851     users:(("varnishd",17527,14971))
CLOSE-WAIT  1       0       172.18.97.170:16081   83.77.186.26:27446     users:(("varnishd",17527,12737))
CLOSE-WAIT  1       0       172.18.97.170:16081   37.201.240.119:49509   users:(("varnishd",17527,7343))
CLOSE-WAIT  1       0       172.18.97.170:16081   77.22.87.38:40169      users:(("varnishd",17527,10276))
CLOSE-WAIT  1       0       172.18.97.170:16081   78.139.25.24:31947     users:(("varnishd",17527,5613))
....
}}}
As you can see, all the connections have a receive queue size of 1. All
services besides Varnish have their Recv-Q at 0 whenever the state isn't
ESTAB(lished). We suspect this is why the sockets never get released:
the Recv-Q is not empty.
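A quick way to quantify this is a one-liner like the following (a
sketch; it assumes iproute2's "ss" and that the users:(...) column
stays on one line per socket):
{{{
# count varnishd sockets currently stuck in CLOSE-WAIT (sketch)
ss -t -p state close-wait | grep -c varnishd
}}}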
Additionally we've tried lowering the "tcp_keepalive_time" kernel
parameter from its default of 7200 seconds to 600:
{{{
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
}}}
We hoped that the kernel would detect that this connection isn't doing
anything useful and kick it. But, as expected, nothing happened. The
socket is still owned by Varnish and there is still something in the
Recv-Q.
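Worth noting: tcp_keepalive_time only affects sockets that have
SO_KEEPALIVE enabled, and in CLOSE_WAIT the descriptor can only go away
once the application itself calls close(), so the keepalive timer was
unlikely to help here. For reference, the full set of kernel keepalive
knobs (the values are just examples):
{{{
# these only apply to sockets with SO_KEEPALIVE set
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time    # idle seconds before first probe
echo 60  > /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between probes
echo 5   > /proc/sys/net/ipv4/tcp_keepalive_probes  # failed probes before the kernel gives up
}}}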
Today I restarted Varnish at 10:27. Now it's 11:00. If you have a look at
the /proc fs
{{{
ls -ld /proc/9455
dr-xr-xr-x 9 nobody nogroup 0 Mar 27 10:27 /proc/9455
ls -alrt /proc/9455/fd/ | more
total 0
dr-xr-xr-x 9 nobody nogroup 0 Mar 27 10:27 ..
dr-x------ 2 nobody nogroup 0 Mar 27 10:27 .
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 99 -> socket:[812049314]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 98 -> socket:[812187431]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 97 -> socket:[812301205]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 96 -> socket:[812036401]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 95 -> socket:[812083869]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 94 -> socket:[812773476]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 93 -> socket:[812231242]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 92 -> socket:[812021756]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 91 -> socket:[812498505]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 90 -> socket:[812351724]
...
}}}
you can see that the oldest socket connection is already 30 minutes old
(before I restarted Varnish today I could see thousands of sockets that
had existed for over 10 hours). If you now grep for the oldest socket
inode with lsof
{{{
lsof -n | grep 812049314
varnishd 9455 nobody 99u IPv4 812049314 0t0 TCP 172.18.97.170:16081->84.42.225.26:37078 (CLOSE_WAIT)
}}}
you see that this connection is in CLOSE_WAIT (as are most of the
Varnish sockets). Looking again hours later, the socket is still there;
it only goes away when Varnish is restarted.
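To cross-reference all of a process's sockets this way, the inodes can
be pulled straight out of /proc (a sketch using the PID from above; the
awk field split assumes the usual "socket:[inode]" link format):
{{{
# list the socket inodes held by varnishd PID 9455 (sketch),
# splitting each "-> socket:[812049314]" link on the brackets
ls -l /proc/9455/fd | awk -F'[][]' '/socket:/ {print $2}'
}}}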
We've now set the Varnish parameter session_max to 500000 and the max
number of open files for the Varnish process to 600000:
{{{
cat /proc/9455/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            524288               524288               bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             641254               641254               processes
Max open files            600000               600000               files
Max locked memory         83968000             83968000             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       641254               641254               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
}}}
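The limits were raised roughly along these lines (a sketch; the VCL
path and listen address are placeholders, only the ulimit and the
-p session_max override are the actual changes):
{{{
# raise the fd ceiling for the shell that starts varnishd,
# then pass the session_max override on the command line
ulimit -n 600000
varnishd -f /etc/varnish/default.vcl -a :16081 -p session_max=500000
}}}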
But is this REALLY the solution? If the number of connections doubles,
we would have to raise these values to over a million. I don't know
whether that has any side effects.
Traffic and connections peak at about 10 p.m., drop to a low rate
around midnight, and start to increase again at 6 a.m. The crazy thing
is that the number of open files also starts to decrease around
midnight, but it takes 5 hours for Varnish's open file count to drop
from 50000 to a few hundred.
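That slow decay is easy to watch with a trivial loop like this (a
sketch, reusing PID 9455 from above) that logs the open-fd count over
time:
{{{
# log varnishd's open file descriptor count every 5 minutes (sketch)
while true; do
    echo "$(date '+%F %T') $(ls /proc/9455/fd | wc -l)"
    sleep 300
done
}}}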
If you search Google for this topic you find a lot of unanswered
threads and no real solution. It would be great if we could find some
kind of solution to this issue. We've now had it on two completely
different sites with different kernels, different OSes and different
Varnish versions. Even a hint as to whether raising session_max and the
max open files limit to a very high number is okay would be very helpful!
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1696#comment:1>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator