[Varnish] #1696: Varnish leaves connections in CLOSE_WAIT
Varnish
varnish-bugs at varnish-cache.org
Fri Mar 27 11:27:37 CET 2015
#1696: Varnish leaves connections in CLOSE_WAIT
----------------------+--------------------
 Reporter:  rwimmer   |       Owner:
     Type:  defect    |      Status:  new
 Priority:  high      |   Milestone:
Component:  build     |     Version:  3.0.5
 Severity:  critical  |  Resolution:
 Keywords:            |
----------------------+--------------------
Comment (by rwimmer):
We've investigated the issue a bit further. Here you can see the
output of "ss" (a utility to dump socket statistics):
{{{
ss -t -o -p
State       Recv-Q  Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT  1       0       172.18.97.170:16081   217.160.127.250:52031  users:(("varnishd",17527,7217))
CLOSE-WAIT  1       0       172.18.97.170:16081   89.176.201.80:23299    users:(("varnishd",17527,7485))
CLOSE-WAIT  1       0       172.18.97.170:16081   178.24.33.119:14975    users:(("varnishd",17527,5532))
CLOSE-WAIT  1       0       172.18.97.170:16081   212.95.7.209:45404     users:(("varnishd",17527,11271))
CLOSE-WAIT  1       0       172.18.97.170:16081   178.24.33.119:2733     users:(("varnishd",17527,7944))
CLOSE-WAIT  1       0       172.18.97.170:16081   158.195.75.102:55999   users:(("varnishd",17527,5198))
CLOSE-WAIT  1       0       172.18.97.170:16081   188.6.166.158:61071    users:(("varnishd",17527,15792))
CLOSE-WAIT  1       0       172.18.97.170:16081   2.247.148.115:4851     users:(("varnishd",17527,14971))
CLOSE-WAIT  1       0       172.18.97.170:16081   83.77.186.26:27446     users:(("varnishd",17527,12737))
CLOSE-WAIT  1       0       172.18.97.170:16081   37.201.240.119:49509   users:(("varnishd",17527,7343))
CLOSE-WAIT  1       0       172.18.97.170:16081   77.22.87.38:40169      users:(("varnishd",17527,10276))
CLOSE-WAIT  1       0       172.18.97.170:16081   78.139.25.24:31947     users:(("varnishd",17527,5613))
....
}}}
As you can see, all the connections have a receive queue size of 1. All
services besides Varnish have their Recv-Q at 0 whenever the state isn't
ESTAB(lished). We suspect this is why the sockets never get released:
the Recv-Q is not empty.
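A quick way to quantify this is a one-liner like the following (a
sketch; it assumes iproute2's "ss" and that the users:(...) column
stays on one line per socket):
{{{
# count varnishd sockets currently stuck in CLOSE-WAIT (sketch)
ss -t -p state close-wait | grep -c varnishd
}}}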
Additionally we've tried lowering the "tcp_keepalive_time" kernel
parameter from its default of 7200 seconds to 600:
{{{
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
}}}
We hoped that the kernel would detect that this connection isn't doing
anything useful and kick it. But, as expected, nothing happened. The
socket is still owned by Varnish and there is still something in the
Recv-Q.
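Worth noting: tcp_keepalive_time only affects sockets that have
SO_KEEPALIVE enabled, and in CLOSE_WAIT the descriptor can only go away
once the application itself calls close(), so the keepalive timer was
unlikely to help here. For reference, the full set of kernel keepalive
knobs (the values are just examples):
{{{
# these only apply to sockets with SO_KEEPALIVE set
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time    # idle seconds before first probe
echo 60  > /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between probes
echo 5   > /proc/sys/net/ipv4/tcp_keepalive_probes  # failed probes before the kernel gives up
}}}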
Today I restarted Varnish at 10:27. Now it's 11:00. If you have a look at
the /proc fs
{{{
ls -ld /proc/9455
dr-xr-xr-x 9 nobody nogroup 0 Mar 27 10:27 /proc/9455
ls -alrt /proc/9455/fd/ | more
total 0
dr-xr-xr-x 9 nobody nogroup 0 Mar 27 10:27 ..
dr-x------ 2 nobody nogroup 0 Mar 27 10:27 .
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 99 -> socket:[812049314]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 98 -> socket:[812187431]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 97 -> socket:[812301205]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 96 -> socket:[812036401]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 95 -> socket:[812083869]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 94 -> socket:[812773476]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 93 -> socket:[812231242]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 92 -> socket:[812021756]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 91 -> socket:[812498505]
lrwx------ 1 nobody nogroup 64 Mar 27 10:28 90 -> socket:[812351724]
...
}}}
you can see that the oldest socket connection is already 30 minutes old
(before I restarted Varnish today I could see thousands of sockets that
had existed for over 10 hours). If you now grep for the oldest socket
inode with lsof
{{{
lsof -n | grep 812049314
varnishd 9455 nobody 99u IPv4 812049314 0t0 TCP 172.18.97.170:16081->84.42.225.26:37078 (CLOSE_WAIT)
}}}
you see that this connection is in CLOSE_WAIT (as are most of the
Varnish sockets). Looking again hours later, the socket is still there;
it only goes away when Varnish is restarted.
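To cross-reference all of a process's sockets this way, the inodes can
be pulled straight out of /proc (a sketch using the PID from above; the
awk field split assumes the usual "socket:[inode]" link format):
{{{
# list the socket inodes held by varnishd PID 9455 (sketch),
# splitting each "-> socket:[812049314]" link on the brackets
ls -l /proc/9455/fd | awk -F'[][]' '/socket:/ {print $2}'
}}}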
We've now set the Varnish parameter session_max to 500000 and the max
number of open files for the Varnish process to 600000:
{{{
cat /proc/9455/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            524288               524288               bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             641254               641254               processes
Max open files            600000               600000               files
Max locked memory         83968000             83968000             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       641254               641254               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
}}}
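The limits were raised roughly along these lines (a sketch; the VCL
path and listen address are placeholders, only the ulimit and the
-p session_max override are the actual changes):
{{{
# raise the fd ceiling for the shell that starts varnishd,
# then pass the session_max override on the command line
ulimit -n 600000
varnishd -f /etc/varnish/default.vcl -a :16081 -p session_max=500000
}}}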
But is this REALLY the solution? If the number of connections doubles,
we would have to raise these values to over a million. I don't know
whether that has any side effects.
Traffic and connections peak at about 10 p.m., drop to a low rate
around midnight, and start to increase again at 6 a.m. The crazy thing
is that the number of open files also starts to decrease around
midnight, but it takes 5 hours for Varnish's open file count to drop
from 50000 to a few hundred.
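That slow decay is easy to watch with a trivial loop like this (a
sketch, reusing PID 9455 from above) that logs the open-fd count over
time:
{{{
# log varnishd's open file descriptor count every 5 minutes (sketch)
while true; do
    echo "$(date '+%F %T') $(ls /proc/9455/fd | wc -l)"
    sleep 300
done
}}}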
If you search Google for this topic you find a lot of unanswered
threads and no real solution. It would be great if we could find some
kind of solution to this issue. We've now had it on two completely
different sites with different kernels, different OSes and different
Varnish versions. Even a hint as to whether raising session_max and the
max open files limit to a very high number is okay would be very helpful!
--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1696#comment:1>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator