Ticket #1252 (closed defect: fixed)

Opened 15 months ago

Last modified 14 months ago

varnish retries fetch from unresponsive backend indefinitely, leading to FD exhaustion

Reported by: askalski
Owned by: martin
Priority: normal
Milestone:
Component: build
Version: trunk
Severity: normal
Keywords:
Cc:

Description

When a backend request fails (e.g. on first_byte_timeout), varnish returns a 503 error to only one of the clients waiting on that object. It then retries the backend request, and keeps doing so for as long as there are sessions on the waiting list. These retries are made even for clients that have already timed out and closed their HTTP connection to varnish. This queueing of clients and serialization of backend requests leads to file descriptor exhaustion.

This was originally observed on 2.1.5 with the DNS director, but has been reproduced on trunk with a single backend (no director), and also with the random director configured with .retries=1.
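
The serialization is easy to observe directly. With the reproduction setup from the steps below in place, start a handful of concurrent requests with no client-side timeout; given the behavior described above, the 503s should come back staggered roughly one first_byte_timeout apart, one failed backend fetch per waiting client. A sketch (the client count and URL are arbitrary):

for i in 1 2 3 4; do
    # -w prints status and elapsed time as each client completes; with a
    # 5s first_byte_timeout, expect 503s at roughly 5s, 10s, 15s and 20s
    curl -s -o /dev/null \
         -w "client $i: HTTP %{http_code} after %{time_total}s\n" \
         http://localhost:6081/slow.php &
done
wait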

Suggested behavior

After some number of failed retries, varnish should flush out the object's entire waiting list with an immediate 503 error.
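
One way to watch the retry loop (and to verify that a fix actually flushes the waiting list) is from the shared memory log. Assuming Varnish 3-era varnishlog options and tag names, a new BackendOpen followed by a FetchError should appear once per first_byte_timeout for as long as the waiting list is non-empty, even after every client has disconnected:

$ varnishlog -i BackendOpen -i FetchError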

Steps to reproduce

Start with a fresh build of trunk. Configure the backend with a short .first_byte_timeout such as 5 seconds.

default.vcl

backend default {
        .host = "127.0.0.1";
        .port = "80";
        .first_byte_timeout = 5s;
}

On the backend, install a script that delays its response for longer than first_byte_timeout. For example, this PHP script sleeps for 10 seconds before returning:

slow.php

<?php sleep(10); ?>
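
Any backend that accepts connections but never sends a response byte should work just as well. If PHP is not handy, a netcat listener is a minimal stand-in (a sketch; flag syntax differs between BSD and GNU netcat, an unprivileged port means changing .port in the VCL accordingly, and the brief gap between listens may turn some retries into immediate connection failures):

$ while true; do nc -l 8080; done    # GNU netcat: nc -l -p 8080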

Start varnish:

$ /tmp/x/sbin/varnishd -d -f /tmp/x/etc/varnish/default.vcl -a localhost:6081
Message from VCC-compiler:
Not running as root, no priv-sep
Message from C-compiler:
Not running as root, no priv-sep
Message from dlopen:
Not running as root, no priv-sep
Platform: Linux,3.2.0-36-generic,x86_64,-sfile,-smalloc,-hcritbit
200 273     
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,3.2.0-36-generic,x86_64,-sfile,-smalloc,-hcritbit
varnish-trunk revision d175906

Type 'help' for command list.
Type 'quit' to close CLI session.
Type 'start' to launch worker process.

start
child (9286) Started
200 0       

Child (9286) said Not running as root, no priv-sep
Child (9286) said Child starts
Child (9286) said SMF.s0 mmap'ed 104857600 bytes of 104857600

In one window, use "lsof" to monitor varnish's open file descriptors:

$ watch 'lsof -P -n -p 9286'

Every 2.0s: lsof -P -n -p 9286 | grep TCP                                                                   Mon Jan 21 16:22:33 2013

varnishd 9286 askalski    5u  IPv4             680493       0t0     TCP 127.0.0.1:6081 (LISTEN)

In another window, run curl in a loop to request the delay script, using a timeout shorter than first_byte_timeout (for example, 1 second).

$ while sleep 1; do curl -m1 http://localhost:6081/slow.php; done
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
curl: (28) Operation timed out after 1001 milliseconds with 0 bytes received
...

Watch as varnish's file descriptor list grows:

Every 2.0s: lsof -P -n -p 9286 | grep TCP                                                                   Mon Jan 21 16:25:22 2013

varnishd 9286 askalski    5u  IPv4             680493       0t0     TCP 127.0.0.1:6081 (LISTEN)
varnishd 9286 askalski   13u  IPv4             682462       0t0     TCP 127.0.0.1:6081->127.0.0.1:57800 (CLOSE_WAIT)
varnishd 9286 askalski   14u  IPv4             680904       0t0     TCP 127.0.0.1:6081->127.0.0.1:57793 (CLOSE_WAIT)
varnishd 9286 askalski   15u  IPv4             683250       0t0     TCP 127.0.0.1:6081->127.0.0.1:57774 (CLOSE_WAIT)
varnishd 9286 askalski   16u  IPv4             684243       0t0     TCP 127.0.0.1:40133->127.0.0.1:80 (ESTABLISHED)
varnishd 9286 askalski   17u  IPv4             682412       0t0     TCP 127.0.0.1:6081->127.0.0.1:57788 (CLOSE_WAIT)
varnishd 9286 askalski   18u  IPv4             685075       0t0     TCP 127.0.0.1:6081->127.0.0.1:57809 (CLOSE_WAIT)
varnishd 9286 askalski   19u  IPv4             683301       0t0     TCP 127.0.0.1:6081->127.0.0.1:57780 (CLOSE_WAIT)
varnishd 9286 askalski   20u  IPv4             680907       0t0     TCP 127.0.0.1:6081->127.0.0.1:57795 (CLOSE_WAIT)
varnishd 9286 askalski   21u  IPv4             682466       0t0     TCP 127.0.0.1:6081->127.0.0.1:57802 (CLOSE_WAIT)
varnishd 9286 askalski   22u  IPv4             680883       0t0     TCP 127.0.0.1:6081->127.0.0.1:57787 (CLOSE_WAIT)
varnishd 9286 askalski   23u  IPv4             680888       0t0     TCP 127.0.0.1:6081->127.0.0.1:57790 (CLOSE_WAIT)
varnishd 9286 askalski   24u  IPv4             685068       0t0     TCP 127.0.0.1:6081->127.0.0.1:57807 (CLOSE_WAIT)
varnishd 9286 askalski   26u  IPv4             682457       0t0     TCP 127.0.0.1:6081->127.0.0.1:57797 (CLOSE_WAIT)
varnishd 9286 askalski   27u  IPv4             680924       0t0     TCP 127.0.0.1:6081->127.0.0.1:57798 (CLOSE_WAIT)
varnishd 9286 askalski   28u  IPv4             680959       0t0     TCP 127.0.0.1:6081->127.0.0.1:57801 (CLOSE_WAIT)
varnishd 9286 askalski   29u  IPv4             685059       0t0     TCP 127.0.0.1:6081->127.0.0.1:57804 (CLOSE_WAIT)
varnishd 9286 askalski   30u  IPv4             682499       0t0     TCP 127.0.0.1:6081->127.0.0.1:57805 (CLOSE_WAIT)
varnishd 9286 askalski   31u  IPv4             682536       0t0     TCP 127.0.0.1:6081->127.0.0.1:57808 (CLOSE_WAIT)
varnishd 9286 askalski   32u  IPv4             682555       0t0     TCP 127.0.0.1:6081->127.0.0.1:57811 (CLOSE_WAIT)
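
To quantify the leak without scanning the full listing, counting the CLOSE_WAIT sockets is enough; the count climbs by one for each curl timeout and never comes back down while the retry loop runs:

$ while sleep 5; do lsof -P -n -p 9286 | grep -c CLOSE_WAIT; done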

Change History

comment:1 Changed 15 months ago by martin

  • Owner set to martin

comment:2 Changed 14 months ago by martin

  • Status changed from new to closed
  • Resolution set to fixed