Ticket #849 (closed defect: invalid)

Opened 3 years ago

Last modified 12 months ago

Session timeout while receiving POST data from client causes multiple broken backend requests

Reported by: lew Owned by: tfheen
Priority: normal Milestone: Varnish 3.0 dev
Component: varnishd Version: 2.1.4
Severity: normal Keywords: 503, post, backend write error: 11 (Resource temporarily unavailable)
Cc:

Description

The default session timeout of 5s was causing a lot of broken POSTs on our backends, and 503 errors in our varnish logs (showing up as "backend write error: 11 (Resource temporarily unavailable)") - particularly with one of our sites that serves mainly mobile clients.

It appears that during a POST, if the session timeout is exceeded while varnish is waiting for data from the client, an incomplete/invalid POST request is made to the backend, and then tried again after another 5 seconds.

I'm not sure what the 'right' way to handle this would be, but this doesn't seem like good behaviour currently.

For the moment I'm working around the issue by piping POSTs.

Change History

comment:1 Changed 3 years ago by rtsh

I believe this is an issue. It affects us in our production environment when receiving requests from people on slow connections (typically it's affected visitors from India and the Phillipines, as well as mobile devices). We've worked around this by increasing sess_timeout to 60s as per the original comment. However, I feel this is closely associated with ticket 748.

To help someone debug this further, I have a precise test case attached. This was done on:

[root@xx ~]# rpm -qa |grep varnish
varnish-libs-2.1.5-1
varnish-2.1.5-1
[root@xx ~]# uname -a
Linux xx 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@xx ~]# cat /etc/redhat-release 
CentOS release 5.5 (Final)
[root@xx ~]#

To reproduce the issue, install varnish from RPM, then change /etc/sysconfig/varnish to be:

NFILES=131072
MEMLOCK=82000
DAEMON_OPTS="-a 10.1.2.130:80  \
             -T :6082 \
             -f /etc/varnish/default.vcl \
             -u varnish -g varnish \
             -p purge_dups=on \
             -p shm_workspace=32768 \
	     -p ping_interval=1 \
             -p sess_workspace=65536 \
             -s file,/var/lib/varnish/varnish_storage.bin,1G"

Note that ping_interval is set to 1 simply to help produce more detailed timestamps in the varnishlog. Create a minimal VCL file, as follows:

backend b1 { .host = "127.0.0.1"; .port = "80"; .connect_timeout = 30s; .first_byte_timeout = 30s; .between_bytes_timeout = 30s; }
backend b2 { .host = "127.0.0.1"; .port = "80"; .connect_timeout = 30s; .first_byte_timeout = 30s; .between_bytes_timeout = 30s; }
backend b3 { .host = "127.0.0.1"; .port = "80"; .connect_timeout = 30s; .first_byte_timeout = 30s; .between_bytes_timeout = 30s; }

director pool random {
  { .backend = b1; .weight = 1; }
  { .backend = b2; .weight = 1; }
  { .backend = b3; .weight = 1; }
}


sub vcl_recv {
  set req.backend = pool;
  return (pass);
}

sub vcl_error { 
  if (req.restarts<2) {
    restart;
  }
}

To test, simply telnet to varnish on port 80 and send an incomplete POST request (just copy and paste as far as "test" and a new line - the HTTP/1.1 503 response comes back from Varnish after some time.)

[root@xx ~]# telnet a.b.c.130 80
Trying a.b.c.130...
Connected to xx (a.b.c.130).
Escape character is '^]'.
POST / HTTP/1.1
Host: foo
Content-length: 400

test
HTTP/1.1 503 Service Unavailable
Server: Varnish
Retry-After: 0
Content-Type: text/html; charset=utf-8
Content-Length: 419
Date: Fri, 18 Feb 2011 19:13:59 GMT
X-Varnish: 2120546936
Age: 15
Via: 1.1 varnish
Connection: close


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>503 Service Unavailable</title>
  </head>
  <body>
    <h1>Error 503 Service Unavailable</h1>
    <p>Service Unavailable</p>
    <h3>Guru Meditation:</h3>
    <p>XID: 2120546936</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
Connection closed by foreign host.
[root@xx ~]# 

The varnishlog follows. What is important to note is that despite each backend's timeout being set to 30 seconds for each possible timeout parameter, the total time to serve the request was 15 seconds, which equates to three times the sess_timeout parameter, and isn't affected by the backend definitions in the VCL.

    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056416 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056417 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056418 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056419 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056420 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056421 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056422 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056423 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056424 1.0
   11 SessionOpen  c a.b.c.130 48442 a.b.c.130:80
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056425 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056426 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056427 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056428 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056429 1.0
   11 ReqStart     c a.b.c.130 48442 2120546936
   11 RxRequest    c POST
   11 RxURL        c /
   11 RxProtocol   c HTTP/1.1
   11 RxHeader     c Host: foo
   11 RxHeader     c Content-length: 400
   11 VCL_call     c recv
   11 VCL_return   c pass
   11 VCL_call     c hash
   11 VCL_return   c hash
   11 VCL_call     c pass
   11 VCL_return   c pass
   13 BackendOpen  b b1 127.0.0.1 34200 127.0.0.1 80
   11 Backend      c 13 pool b1
   13 TxRequest    b POST
   13 TxURL        b /
   13 TxProtocol   b HTTP/1.1
   13 TxHeader     b Host: foo
   13 TxHeader     b Content-length: 400
   13 TxHeader     b X-Varnish: 2120546936
   11 FetchError   c backend write error: 11 (Resource temporarily unavailable)
   13 BackendClose b b1
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056430 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056431 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056432 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056433 1.0
   11 VCL_call     c error
   11 VCL_return   c restart
   11 VCL_call     c recv
   11 VCL_return   c pass
   11 VCL_call     c hash
   11 VCL_return   c hash
   11 VCL_call     c pass
   11 VCL_return   c pass
   13 BackendOpen  b b2 127.0.0.1 34201 127.0.0.1 80
   11 Backend      c 13 pool b2
   13 TxRequest    b POST
   13 TxURL        b /
   13 TxProtocol   b HTTP/1.1
   13 TxHeader     b Host: foo
   13 TxHeader     b Content-length: 400
   13 TxHeader     b X-Varnish: 2120546936
   11 FetchError   c backend write error: 11 (Resource temporarily unavailable)
   13 BackendClose b b2
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056434 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056435 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056436 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056437 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056438 1.0
   11 VCL_call     c error
   11 VCL_return   c restart
   11 VCL_call     c recv
   11 VCL_return   c pass
   11 VCL_call     c hash
   11 VCL_return   c hash
   11 VCL_call     c pass
   11 VCL_return   c pass
   13 BackendOpen  b b3 127.0.0.1 34202 127.0.0.1 80
   11 Backend      c 13 pool b3
   13 TxRequest    b POST
   13 TxURL        b /
   13 TxProtocol   b HTTP/1.1
   13 TxHeader     b Host: foo
   13 TxHeader     b Content-length: 400
   13 TxHeader     b X-Varnish: 2120546936
   11 FetchError   c backend write error: 11 (Resource temporarily unavailable)
   13 BackendClose b b3
   11 VCL_call     c error
   11 VCL_return   c deliver
   11 VCL_call     c deliver
   11 VCL_return   c deliver
   11 TxProtocol   c HTTP/1.1
   11 TxStatus     c 503
   11 TxResponse   c Service Unavailable
   11 TxHeader     c Server: Varnish
   11 TxHeader     c Retry-After: 0
   11 TxHeader     c Content-Type: text/html; charset=utf-8
   11 TxHeader     c Content-Length: 419
   11 TxHeader     c Date: Fri, 18 Feb 2011 19:13:59 GMT
   11 TxHeader     c X-Varnish: 2120546936
   11 TxHeader     c Age: 15
   11 TxHeader     c Via: 1.1 varnish
   11 TxHeader     c Connection: close
   11 Length       c 419
   11 ReqEnd       c 2120546936 1298056424.244601965 1298056439.268516064 0.000581026 15.023117065 0.000797033
   11 SessionClose c error
   11 StatSess     c a.b.c.130 48442 15 1 1 0 3 0 236 419
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056439 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056440 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056441 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056442 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056443 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056444 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056445 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056446 1.0
    0 CLI          - Rd ping
    0 CLI          - Wr 200 19 PONG 1298056447 1.0

comment:2 Changed 3 years ago by kristian

  • Owner set to kristian
  • Milestone set to Varnish 3.0 dev

I'm testing and assessing impact for 3.0.

comment:3 Changed 3 years ago by nathan

I can also confirm this issue, worked around using:

    /* pipeline post requests trac #4124 */
    if (req.request == "POST") {
        return(pipe);
    }

and added this:

sub vcl_pipe {
    /* Force the connection to be closed afterwards so subsequent reqs
don't use pipe */
    set bereq.http.connection = "close";
}

Issue was happening with post file uploads from clients on a slow connection.

comment:4 Changed 2 years ago by tfheen

  • Owner changed from kristian to tfheen

comment:5 follow-up: ↓ 10 Changed 2 years ago by phk

See also #1041.

There are a couple of issues in this:

  1. We use sess_timeout also during client->varnish body transfer. It might be better with separate first/between timeouts like we have on the backend side.
  1. We don't buffer the body we get from the client, so restart from VCL is never a good idea.
  1. We have the automagic "try another fd" pseudo-restart, and it might happen in the middle of a body from the client. This would be less of an issue if 2. is fixed, but we still need a better way to handle that.

comment:6 Changed 2 years ago by Emil

We had this issue too, resulting in image uploads failing.

It gave us a bunch of these in the backen logs.

[Wed Apr 25 12:21:56 2012] [warn] [client x.x.x.x] (70014)End of file found: mod_fcgid: can't get data from http client, referer: http://www.example.com/the-url

In varnish we got the same error as reported above.

12 FetchError   c backend write error: 11

I think the severity of this issue should be raised.

comment:7 follow-up: ↓ 8 Changed 22 months ago by vrobert

Hello,

we had this issue too. I've done some tests: disabling keepalive or increasing the keepalive on apache does nothing.

The thing is bizzare because some of our customers seem to be more affected than others. (Slow connection, perhaps but not sure)

I don't want to use pipe. Increasing of sess_timeout have extended the waiting time before the error arrives, but the number of errors is always the same.

Statistics on 06/26/2012: 1571434 POST - 3336 POST with 503 (0.21%) - sess_timeout = 5 Statistics on 06/27/2012: 1391161 POST - 3011 POST with 503 (0.19%) - sess_timeout = 5 Statistics on 06/28/2012: 1322403 POST - 2493 POST with 503 (0.20%) - sess_timeout = 5 Statistics on 06/29/2012: 1328187 POST - 2645 POST with 503 (0.21%) - sess_timeout = 5 & 10 Statistics on 06/30/2012: 1201337 POST - 2593 POST with 503 (0.22%) - sess_timeout = 10 Statistics on 07/01/2012: 1270146 POST - 2841 POST with 503 (0.21%) - sess_timeout = 10

comment:8 in reply to: ↑ 7 Changed 22 months ago by vrobert

I forgot to write that we use Varnish 2.1.5.

Replying to vrobert:

Hello,

we had this issue too. I've done some tests: disabling keepalive or increasing the keepalive on apache does nothing.

The thing is bizzare because some of our customers seem to be more affected than others. (Slow connection, perhaps but not sure)

I don't want to use pipe. Increasing of sess_timeout have extended the waiting time before the error arrives, but the number of errors is always the same.

Statistics on 06/26/2012: 1571434 POST - 3336 POST with 503 (0.21%) - sess_timeout = 5 Statistics on 06/27/2012: 1391161 POST - 3011 POST with 503 (0.19%) - sess_timeout = 5 Statistics on 06/28/2012: 1322403 POST - 2493 POST with 503 (0.20%) - sess_timeout = 5 Statistics on 06/29/2012: 1328187 POST - 2645 POST with 503 (0.21%) - sess_timeout = 5 & 10 Statistics on 06/30/2012: 1201337 POST - 2593 POST with 503 (0.22%) - sess_timeout = 10 Statistics on 07/01/2012: 1270146 POST - 2841 POST with 503 (0.21%) - sess_timeout = 10

comment:9 Changed 22 months ago by phk

  • Status changed from new to closed
  • Resolution set to invalid

The problem is that we don't buffer the request-body (if there is one).

This is really a feature-request, so I'm closing this ticket.

We have an item about this in Future_Feature.

comment:10 in reply to: ↑ 5 Changed 12 months ago by ruben

Adding this as a new ticket.

Last edited 12 months ago by ruben (previous) (diff)
Note: See TracTickets for help on using tickets.