Ticket #1257 (closed defect: worksforme)

Opened 15 months ago

Last modified 9 months ago

Varnish restarting itself, large cache.

Reported by: anders-bazoom Owned by: phk
Priority: normal Milestone:
Component: build Version: 3.0.3
Severity: major Keywords:
Cc:

Description

Okay, we have a Varnish server in front of an image server. We recently tried the file storage backend instead of malloc, to increase our hit rate.

But it's giving us some problems. The master process restarts the child process at seemingly random (probably not actually random :p) intervals. Sometimes after 20 minutes; the longest it has lasted is about 48 hours. The last time it restarted I saw a panic message for the first time. The previous restarts only produced "died signal=6".

 http://pastebin.com/BTFzeYRd

varnish> panic.show
200
Last panic at: Tue, 29 Jan 2013 10:25:13 GMT
Assert error in default_oc_getobj(), stevedore.c line 65:
  Condition(((o))->magic == (0x32851d42)) not true.
thread = (cache-worker)
ident = Linux,2.6.32-37-server,x86_64,-sfile,-smalloc,-hcritbit,epoll
Backtrace:
  0x430768: /usr/sbin/varnishd() [0x430768]
  0x44867c: /usr/sbin/varnishd() [0x44867c]
  0x429f36: /usr/sbin/varnishd(HSH_Lookup+0x3a6) [0x429f36]
  0x416b19: /usr/sbin/varnishd() [0x416b19]
  0x41a265: /usr/sbin/varnishd(CNT_Session+0x705) [0x41a265]
  0x4324b1: /usr/sbin/varnishd() [0x4324b1]
  0x7f19771e89ca: /lib/libpthread.so.0(+0x69ca) [0x7f19771e89ca]
  0x7f1976f4521d: /lib/libc.so.6(clone+0x6d) [0x7f1976f4521d]
sp = 0x7e76e7665008 {
  fd = 218, id = 218, xid = 1472043706,
  client = 66.249.76.55 42918,
  step = STP_LOOKUP,
  handling = hash,
  restarts = 0, esi_level = 0
  flags =
  bodystatus = 4
  ws = 0x7e76e7665080 {
    id = "sess",
    {s,f,r,e} = {0x7e76e7665c78,+464,+65536,+65536},
  },
  http[req] = {
    ws = 0x7e76e7665080[sess]
      "GET",
      "/bruger/7/70/52/33443/EEEEEE/28-01-2013_213537",
      "HTTP/1.1",
      "Referer: http://www.bilgalleri.dk/forum/generel-diskussion/936318-vinterhjul_",
      "Connection: Keep-alive",
      "Accept: */*",
      "From: googlebot(at)googlebot.com",
      "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "host: billeder2.bilgalleri.dk",
      "X-Forwarded-For: 66.249.76.55",
      "Accept-Encoding: gzip",
  },
  worker = 0x7e76727fea90 {
    ws = 0x7e76727fecc8 {
      id = "wrk",
      {s,f,r,e} = {0x7e76727eca20,0x7e76727eca20,(nil),+65536},
    },
    },
    vcl = {
      srcname = {
        "input",
        "Default",
      },
    },
},

It's a virtual server with 16 GB of RAM. The storage size is set to 650 GB.

Change History

comment:1 Changed 14 months ago by lkarsten

Thanks for reporting this. We need some more information.

Can you please supply the VCL and startup parameters used?

Also, were you running the identical setup without problems when using malloc before?

This being a VM, what kind of hypervisor is it, and what kind of storage system is underneath?

Are you running Varnish built from source or from packages?

comment:2 Changed 14 months ago by anders-bazoom

Here are the startup options:

DAEMON_OPTS="-a :80 \
             -T localhost:6082 \
             -f /etc/varnish/default.vcl \
             -S /etc/varnish/secret \
             -s file,/var/lib/varnish/varnish_storage.bin,650G \
             -p nuke_limit=3000"

And here's the default.vcl:

backend bil {
        .host = "billeder2.bilgalleri.dk";
        .port = "80";
        .connect_timeout = 5s;
}

sub vcl_recv {
    if (req.http.host == "bilbilleder.invio.dk" ||
        req.http.host == "bilbilleder2.invio.dk" ||
        req.http.host == "bil.webgallerier.dk" ||
        req.http.host == "bil2.webgallerier.dk") {
        set req.backend = bil;
        set req.http.host = "billeder2.bilgalleri.dk";
    }
}

sub vcl_fetch {
        if(beresp.status == 404){
                set beresp.ttl = 0s;
        }
}


The same VM is now running with malloc,10G, and has been stable since I created this ticket.

The VM has 2x1TB in RAID1, and the filesystem is ext4. As for the hypervisor, I'm not sure: I've seen hints that it's Parallels, but the provider's site mentions VMware. I've sent them a mail and will report back here when I have the answer.

I installed varnish using the varnish package repository.

comment:3 Changed 14 months ago by anders-bazoom

Hi again, sorry for the late follow-up. I've gotten a response from the host.

They are using Parallels Bare Metal: http://www.parallels.com/products/server/baremetal/sp/

comment:4 Changed 13 months ago by phk

  • Owner set to phk

I'm not certain that there is much we can do in the short term, until we find a way to reproduce this where we can add debugging.

I'm keeping the ticket open, on the off-chance that the hypervisor's scheduling has exposed a race condition in the Varnish code.

comment:5 Changed 9 months ago by phk

  • Status changed from new to closed
  • Resolution set to worksforme

I'm timing this ticket out now.

Note: See TracTickets for help on using tickets.