[Varnish] #324: Full freeze

Varnish varnish-bugs at projects.linpro.no
Sat Sep 20 11:47:14 CEST 2008


#324: Full freeze
----------------------+-----------------------------------------------------
 Reporter:  perbu     |        Owner:  phk                
     Type:  defect    |       Status:  new                
 Priority:  high      |    Milestone:  Varnish 2.0 release
Component:  varnishd  |      Version:  trunk              
 Severity:  blocker   |   Resolution:                     
 Keywords:            |  
----------------------+-----------------------------------------------------
Comment (by phk):

 As far as I can see, the situation is:

 {{{
 Thread 11 (the acceptor)
   is scheduling workrequest 0x9ae04b0
     containing session 0x9ae0338
   on the threadpool 0x9aa3260
     having picked worker 0x47134c10
       Which is closest to thread 6,
         based on wlb address
         Thread 6 is not first in queue
           so thread 3, 4 & 5 would have managed to get the mutex since.
           And thread 6 seems to be _on_ the queue
             which should be impossible.
       Varnishlog contains:
         0 WorkThread   - 0x47134c10 end
           The smoking gun... ?

 Thread 9 can not be the target
   even though wrq member being set to 0x9ae04b0
     it is in the wrong pool
     and ww->wrq == NULL
       so the wrq content is leftovers from last

 There is no sign anybody is holding the wq mutex (0x9aa3268)

 Yet Thread 11 seems to be stuck on it
    guess based on: "in __lll_mutex_lock_wait ()"

 I notice, that all condvars in the workers have "__nwaiters = 2"
   I would have expected a value of one, as the condvar is private to the
 worker thread.
     This may be a quirk of the pthreads implementation.

 Thread 3 and 4 look suspect, in that they seem to have the exact same
 condvar.
   The other private data seems to be truly private.
   This could also be a quirk of the pthreads implementation.
     If condvars are pooled
       I have a hard time seeing how resolution would work based on the
 recorded data
       This condvar also has __nwaiters = 2, should be 3 if the condvar was
 shared ?
         Suspect
           These threads are in the subject pool
     Or could just be accidentally that they have the same sequence#
       not unlikely
       And futex#
         somewhat unlikely
         And __align garbage
           somewhat unlikely
             not a credible hypothesis
   The two threads have last-used very close to each other: 0.215452 msec
 apart.

 All the worker-threads are idle.

 Queue orders seems to be:
 wq = 0x9aa3260
 4  vtqe_next = #3  0x49938c10, vtqe_prev = #H  0x9aa3290
 1221835916.0777061  wlb = 0x48f35c00
 3  vtqe_next = #5  0x48536c10, vtqe_prev = #4  0x48f37c60
 1221835916.077179   wlb = 0x49936c00
 5  vtqe_next = #6  0x47b35c10, vtqe_prev = #3  0x49938c60
 1221835916.0561609  wlb = 0x48534c00
 6  vtqe_next = #2  0x4a339c10, vtqe_prev = #5  0x48536c60
 1221835915.8512189  wlb = 0x47b33c00
 2  vtqe_next = #T  0x0,        vtqe_prev = #6  0x47b35c60
 1221835915.694943   wlb = 0x4a337c00

 wq = 0x9aa1180
 7  vtqe_next = #13 0x42b2dc10, vtqe_prev = #H  0x9aa11b0
 1221835916.218838   wlb = 0x46731c00
 13 vtqe_next = #8  0x45d32c10, vtqe_prev = #7  0x46733c60
 1221835916.167063   wlb = 0x42b2bc00
 8  vtqe_next = #10 0x44930c10, vtqe_prev = #13 0x42b2dc60
 1221835916.020401   wlb = 0x45d30c00
 10 vtqe_next = #9  0x45331c10, vtqe_prev = #8  0x45d32c60
 1221835915.992481   wlb = 0x4492ec00
 9  vtqe_next = #T  0x0,        vtqe_prev = #10 0x44930c60
 1221835915.676183   wlb = 0x4532fc00

 Varnishlog:

     0 WorkThread   - 0x42b2dc10 start = #13
     0 WorkThread   - 0x44930c10 start = #10
     0 WorkThread   - 0x45331c10 start = #9
     0 WorkThread   - 0x45d32c10 start = #8
     0 WorkThread   - 0x46733c10 start = #7
     0 WorkThread   - 0x47134c10 start
     0 WorkThread   - 0x47b35c10 start = #6
     0 WorkThread   - 0x48536c10 start = #5
     0 WorkThread   - 0x48f37c10 start = #4
     0 WorkThread   - 0x49938c10 start = #3
     0 WorkThread   - 0x4a339c10 start = #2
     0 WorkThread   - 0x47134c10 end

 }}}

 There seems to be no reason to hang on the mutex

 Two threads seem to share a condvar

 The picked thread seems to have been (in the process off) being decimated.

 None of this makes sense.

-- 
Ticket URL: <http://varnish.projects.linpro.no/ticket/324#comment:3>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator


More information about the varnish-bugs mailing list