[Varnish] #610: rushing too often may overflow session workspace
Varnish
varnish-bugs at projects.linpro.no
Wed Jan 13 09:17:29 CET 2010
#610: rushing too often may overflow session workspace
----------------------+-----------------------------------------------------
Reporter: slink | Owner: phk
Type: defect | Status: new
Priority: high | Milestone:
Component: varnishd | Version: 2.0
Severity: major | Resolution:
Keywords: |
----------------------+-----------------------------------------------------
Comment (by slink):
Hi Paul,
thanks for looking into this. Maybe I should clarify that this bug is
really about two different issues:
1. HSH_rush being called too often (when the obj being waited for is not
necessarily ready)
2. Exhaustion of the session workspace when cnt_lookup->HSH_Prepare is
being called too often.
2) is a consequence of 1).
But I agree that changing if (sp->obj == NULL) into if (sp->objhead ==
NULL) at the top of cnt_lookup would also make sense: The only way
sp->objhead ever gets set is when we wait for a busy object, and IIUC
waiting for a busy object is the only case when we reenter cnt_lookup.
(We probably could also check for sp->hashptr as that is being set in
HSH_Prepare)
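
To illustrate why the choice of guard matters, here is a minimal toy model
in C. It is not Varnish code: toy_sess, toy_prepare, toy_lookup_busy and
toy_run are hypothetical names, the workspace is just a byte counter, and
the sizes (1024 bytes total, 256 bytes per prepare) are made up. It only
assumes what is described above: obj stays NULL while the object is busy,
and objhead gets set once the session waits on a busy object.

```c
#include <stddef.h>

/* Toy stand-ins; these are NOT the real Varnish structures. */
struct toy_sess {
    const void *obj;      /* set once a usable object has been found */
    const void *objhead;  /* set while waiting on a busy object */
    size_t ws_used;       /* bytes taken from the session workspace */
    size_t ws_size;       /* total workspace size */
};

enum toy_guard { GUARD_OBJ, GUARD_OBJHEAD };

/* Models HSH_Prepare: reserves workspace; returns -1 on overflow. */
static int toy_prepare(struct toy_sess *sp, size_t need)
{
    if (sp->ws_used + need > sp->ws_size)
        return -1;
    sp->ws_used += need;
    return 0;
}

/* Models one entry into cnt_lookup while the object is still busy. */
static int toy_lookup_busy(struct toy_sess *sp, enum toy_guard g, size_t need)
{
    static const int busy_head = 0;
    int first_entry = (g == GUARD_OBJ) ? (sp->obj == NULL)
                                       : (sp->objhead == NULL);

    if (first_entry && toy_prepare(sp, need) != 0)
        return -1;
    sp->objhead = &busy_head;  /* wait on the busy object */
    return 0;
}

/* Returns how many (re)entries the session survives, at most `entries`. */
static int toy_run(enum toy_guard g, int entries)
{
    struct toy_sess sp = { NULL, NULL, 0, 1024 };
    int i;

    for (i = 0; i < entries; i++)
        if (toy_lookup_busy(&sp, g, 256) != 0)
            break;
    return i;
}
```

In this model, the sp->obj guard runs the prepare step on every reentry
and the workspace runs out after a few rushes, while the sp->objhead
guard runs it only once no matter how often the session is woken up.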
I will test this suggestion, but this will probably take some more
time.
Having said that, I still think that my suggested change (which
targets issue 1) is an important fix/improvement even if it did not
have the consequence of workspace exhaustion:
* Waking up waiting sessions unnecessarily may lead to extremely high
peak load
In particular, HSH_Drop calls HSH_Deref (which would rush), so
whenever we decide or are being forced to give up in cnt_fetch, we
rush all the other waiters (which will end up waiting again).
On the production system I am working on, we've seen cases where,
with a slow backend, the load on varnish servers would suddenly
rise into the thousands, and this scenario would explain the effect.
* I would like to have better control over restarts
Another change which I've made for production use (and which I still
need to document) is to increment sp->restarts in hsh_rush,
following the idea that a session which has waited for a busy object
has effectively been (internally) restarted.
At any rate, it has waited (probably quite a while) for the busy
object, so we might want to choose different parameters for the
second fetch (like increasing the grace time).
In order to get an exact figure for the restarts (with these new
semantics), I need to make sure that hsh_rush is only called when a
busy object becomes available.
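
Both points can be sketched with a small simulation in C. Again this is a
toy model, not Varnish code: toy_waiter, toy_rush and toy_simulate are
hypothetical names, and it assumes that every rush wakes all waiting
sessions and that a session woken while the object is still busy simply
goes back to waiting. The rush increments a per-session restart counter,
as in the change described above.

```c
#define TOY_MAX_WAITERS 64

/* Toy waiter; NOT the real Varnish session structure. */
struct toy_waiter {
    int parked;    /* 1 while waiting on the busy object */
    int restarts;  /* incremented each time the session is rushed */
};

/* Models the modified hsh_rush: wake all waiters, count a restart each. */
static long toy_rush(struct toy_waiter *w, int n)
{
    long woken = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (w[i].parked) {
            w[i].parked = 0;
            w[i].restarts++;
            woken++;
        }
    }
    return woken;
}

/*
 * Simulates `failed_fetches` abandoned fetches followed by one success.
 * If only_when_ready is 0, every attempt rushes the waiters; otherwise
 * only the successful one does. Returns the total number of wakeups and
 * stores the restart count of the first waiter in *restarts_out.
 */
static long toy_simulate(int n, int failed_fetches, int only_when_ready,
                         int *restarts_out)
{
    struct toy_waiter w[TOY_MAX_WAITERS];
    long wakeups = 0;
    int i, attempt;

    if (n < 1 || n > TOY_MAX_WAITERS)
        return -1;
    for (i = 0; i < n; i++) {
        w[i].parked = 1;
        w[i].restarts = 0;
    }
    for (attempt = 0; attempt <= failed_fetches; attempt++) {
        int ready = (attempt == failed_fetches);

        if (ready || !only_when_ready)
            wakeups += toy_rush(w, n);
        if (!ready)
            for (i = 0; i < n; i++)
                w[i].parked = 1;  /* object still busy: wait again */
    }
    *restarts_out = w[0].restarts;
    return wakeups;
}
```

With 50 waiters and 3 abandoned fetches, rushing on every attempt gives
200 wakeups and a restart count of 4 per session in this model; rushing
only when the object becomes available gives 50 wakeups and a restart
count of 1.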
Maybe it is of interest that my suggested changes have been running in
production without any issues since January 4th.
To conclude, I think changing the test in cnt_lookup makes sense, but
I also think we still need the changes which I suggested.
I must say that I still haven't looked at the trunk because, at this
point, I need to focus on improving the stability of production
versions.
Nils
--
Ticket URL: <http://varnish.projects.linpro.no/ticket/610#comment:2>
Varnish <http://varnish.projects.linpro.no/>
The Varnish HTTP Accelerator