<div dir="ltr">Hi,<div><br></div><div>I'm all for the LUT idea and I think it makes even more sense when you consider H/2.<div><br></div><div>Having the same timestamp during a particular method also makes sense.</div><div>I believe we only need to change now calls to use ctx->now instead of getting the time on each call but haven't looked into detail.</div><div><br></div><div>Being a bit smarter wrt some regular expressions might also be a good idea, specially for cases like the one you mentioned.</div><div>FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32. Perhaps it's worth checking if this provides any benefit before we consider optimizing the custom patterns. </div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell <span dir="ltr"><<a href="mailto:dho@fastly.com" target="_blank">dho@fastly.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

Probably best to bring this discussion out of IRC at this point. I've<br>

looked into Varnish performance (from CPU consumption perspective) on<br>

our installations for nearly 3 years, and wanted to share some<br>

findings (as well as some thoughts on how to address these things).<br>

Unfortunately no TL;DR, but maybe it's good to separate these topics.<br>

<br>

It's going to be a bit before I'm going to have time to do a Varnish 4<br>

test to get numbers -- our fork is based on 2.1.4 and I things do not<br>

line up 1:1. Furthermore, getting noise-free numbers for just Varnish<br>

is difficult for a number of reasons, but effectively I cannot get a<br>

good sample of an individual process due in part to how perf performs<br>

and aggregates its samples, and in part to how many threads we run.<br>

But here are some of the things I've seen as problematic over the last<br>

few years, and ideas for fixing them.<br>

<br>

## Header Scanning<br>

<br>

The number one CPU hog in Varnish right now outside of PCRE is header<br>

processing. Any header lookup or set is a linear scan, and any<br>

operation on a non-existent header has worst-case time. Any headers<br>

added with VCL become worst-case. Our plan for this turns header<br>

access / modification into O(1):<br>

<br>

 a. All headers are accessed from offsetting inside a LUT.<br>

 b. All headers in http_headers.h have a guaranteed slot in the LUT.<br>

 c. VCL-accessed headers are managed at vcc-time and receive slots in<br>

the LUT at initialization-time.<br>

 d. VMODs have an interface such that headers they access outside of<br>

headers defined in http_headers.h or VCL can be registered at<br>

initialization time.<br>

<br>

This also provides a means for accessing multiple same-named headers.<br>

We would introduce some VCL syntax to be able to access a specific<br>

header (e.g. beresp.http.Set-Cookie[2]), to get the number of<br>

occurrences of a header of a particular name. An interface would also<br>

exist to be able to apply some function to all headers (possibly also<br>

to all headers matching a specific name). The latter of these is<br>

something we have already -- we've added a "collection" type to VCL,<br>

as well as some functions to apply a function (called a "callback") to<br>

members of the collection. Callbacks operate in a context in which<br>

they are provided a key and value; they are not closures. This has<br>

workspace overhead that I'm not entirely happy with yet, so we have<br>

not made it a generally accessible thing yet.<br>

<br>

In the case of multiple headers with the same name, the storage format<br>

would still be a LUT, but the next member for a header would appear<br>

"chained" in some later part of the headers, any offset is defined as<br>

a 3-tuple of (p, q, next). When next is NULL, only one header of a<br>

particular instance appears. Since Varnish bounds the number of<br>

headers that can be handled in a request, this table doesn't have to<br>

be very large and can probably be bounded to (n_known_headers +<br>

max_headers) entries.<br>

<br>

## Time Storage as double-precision FP and printf(3)-family<br>

<br>

Varnish uses double-precision FP for storing timestamps. The rationale<br>

for this is reasonable: a native type exists that can support<br>

fractional seconds. Arithmetic between two timestamps can be easily<br>

applied in code without relying on APIs that make said arithmetic<br>

difficult to read. This is a good argument for having times stored in<br>

a native format. Unfortunately, there are a few downsides: FP<br>

operations are typically slow, error-prone at a hardware / compiler<br>

level (<a href="https://github.com/varnishcache/varnish-cache/issues/1875" rel="noreferrer" target="_blank">https://github.com/varnishcache/varnish-cache/issues/1875</a> as a<br>

recent example), and stringifying floating point numbers correctly is<br>

really hard.<br>

<br>

I have just done a measurement of our production Varnish and TIM_foo<br>

functions no longer appear as significant CPU users. I believe this is<br>

because of a change I made a year or two ago in our fork that<br>

snapshots timestamps before calling into VCL. All VRT functions<br>

operate on the same timestamps, and therefore all VCL callbacks appear<br>

to occur at the same time. (This has numerous beneficial properties<br>

and only a few negative ones). Each VCL function gets its own<br>

snapshot.<br>

<br>

However, printf(3)-family functions are still super-heavy CPU<br>

consumers, accounting ~6.5% of total CPU time in our Varnish. A third<br>

of this time is spent in `__printf_fp`, which is the glibc function<br>

that handles representation of floating-point values. The *only* thing<br>

Varnish really uses FP for is doubles; it's logical to assume without<br>

doing a full audit is that something like 20% of printf(3)-family time<br>

is spent converting double-precision numbers to strings and the<br>

majority of the remaining time is format string parsing. From this<br>

perspective, it is still worth it to analyze the performance of<br>

VTIM-family functions to get an idea of their overhead:<br>

<br>

 1. TIM_real in our tree showed up in top functions of a synthetic,<br>

all-hit workload. Annotating the function shows where samples saw most<br>

of the time spent. In this sample, we can see that nearly 2/3ds of the<br>

time is spent in setting up the call stack and 1/3 of the time is<br>

doing FP-ops.<br>

<br>

      │    000000000000aec0 <TIM_real>:<br>

 36.20 │      sub    $0x18,%rsp<br>

 26.70 │      xor    %edi,%edi<br>

       │      mov    %rsp,%rsi<br>

...<br>

  1.81 │40:   cvtsi2 0x8(%rsp),%xmm1<br>

 10.86 │      mulsd  0x8511(%rip),%xmm1        # 13420 <__func__.5739+0x13><br>

 19.00 │      cvtsi2 (%rsp),%xmm0<br>

<br>

 2. Other time-related functions have FP-dominating components.<br>

TIM_format, for example, is dominated by nearly 2/3rds of its time on<br>

a single FP instruction (this is probably partially due to pipeline<br>

stalling).<br>

<br>

       │    000000000000ae40 <TIM_format>:<br>

 62.16 │      cvttsd %xmm0,%rax<br>

<br>

Inlining these functions would be beneficial in the TIM_real (sorry, I<br>

am still operating in V2 terminology, but I believe all of this still<br>

applies to V4) sense, but moving away from double as a time storage<br>

format would be beneficial in general. This would be done by using<br>

64-bit counters that represent the number of nanoseconds since the<br>

epoch. We will run out of those in something like 540 years, and I'm<br>

happy to make that someone else's problem :).<br>

<br>

 a. It reduces significant portion of overhead in VTIM-family functions<br>

 b. It reduces significant portion of overhead in printf<br>

 c. It maintains composability of time arithmetic<br>

<br>

The major downside is that timestamp printing now needs additional<br>

work to print fractional time components.<br>

<br>

Finally, this gets a little into printf(3)-family inefficiencies as<br>

well. Because it parses format strings every time, we've optimized a<br>

number of places where we were using sprintf(3)-like interfaces to<br>

simply use string buffers. There is VSB of course, but we also use<br>

<a href="https://github.com/dhobsd/vstring" rel="noreferrer" target="_blank">https://github.com/dhobsd/vstring</a> (partially for FP stuff, partially<br>

for allowing static-backed buffers to upgrade to dynamic ones if<br>

necessary). The code overhead of string building is unfortunate, but<br>

at 6.5% overhead to use printf(3), this is a real win. (Some of the<br>

unlabeled blocks are things like _IO_default_xsputn, so the overhead<br>

of printf(3) here is likely still higher than 6.5%). See<br>

<a href="https://9vx.org/images/fg.png" rel="noreferrer" target="_blank">https://9vx.org/images/fg.png</a> -- this was taken on a machine that is<br>

handling nearly 12k RPS on top of ~3-4k threads. By moving to integer<br>

times, conversion and printing would likely reduce the overhead of<br>

printf(3) by 20% without actually changing consumption of printf.<br>

<br>

I am unclear how this applies to Varnish 4, but I think relatively<br>

little is changed in this context between the versions.<br>

<br>

## PCRE<br>

<br>

There are other things we've done (like optimizing regexes that are<br>

obviously prefix and suffix matches -- turns out lots of people write<br>

things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if<br>

(strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't<br>

see those as being as high priority for upstream; they're largely<br>

issues for our multi-tenant use case. We have done this already;<br>

another thing we would like to do is to check regexes for things like<br>

backtracking and use DFA-based matching where possible. In the flame<br>

graph screenshot, the obvious VRT functions are PCRE.<br>

<br>

## Expected Impact<br>

<br>

The expected impact of fixing these things is almost purely in<br>

latency. For this machine handling 12k RPS, that is the constant<br>

throughput bound, but we are bursting up to nearly 4k threads to serve<br>

the load. If header processing, PCRE, and printf were reduced to 50%<br>

of their current overhead, we'd expect to be able to handle the same<br>

load with something like 350 fewer threads, which is a real win for<br>

us. Note than even our 99%ile latency is largely covered by cache<br>

hits, so these effects would improve service for the vast majority of<br>

requests.<br>

<br>

Anyway, those are some thoughts. Looking forward to comments, though<br>

maybe there's a better venue for that than this ML?<br>

<br>

--dho<br>

<br>

_______________________________________________<br>

varnish-dev mailing list<br>

<a href="mailto:varnish-dev@varnish-cache.org">varnish-dev@varnish-cache.org</a><br>

<a href="https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev" rel="noreferrer" target="_blank">https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev</a></blockquote></div><br></div></div></div></div>