Compared performance of Varnish Cache on x86_64 and aarch64

Pål Hermunn Johansen hermunn at varnish-software.com
Fri Jul 31 13:43:11 UTC 2020


I am sorry for being so late to the game, but here it goes:

On Wed, Jul 29, 2020 at 14:12, Poul-Henning Kamp <phk at phk.freebsd.dk> wrote:
> Your measurement says that there is 2/3 chance that the latency
> is between:
>
>         655.40µs - 798.70µs     = -143.30µs
>
> and
>         655.40µs + 798.70µs     = 1454.10µs

No, it does not. There is no claim anywhere that the numbers follow a
normal distribution, or even an approximation of one. If anything, the
calculation you do demonstrates that the data is far from normally
distributed (as expected).

> You cannot conclude _anything_ from those numbers.

There are two numbers, the average and the standard deviation, and
they are calculated from the data, but the truth is hidden deeper in
the data. Looking at these particular numbers, I agree completely
that it is wrong to conclude that one is better than the other. I am
not saying that the statements in the article are false, just that you
do not have the data to draw the conclusions.

Furthermore, I have to say that Geoff got things right (see below). As
a mathematician, I can attest that statistics is hard, and trusting
the summary output of wrk to draw conclusions is simply the wrong
thing to do.

In this case we have a luxury which you typically do not have: data is
essentially free. You can run many tests, short or long, with
different parameters. A 30-second test is simply not enough for
anything.

As Geoff indicated, for each transaction you can extract many relevant
values from varnishlog, the most obvious ones being the status,
hit/miss, time to first byte and time to last byte. They can be
extracted and saved to a CSV file by running varnishncsa with a custom
format string, and you can use R (I used it as a tool in my previous
job - not a fan) to do statistical analysis on the data; a sketch
follows below. The Student's t-test suggestion from Geoff is a good
idea, but just looking at one set of numbers without considering other
factors is mathematically problematic.
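
To make this concrete, here is a minimal sketch, assuming a running
varnishd and two finished test runs. The format specifiers are from
the varnishncsa man page; the file names and the R column name (V3,
the default read.csv name for the third column) are placeholders:

    # One CSV line per client request: status, handling (hit/miss/...),
    # time to first byte (seconds), total time (microseconds).
    varnishncsa -F '%s,%{Varnish:handling}x,%{Varnish:time_firstbyte}x,%D' \
        > x86.csv

    # Compare the time-to-first-byte columns of two runs. t.test()
    # performs a two-sample t-test (Welch's variant by default).
    Rscript -e 'x <- read.csv("x86.csv", header = FALSE);
                y <- read.csv("arm.csv", header = FALSE);
                t.test(x$V3, y$V3)'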

Anyway, some obvious questions then arise. For example:
- How do the numbers from wrk and varnishlog/varnishncsa compare?
Did wrk report the same total number of transactions as Varnish? If
there is a discrepancy, then the errors might be caused by some
resource constraint (the number of sockets, or dropped SYN packets?).
- How do the average and the maximum compare between Varnish and wrk?
- What is the CPU usage of the kernel, the benchmarking tool and the
Varnish processes in the tests (see the sketch after this list)?
- What is the difference between the time to first byte and the time
to last byte in Varnish for different object sizes?
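
For the CPU question, something along these lines gives a first
impression (assuming Linux with the sysstat tools installed; the
process names and the one-second interval are just what I would
start with):

    # Per-CPU user/system/idle breakdown, printed once per second.
    mpstat -P ALL 1

    # Per-process CPU usage for varnishd and the load generator.
    pidstat -u 1 -p "$(pgrep -d, -x varnishd),$(pgrep -d, -x wrk)"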

When Varnish writes to a socket, it hands the bytes over to the
kernel, and when the write call returns, we do not know how far the
bytes have gotten, nor how long it will take before they reach their
final destination. The bytes may be in a kernel buffer, they may be on
the network card, or they may already have been received by the
client's kernel, and they may even have made it all the way into wrk
(which may or may not have timestamped the response yet). Typically,
depending on many things, Varnish will report faster times than wrk,
but since returning from the write call means that the calling thread
must be rescheduled, it is even possible for wrk to see some requests
as faster than what Varnish reports.

Running wrk2 at different request rates in a series of tests seems
natural to me, so that you can observe when (and how) the system
starts running into bottlenecks; see the sketch below. Note that the
bottleneck can just as well be in wrk2 itself, or in the combined CPU
usage of kernel + Varnish + wrk2.
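
A minimal sketch of such a sweep (the thread and connection counts,
the rates and the URL are all placeholders to adjust for your setup):

    # Fixed-rate runs at increasing request rates; --latency makes
    # wrk2 print its full latency distribution.
    for rate in 1000 2000 5000 10000 20000; do
        wrk2 -t 4 -c 100 -d 60s -R "$rate" --latency \
            http://server:8080/ > "wrk2_${rate}.log"
    done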

To complicate things even further: in your ARM vs. x86_64 tests, my
guess is that both the kernel parameters and the network parameters
differ, and the distributions probably have good reasons to choose
different values. It is very likely that these differences affect the
performance of the systems in many ways, and that different tests will
have different "optimal" tunings of kernel and network parameters. A
diff of the relevant settings on the two machines (sketched below) is
a good place to start.
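
For example, something like this, with just a handful of settings
that often matter for this kind of test (the real list is much
longer):

    # Dump a few network-related settings on each machine, then diff.
    sysctl net.core.somaxconn net.core.rmem_max net.core.wmem_max \
           net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
           net.ipv4.tcp_congestion_control > "sysctl-$(uname -m).txt"
    diff sysctl-x86_64.txt sysctl-aarch64.txt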

Sorry for rambling, but getting the statistics wrong is so easy. The
question is very interesting, but if you want to draw conclusions, you
should do the analysis, and (ideally) give access to the raw data in
case anyone wants to have a look.

Best,
Pål

On Fri, Jul 31, 2020 at 08:45, Geoff Simmons <geoff at uplex.de> wrote:
>
> On 7/28/20 13:52, Martin Grigorov wrote:
> >
> > I've just posted an article [1] about comparing the performance of Varnish
> > Cache on two similar
> > machines - the main difference is the CPU architecture - x86_64 vs aarch64.
> > It uses a specific use case - the backend service just returns a static
> > content. The idea is
> > to compare Varnish on the different architectures but also to compare
> > Varnish against the backend HTTP server.
> > What is interesting is that Varnish gives the same throughput as the
> > backend server on x86_64 but on aarch64 it is around 30% slower than the
> > backend.
>
> Does your test have an account of whether there were any errors in
> backend fetches? Don't know if that explains anything, but with a
> connect timeout of 10s and first byte timeout of 5m, any error would
> have a considerable effect on the results of a 30 second test.
>
> The test tool output doesn't say anything I can see about error rates --
> whether all responses had status 200, and if not, how many had which
> other status. Ideally it should be all 200, otherwise the results may
> not be valid.
>
> I agree with phk that a statistical analysis is needed for a robust
> statement about differences between the two platforms. For that, you'd
> need more than the summary stats shown in your blog post -- you need to
> collect all of the response times. What I usually do is query Varnish
> client request logs for Timestamp:Resp and save the number in the last
> column.
>
> t.test() in R runs Student's t-test (me R fanboi).