Changes between Version 5 and Version 6 of ArchitectNotes

2008-08-08 17:48:37

Added back text that was accidentally deleted earlier


So now squid has the object in a page in RAM and written to disk in two places: one copy in the operating system's paging space and one copy in the filesystem.
Squid now uses this RAM for something else, but after some time the HTTP object gets a hit, so squid needs it back.
First squid needs some RAM, so it may decide to push another HTTP object out to disk (repeat above), then it reads the filesystem file back into RAM, and then it sends the data on the network connection's socket.
Did any of that sound like wasted work to you?
Here is how Varnish does it:
Varnish allocates some virtual memory and tells the operating system to back this memory with space from a disk file.  When it needs to send the object to a client, it simply refers to that piece of virtual memory and leaves the rest to the kernel.
If/when the kernel decides it needs to use RAM for something else, the page will get written to the backing file and the RAM page reused elsewhere.
The next time Varnish refers to the virtual memory, the operating system will find a RAM page, possibly freeing one, and read the contents back in from the backing file.
And that's it.  Varnish doesn't really try to control what is cached in RAM and what is not; the kernel has code and hardware support to do a good job at that, and it does a good job.
Varnish also has only a single file on the disk, whereas squid puts each object in its own separate file.  The HTTP objects do not need to be filesystem objects, so there is no point in wasting time in the filesystem name space (directories, filenames and all that) for each object; all Varnish needs is a pointer into virtual memory and a length, and the kernel does the rest.
Virtual memory was meant to make it easier to program when data was larger than the physical memory, but people have still not caught on.
== More caches. ==
But there are more caches around.  The silicon mafia has more or less stalled at a 4GHz CPU clock, and to get even that far they have had to put level 1, 2 and sometimes 3 caches between the CPU and the RAM (which is effectively the level 4 cache).  There are also things like write buffers, pipelines and page-mode fetches involved, all to make it a tad less slow to pick up something from memory.
And since they have hit the 4GHz limit, but decreasing silicon feature sizes give them more and more transistors to work with, multi-CPU designs have become the fancy of the world, despite the fact that they suck as a programming model.
Multi-CPU systems are nothing new, but writing programs that use more than one CPU at a time has always been tricky, and it still is.
Writing programs that perform well on multi-CPU systems is even trickier.
Imagine I have two statistics counters:

{{{
unsigned    n_foo;
unsigned    n_bar;
}}}
So one CPU is chugging along and has to execute {{{n_foo++}}}.
To do that, it must read n_foo and then write n_foo back.  It may or may not involve a load into a CPU register, but that is not important.
To read a memory location means to check if we have it in the CPU's level 1 cache.  It is unlikely to be there unless it is very frequently used.  Next we check the level 2 cache, and let us assume that is a miss as well.
If this is a single-CPU system, the game ends here: we pick it out of RAM and move on.
On a multi-CPU system, and it doesn't matter if the CPUs share a socket or have their own, we first have to check if any of the other CPUs have a modified copy of n_foo stored in their caches, so a special bus transaction goes out to find this out.  If some CPU comes back and says "yeah, I have it", that CPU gets to write it to RAM.  On good hardware designs, our CPU will listen in on the bus during that write operation; on bad designs, it will have to do a memory read afterwards.
Now the CPU can increment the value of n_foo and write it back.  But it is unlikely to go directly back to memory; we might need it again quickly, so the modified value gets stored in our own L1 cache, and then at some point it will end up in RAM.
Now imagine that another CPU wants to execute {{{n_bar++}}} at the same time: can it do that?  No.  Caches operate not on bytes but on some "line size" of bytes, typically from 8 to 128 bytes in each line.  So since the first CPU was busy dealing with {{{n_foo}}}, the second CPU will be trying to grab the same cache line, so it will have to wait, even though it is a different variable.
Starting to get the idea?
Yes, it's ugly.
== How do we cope? ==
Avoid memory operations if at all possible.

Here are some ways Varnish tries to do that:
When we need to handle an HTTP request or response, we have an array of pointers and a workspace.  We do not call malloc(3) for each header; we call it once for the entire workspace and then pick space for the headers from there.  The nice thing about this is that we usually free the entire set of headers in one go, and we can do that simply by resetting a pointer to the start of the workspace.
When we need to copy an HTTP header from one request to another (or from a response to another), we don't copy the string, we just copy the pointer to it.  Provided we do not change or free the source headers, this is perfectly safe; a good example is copying from the client request to the request we will send to the backend.
When the new header has a longer lifetime than the source, we have to copy it, for instance when we store headers in a cached object.  But in that case we build the new header in a workspace, and once we know how big it will be, we do a single malloc(3) to get the space and then put the entire header there.
We also try to reuse memory which is likely to be in the caches.
The worker threads are used in "most recently busy" fashion: when a worker thread becomes free, it goes to the front of the queue, where it is most likely to get the next request, so that all the memory it already has cached, stack space, variables etc., can be reused while still in the cache, instead of requiring expensive fetches from RAM.
We also give each worker thread a private set of the variables it is likely to need, all allocated on the stack of the thread.  That way we are certain that they occupy a page in RAM which none of the other CPUs will ever think about touching as long as this thread runs on its own CPU, and therefore they will not fight over the cache lines.
If all this sounds foreign to you, let me just assure you that it works: we spend fewer than 18 system calls on serving a cache hit, and many even of those are calls to get timestamps for statistics.
These techniques are also nothing new; we have used them in the kernel for more than a decade.  Now it's your turn to learn them :-)
So welcome to Varnish, a 2006 architecture program.
Poul-Henning Kamp,
Varnish architect and coder.