1. 70
    1. 3

      I know very little about cache replacement algorithms. Can somebody explain why, when evicting from S, it can evict more than one element, even though it only needs to make space for one element?

      EDIT: I’m doubly confused, now. The pseudocode algorithm for evictS seems to show that multiple elements can be evicted from S at once, but the animation does not display this behaviour.

      1. 3

        There’s a bug in the pseudocode; the line evicted ← true should be after the following end if.

        1. 1

          In that case, there is no need for the evicted variable at all, or even a loop. The S.size > 0 condition would also become unnecessary, because EvictS is only being called when the cache is full.

          So this line of pseudocode:

          while not evicted and S.size > 0 do

          … could be completely eliminated, as well as the evicted variable. This would make the code much simpler.
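
          As a rough Python sketch (the queue and field names here are my guesses, not anything the post specifies), the whole function could then be just:

          # hypothetical simplified EvictS: exactly one item leaves S per call
          def evict_s(S, M, G):
              t = S.pop_tail()          # oldest item in the small FIFO
              if t.freq > 1:            # hit while it was in S: promote it
                  M.insert_head(t)
                  if M.is_full():
                      evict_m()         # make room in the main FIFO
              else:
                  G.insert_head(t)      # one-hit wonder: demote to the ghost FIFO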

          Edit: Simplified the code even further.

      2. 2

        Okay, I didn’t read their code well enough. Their pseudocode for evictS() does match their simulation code and their py-cachemonCache implementation: evictS() will pop items from the small fifo until it is empty or it evicts an item into the ghost queue G. Any item it sees during this process that can be moved to the main queue M will be moved into that queue.
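
        In rough Python terms (the names are mine, not theirs), I read that behaviour as:

        # hedged sketch of their evictS(): keep popping from S until it is
        # empty or one item has been demoted to the ghost queue G, promoting
        # everything that qualifies into the main queue M along the way
        def evict_s(S, M, G):
            while len(S) > 0:
                t = S.pop_tail()
                if t.freq > 1:
                    M.insert_head(t)   # promotion: S shrinks, M grows
                else:
                    G.insert_head(t)   # metadata-only ghost entry
                    return             # stop after the first real eviction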

        I think they did it like that because their code models S and M as sharing a single buffer, so “moving” elements from S to M never causes M to evict anything because S shrinks and M grows.

        They should probably change their pseudocode so that evictS() evicts 0 or 1 items and leave the shared buffer thing as an implementation detail.

        Edit: removed incorrect claim about a bug in their python code. It’s not buggy, just confusing!

        1. 2

          I think they did it like that because their code models S and M as sharing a single buffer, so “moving” elements from S to M never causes M to evict anything because S shrinks and M grows.

          But that still has a really important side effect: it means that the eviction has forced an element to go into the ghost queue G much sooner than it otherwise would have. Elements in the ghost queue only have metadata, so requests for them become cache misses.

          Other elements that remain at the tail of S could also be evicted sooner than they would otherwise be, depending on the implementation.

          That means they wouldn’t have as much time to get their counters incremented and be moved into M as they otherwise would.

          This also means that S could become much smaller than the 10% that they have defined (depending on the implementation). And they even mention this as being a problem:

          When S is very small, popular objects do not have enough time to accumulate a hit before being evicted, so the precision is low

          Furthermore, the pseudocode does not seem to allow the size of S to increase. It can always decrease in size, but never increase (except when the queues are first being filled). In most cases, given enough time, S will always become a queue with size 1.

          They also mention that they have tried adapting the size of the queues but have reached the conclusion that it’s not a good idea:

          We have also experimented with using an adaptive algorithm to change the size of FIFO queues in S3-FIFO. The results show that using an adaptive algorithm can improve the tail performance, but degrade overall performance. We find that tuning the adaptive algorithm is very challenging.

          Note that, as you alluded to, they also mention using just one queue for both S and M:

          Because objects evicted from S may enter M, they can be implemented using one queue with a pointer pointed at the 10% mark.

          However, as you can see in the above quote, even when using just one queue for both S and M, the relative sizes of S and M are not supposed to change. They are always supposed to remain at 10% / 90% (except at the beginning, when it would be 100% / 0% and then gradually settle into the correct ratio as the cache is filled, I assume).
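
          As a purely illustrative sketch (my own, not from the article) of how I picture that single-queue layout:

          # one FIFO holds everything; the newest ~10% of it plays the role of S
          # and the rest is M, so “moving” an item from S to M is just the 10%
          # pointer sliding past it (freq checks and the ghost queue are omitted)
          class SharedFifo:
              def __init__(self, capacity, s_fraction=0.1):
                  self.items = []                        # index 0 = newest
                  self.s_len = max(1, int(capacity * s_fraction))

              def s_items(self):
                  return self.items[:self.s_len]         # the S region

              def m_items(self):
                  return self.items[self.s_len:]         # the M region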

          They should probably change their pseudocode so that evictS() evicts 0 or 1 items and leave the shared buffer thing as an implementation detail.

          Agreed, I think this is what makes the most sense, when considering everything that they have written in the article.

          But I wonder if they ever benchmarked such an implementation, i.e. one that evicts only 0 or 1 items from S?

          I would assume so, but your description of the actual simulation code and the python code seems to be implying that they didn’t?

          Edit: clarified some points.

          1. 2

            But that still has a really important side effect: it means that the eviction has forced an element to go into the ghost queue G much sooner than it otherwise would have. Elements in the ghost queue only have metadata, so requests for them become cache misses.

            That’s a good point. And also your point about how the implemented and described algorithm allows less time for each element in S to accumulate hits.

            In most cases, given enough time, S will always become a queue with size 1.

            The pseudocode doesn’t specify enough for us to assume that, I think, and that’s not how the implementations work. On each insert, the combined size of the new item, S and M is compared to the total cache size, and if it’s over the limit then there’s a loop like this:

            total_mem_limit = target_mem(M) + target_mem(S)
            while used_mem(new_item) + used_mem(S) + used_mem(M) > total_mem_limit:
                if used_mem(M) > target_mem(M) or isempty(S):
                    evictM()  # M is over its target size (or S is empty): evict from M
                else:
                    evictS()  # M is at or under target and S is non-empty: evict from S
            

            Values will only be evicted from S if M is already less than or equal to its target size. This means that there is always enough space in S for each new item (so long as the item is no bigger than target_mem(S); otherwise the program crashes, I think). This rule also means that the combined size of S and M never exceeds the total target.

            It’s kinda complicated and I don’t see a good reason for it, but maybe there is one.

            I implemented S3-FIFO in python in the way that I expected it to work (with at most two evictions per insert: S and M, or S and G). The S3FIFO code is 123 lines including blanks, docs and comments. The rest of the file is some other cache implementations and tests to compare their performance. You’ll need scipy to run the tests.

            I didn’t want to work out how to get their real world datasets to work, but they mentioned that requests keys are sort of Zipf distributed, so I generated synthetic data like that. I compared S3FIFO with a simple FIFO and python’s functools.lru_cache, neither of which are state of the art. In my tests, S3FIFO has a strong relative advantage when the cache is very small and a slight relative disadvantage when the cache is large.
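
            For reference, the synthetic keys can be generated with scipy along these lines (the exponent and sample count here are illustrative, not exactly what my tests use):

            # Zipf-distributed request keys; a smaller exponent `a` means a
            # heavier tail, i.e. more one-hit wonders
            from scipy.stats import zipf

            def synthetic_requests(n=100_000, a=1.2, seed=0):
                return zipf(a).rvs(size=n, random_state=seed)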

            1. 1

              In most cases, given enough time, S will always become a queue with size 1.

              The pseudocode doesn’t specify enough for us to assume that, I think

              What do you feel is missing?

              From what I can see, and as @Moonchild has keenly observed, for each Insert(x) call, EvictS() can evict either one or multiple items from the S queue, but at most one item is inserted into S.

              I guess whether EvictS() is called depends on what “cache is full” means, but the way I understand it, it means the total cache size (including the new item and the S and M queues) is equal to or greater than the target size; otherwise the memory usage of the cache could go way above the target, I think.

              Which means that after the cache has already been filled, EvictS() would be called on every insertion if we assume that all the items have the same size and no items go into G (yes, the latter is a big assumption, but still, it’s not impossible…).

              I’m only considering the pseudocode as written, not the actual implementations. That said, if the pseudocode were fixed to only evict one item from the S queue in EvictS() then I think everything would work fine.

              Values will only be evicted from S if M is already less than or equal to its target size.

              I see… Unfortunately, that’s not how the pseudocode seems to be written, so I guess there are significant differences between the pseudocode and the implementations…

              It’s kinda complicated and I don’t see a good reason for it, but maybe there is one.

              Looking at the code you provided, it seems to me like they wrote the code in a way that the sizes of S and M can be flexible, or rather, change dynamically? Perhaps this is an artifact of them playing around with the dynamic adjustment of queue sizes (like in the ARC algorithm), which they said they did in the article. Although it’s hard to tell for sure.

              The S3FIFO code is 123 lines including blanks, docs and comments.

              That’s really cool. It’s amazing how such a simple algorithm seems to be so good.

              Edit: clarified the assumptions.

              1. 2

                I see… Unfortunately, that’s not how the pseudocode seems to be written, so I guess there are significant differences between the pseudocode and the implementations…

                Yes.

                Looking at the code you provided, it seems to me like they wrote the code in a way that the sizes of S and M can be flexible, or rather, change dynamically? Perhaps this is an artifact of them playing around with the dynamic adjustment of queue sizes (like in the ARC algorithm), which they said they did in the article. Although it’s hard to tell for sure.

                Yes, their implementation allows S and M to change size, which is surely useful at startup because the algorithm will fill the whole cache before evicting anything, rather than (as in my implementation) filling only the first 10% before potentially evicting values to the ghost fifo. I think this is why my implementation has poor performance when the cache size is large relative to the input.

                It’s too late to explain, but I think I understand their algorithm and their while loops now. I’ll maybe implement it tomorrow and compare it to the version I wrote today. If I can be arsed I’ll make a PR to the author’s blog to fix their pseudocode, cos it is definitely wrong.

                1. 2

                  This is the author. Thank you for the clarifications! :)

                  Great work on re-implementing and evaluating the S3-FIFO algorithm! As you mentioned, it is important to avoid evicting from S when the cache is not full (especially for benchmarks with small traces). It should be just a few lines of change to your current implementation. Let me know what the results look like after the change.

                  1. 2

                    You’re welcome, and thanks to you and your colleagues for publishing interesting research :)

                    I found your HotOS paper today and it was very helpful in understanding why the algorithm does what it does.

                    I’ve adjusted my implementation and think it is now pretty close to the version in libcacheSim and performs better on larger cache sizes. My new benchmark results are in the readme along with my explanation of the algorithm, which could of course be wrong too :)

                    I wrote a few variations, too:

                    • S3FIFO3 is a version where evictS() processes at most one item.
                    • S3FIFO4 is a version where evictS() either evicts 1 item to G and promotes none, or promotes 1 or more items and evicts none.
                    • EagerPromotionS3FIFO is my old implementation where the FIFOs are fixed size so promotion/eviction is eager.

                    On my data (keys are integers sampled from the Zipf distribution): S3FIFO3 has identical performance to S3FIFO; S3FIFO4 is consistently a little worse; EagerPromotionS3FIFO is bad at large cache sizes. I didn’t do significance testing because I’m a hack and a fraud.

                    1. 2

                      Super cool work! I think the new implementation looks good. And the discussions here are very helpful for updating the pseudo-code.

              2. 2

                There was a typo in the original pseudo-code (algo 3), as pointed out, and I have fixed it now. Thank you!

                From what I can see, and as @Moonchild has keenly observed, for each Insert(x) call, EvictS() can evict either one or multiple items from the S queue, but at most one item is inserted into S.

                It is true that at most one item is inserted into S during the current request, but more items will be inserted into S later, so S will keep its 10% size over the long run. However, some items may get evicted slightly earlier, and S may become smaller than 10% during the loop, as pointed out by you and @Moonchild. I have not thought about this before. Nice catch! In terms of the impact, my guess is that in most cases, where the cache size is not too small, this does not cause a problem.

                I agree that evicting 0 or 1 items from the cache each time will make it easier to understand and follow the 10% semantics more closely. I will try this out to make sure there is no other issue, and I am happy to make the changes. BTW, I am happy to accept any change that will make the post clearer.

                1. 1

                  I think the pseudocode still doesn’t match the libcacheSim implementation. The core issue is this discrepancy:

                  • evictS() in the pseudocode can call evictM() as many times as it has promotable items if M is full.
                  • evict_fifo() in libcacheSim never calls evict_main().

                  Personally, I really doubt that you mean for evictS() to be able to evict more than one element from the cache per call.

                  Here’s how an insert can trigger an eviction in the pseudocode and in libcachesim:

                  • insert(x) in the pseudocode calls evictS() if the cache is full.
                  • In the libcacheSim implementation, evict() is called if the cache is full. evict() will call evict_main() if the size of M is greater than its target size, else evict_fifo().

                  So, on insert when the cache is full and M is at its target size, the pseudocode will promote 0-many items from S to M, evict the same number of items from M, and then also possibly demote/evict an item from S to G. Worst case (M at target size (90%), S at target size (10%), all but one value in S promotable to M), I think calling evictS() would evict 10% of the cache.

                  By contrast, libcacheSim will only ever evict one item (if all requests are constant size); see the sketch after this list:

                  • if M is greater than its target size, then call evict_main() and evict one item from M.
                  • if M is less than its target size, then call evict_fifo() and promote 0-many items to M (this does not cause any evictions from M) and possibly demote/evict one item to G. If we didn’t demote/evict anything, evict() will be called again and M will be greater than target size this time, so evict_main() is called and one item is evicted from main.
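
                  In rough Python terms (apart from evict, evict_main and evict_fifo, the names are mine, not libcacheSim’s):

                  # hedged sketch of the dispatch described above; insert()
                  # keeps calling evict() until enough space has been freed
                  def evict(cache):
                      if cache.M.used_mem() > cache.M.target_mem():
                          evict_main(cache)   # evicts exactly one item from M
                      else:
                          evict_fifo(cache)   # promotes 0-many items into M and
                                              # demotes/evicts at most one to G
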
      3. [Comment removed by author]

        1. 1

          That’s even more confusing! The python implementation is different too—when it moves an element from S to M, it will abort straightaway only if M is full; if M has space, it will continue looping. Admittedly, in the steady state of the cache, M will always be full, so perhaps it doesn’t make much practical difference, but it still seems rather strange.

          1. 1

            I didn’t read the code carefully enough the first time. I think my new comment explains the situation better.

    2. 3

      I really wish the image could be paused/rewound, and that it had textual annotations.

      1. 4

        Will make the change (in the next few weeks). Thank you for the suggestion! :)

    3. 2

      The “one hit wonder” observation seems consistent with the “most objects are short-lived” observation that serves as a basis for generational garbage collectors.

      I would be a bit surprised if the literature did not already feature such “generational caches”.

      I would also like to see a benchmark without the “eviction on collision” behaviour, because it is not always a permissible behaviour for the cache.

      1. 3

        In ‘a CPU is a compiler’, I mention the (admittedly somewhat weak) correspondence I observed between QLRU cache replacement and generational garbage collection.

        EDIT: an amusing fact: LRU caches are often cited as an example of a real-world application which doesn’t follow the generational hypothesis (and can therefore foil generational GCs).

      2. 2

        The “one hit wonder” observation seems consistent with the “most objects are short-lived” observation that serves as a basis for generational garbage collectors.

        The prose there talks about CDN requests, which I would expect to be closer to L2 / L3 caches because you request an entire object from them and then process it, whereas L1 caches in CPUs are things that you request individual object fields from and process them in registers.

        Most high-performance cache architectures that I’ve seen have different replacement policies for L1 vs lower-level caches. Even the L1 isn’t really the top-level cache, because the store unit does write combining to try to produce entire cache lines of data, so that you can avoid a read if you are storing. Intel caches do exciting dances where (oversimplifying a bit) they have a few different cache replacement policies that are tuned for different workloads, pin a few sets to each policy, and switch the remaining sets depending on which policy is getting the higher hit rate.
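
        A toy sketch of that selection mechanism (my rough model of set dueling, not Intel’s actual implementation): leader sets pinned to each policy record their misses in a saturating counter, and follower sets adopt whichever policy the counter currently favours.

        # hypothetical policy selector for a dueling cache
        class PolicySelector:
            def __init__(self, limit=1024):
                self.counter, self.limit = 0, limit

            def miss_in_leader(self, policy):
                # a miss in an "A" leader set nudges the counter towards B
                delta = 1 if policy == "A" else -1
                self.counter = max(-self.limit, min(self.limit, self.counter + delta))

            def follower_policy(self):
                # more misses under A push the counter up, so followers use B
                return "B" if self.counter > 0 else "A"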

        The correspondence between caches and young generations was one of the motivations for Sun Labs’ Project Maxwell, where they did the young generation collection in hardware in the cache and had barriers that let you do old-generation collection in software and promote objects to the old generation when they were evicted from cache.

        1. 2

          Do you have more info about Sun Labs’ Project Maxwell?

          I tried a quick Google, but the information was very very superficial.

          1. 2

            There’s a long tech report, but I was given a copy privately and don’t feel comfortable sharing it. I’ll see if I can persuade the authors to put it online somewhere. They had a mix of good and bad ideas (as with most research projects, mine included) but the good ideas are really interesting. I keep pondering how to compose them with CHERI.

    4. 2

      After reading this paper, I tried to quantify how often Bélády’s Anomaly happens, and whether it’s a real concern for system builders: https://brooker.co.za/blog/2023/06/23/belady.html (Spoiler: no, it’s not).

      For those unfamiliar with Bélády’s Anomaly:

      Running on a paging machine and using the FIFO replacement algorithm, there are instances when the program runs faster if one reduces the storage space allotted to it.

      Which is very weird, and very surprising.
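
      It’s easy to reproduce with the classic reference string and a few lines of Python:

      # FIFO paging simulation: the classic string incurs 9 faults with
      # 3 frames but 10 faults with 4 frames, i.e. Bélády's Anomaly
      from collections import deque

      def fifo_faults(refs, frames):
          resident, faults = deque(), 0
          for page in refs:
              if page not in resident:
                  faults += 1
                  if len(resident) == frames:
                      resident.popleft()     # evict the oldest resident page
                  resident.append(page)
          return faults

      refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
      print(fifo_faults(refs, 3))  # 9 faults
      print(fifo_faults(refs, 4))  # 10 faults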

    5. 2

      Does anyone know if the paper is publicly available?

      The only link to the paper that I was able to find does not seem to work.

      1. 3

        There is a similar paper from the same author: https://junchengyang.com/publication/hotos23-qdlp.pdf

      2. 3

        https://s3fifo.com/ says “To Appear at SOSP” which seems to be scheduled for October. https://sosp2023.mpi-sws.org/

        So I guess the paper will be published then?