1. 14

  2. 8

    Though (as the note now at the top of the post says) this turned out not to show what the author initially thought, it points out two important things that are often not emphasized in data-structures courses:

    • Sometimes, space is speed. In this case, cache effects favor the smaller structure (see the full thread). The same thing can apply at other scales, too: fitting your database’s working set in RAM (or improving locality of access) can be a speed boost in database applications, or fitting your data on SSD, for instance.

    • Generally, big-O predicts how complexity increases as data grows very large, but it doesn’t mean performance will follow the predicted curve at small scales, and doesn’t tell you everything about practical performance. Again, storage performance is an obvious case–a B-tree and a binary tree have the same big-O but a B-tree with large fan-out will be faster to access on disk (roughly: because it takes fewer disk seeks), and likewise re: conventional sort algorithms vs. tuned external sorting, when sorting data that won’t fit in RAM.

      Besides data-access times, performance can differ from expectations because of things like surprise issues with branch prediction and the many issues once you get into distributed and concurrent systems. And sometimes the sort of performance you are thinking of isn’t the same as scalability which isn’t necessarily the same as what the user perceives as “fast” or “slow”. The real world is tricky!

    Although the student’s first conclusion was wrong, it’s commendable that they actually did some empirical digging, and it points out gaps in how we introduce students to data structures and algorithms as much as any mistakes by the student.

    1. 3

      Insightful. Yes, I actually think this is a pretty useful discussion to be having, quite regardless of the article not being written from a position of knowledge. Good links; it would be wonderful if more introductory courses would spend time on these topics. (I’m hopeful that the best ones already do.)

      1. 2

        Sean Parent had an interesting and relevant observation in his presentation at CppCon.

        There’s a 200x speed difference between L1 cache and main memory, but log2(1000000000000) is only ~40. In other words, cache and memory access times can have a significantly larger impact on performance than a factor of log n in the Big O of an algorithm.

        1. 1

          Cache associativity is interesting to understand and can certainly be demonstrated, but it tends to be less of a problem compared to the other issues discussed in this article. It is certainly not something that should be at the forefront of your mind as you write programs.

          From the cache effects link you posted.

          1. 1

            Right, he’s talking about about a single effect called cache associativity (your quote starts “Cache associativity is interesting…”), not about the effects of caches in general.

            Associativity relates to a trick chip makers use to efficiently implement their caches: each memory address can only be in one of a few cache slots, not just anywhere in the cache. But it means caches can’t hold particular sets of addresses at the same time, even when all the values are recently accessed and would make sense to store. As the author says, that’s responsible for the blue lines here, which represent lower cache hit rates when you’re jumping through an array by certain “bad” distances in an otherwise cacheable workload. It’s kind of a corner case for that to happen, so not front-of-mind.

            You more often want to worry about whether you can make your memory accesses more cacheable by, e.g., shrinking your data or replacing random accesses with sequential ones, to avoid the penalites for uncacheable accesses that he also graphed.

        2. 2

          Are there other maps where you can show are O(1) in this way in practice?

          I’ll admit to laziness about this kind of checking but my impression lately has become that “O(1)” for maps/hashlist is obtained through an assumption that the expected size of the set to store remains constant. And as far as ignorant me can see, with that definition, lots of data structure which are called O(log(n)) could be called O(1). I assume I’m missing something but I’d love to see what. Or maybe I’m not and the story is that hash table are classified this way “because no one expects people to use them on an ever-growing set, silly” except that I often hear “use a hash table, faster than a tree”.

          “In the simplest model, the hash function is completely unspecified and the table does not resize.” – Meaning, you’re not adding more and more over time? I assume I’m missing something.


          1. 4

            “In the simplest model, the hash function is completely unspecified and the table does not resize.” – Meaning, you’re not adding more and more over time? I assume I’m missing something.

            That section goes on to say “Adding rehashing to this model is straightforward”. Rehashing, in this case, is growing the hashtable. Then the time it takes to grow the structure seems like it would keep inserts from being O(1), but one way of getting around that is what the section calls “geometric resizing”, i.e., you double the size of the hashtable each time you grow it. As the hashtable grows, the resizes become more expensive proportional to the size, but also less frequent proportional to the size. Those effects cancel out, leading to the “amortized” (average) O(1) cost.

            That general type of analysis touched on in the Wikipedia page on dynamic arrays and I tried to talk about it in a StackOverflow answer on Go’s append.

            Separately, there are clever ways to spread out the work of growing the hashtable to avoid long pauses, but that’s a distinct problem from achieving amortized O(1).

            Are there other maps where you can show are O(1) in this way in practice?

            Interestingly, lots of things that are O(1) in a comp-sci analysis might not look O(1) on the sort of benchmark that the Medium post’s author did. The short answer is, cache effects get in the way: in this case, memory accesses limited to a compact, cacheable region are faster than ones that are more spread out, because they’re served out of the processor’s L2 rather than slower main memory. That makes the graph look not-very-linear as the structures outgrow cache.

            This is a sort-of-summary-of/slightly different spin on the Reddit thread linked to from the update the author put atop the post, which is worth reading.

            1. 3

              This is written by a person that basically admits to his ignorance in his first sentence. Your link pretty much says it: the table not resizing is in regards to collisions. As the map fills up, the hash function will have collisions (and there are ways of dealing with these) but there exists a time when you should increase the size of the table and rehash everything. For analysis purposes, this could have been neglected and thus saying the simplest model.

              1. 2

                I want to object to your tone here. Yes, I think clearly say that I’m asking a question that I don’t know the answer to. Your sentence construction makes it sound like I’m making some ignorant claim.

                Unfortunately, you don’t answer my question but instead repeat a few things I’ve heard before. I still don’t know exactly how a set of continually increasing size can be accessed or insert in constant with the assumption that the hash table has a constant size.

                1. 7

                  I think he’s referring to the author of the linked article (who leads off by saying he’s a student), not you, but yeah, generally better to talk about the ideas than the people.

                  1. 3

                    I was talking about the original poster, not you.

                    The way it is O(1) is because the hash function generates the index to look up.

                    So if we have a hash table H that has a member ‘data’, when we go like:

                    H[“mykey”] it becomes H.data[hash(“mykey”)] so the access is the constant array access. Now in the case of collisions, where the hash function generates identical outputs for different input you have to deal with it in other ways.