1. 44
  1. 24

    They’re ordered but considered equal with different orders, unlike an OrderedDict which is an interesting property.

    1. 15

      It would be crazy not to. Adding guaranteed ordering is purely adding a property (backwards compatible). But changing equality would be removing a property, which is backwards breaking. The migration path would be obnoxious as hell. Glad they didn’t.

      1. 3

        Indeed. It is sensible. The first question I asked myself after seeing this headline was “Why OrderedDict anymore?” and it turns out this was a thing I instantly discovered. Of course the docs for OrderedDict do include more than that as a difference:

        • The regular dict was designed to be very good at mapping operations. Tracking insertion order was secondary.
        • The OrderedDict was designed to be good at reordering operations. Space efficiency, iteration speed, and the performance of update operations were secondary.
        • Algorithmically, OrderedDict can handle frequent reordering operations better than dict. This makes it suitable for tracking recent accesses (for example in an LRU cache).
        • The equality operation for OrderedDict checks for matching order.
        • The popitem() method of OrderedDict has a different signature. It accepts an optional argument to specify which item is popped.
        • OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint.
      2. 1

        So, A == B implies f(A) == f(B) goes out of the window for one of the fundamental data structures just like that, then.

      3. 11

        So if you want to discuss fundamental differences you can pretty much only point out that dict values are accessible by keys, which could be of any immutable type, while list values are indexed with integers. That’s it :-)

        You can’t insert arbitrary keys into a list. You have to “insert” (append) in key order. That seems like a pretty fundamental difference to me!

        1. 5

          Yes, and in particular, if you insert an element into the middle of a list, all of the “keys” of the subsequent elements spontaneously change. Not very map-like!

        2. 6

          which could be of any immutable type

          That should read “of any hashable type” … though I’ll admit it’s not a great idea, if it can be hashed it can be used as a key in a dict.

          class WTF(dict):
              def __hash__(self):
                  return hash(str(self))
          w = WTF(whut = "?")
          d = {w: "daheck?"}

          Note that w is definitely mutable.

          1. 2

            I specifically refrained from going there to avoid the discussion whether mutable types should be implemented as hashable. It’s only tangentially related to the point of the post.

            1. 2

              Fair enough, I just point it out because as it the line leaves the reader with the impression that keys are much more restricted than they actually are.

          2. 4

            Not sets though! Yep, to my surprise sets are still unordered

            Doesn’t that mesh with the mathematical definition of a set?

            Just looked it up. Seems they can be either ordered or unordered. Today I learned something :)

            1. 4

              Not saying this is bad per se, but I’m not sure it’s a good idea either. How many bugs will there be now where people rely on the new behaviour and then the code breaks on older versions? :/

              Sure, it’s the coder’s fault, but “dicts are unordered” has been like that for.. I don’t know? 20 years? in Python, and 10 years in Python3…

              1. 1

                In what situation would you rely on something being unordered and then be surprised by it now being ordered? I guess if you’re relying on “random” values from calling something like next() on a dict?

                1. 2

                  I think he means “people assume it’s ordered, so if you run it on python 3.4 it breaks.”

                  1. 2

                    No, I meant the other way round. People notice “oh, it’s ordered now, I can rely on this” until they use an older version of Python.

                    I didn’t check if and how long 3.5 (if I got it right this came in 3.6) will be around, but maybe it will be just as bad as Python 2, if it’s the system install of LTS releases.

                    Not a huuge thing though.

                    1. 2

                      Not to worry, in 2 years any mention of Python versions newer than 3.6 will be consigned to the memory hole…

                2. 4

                  My gut feeling tells me it’s mostly due to the fact that a hash value of an int is the int itself, so there’s no time wasted on hashing.

                  Oh wow I hope not. Is this actually true in CPython?

                  1. 6

                    This is the most sensible implementation as you want to avoid collisions in a hash table. It isn’t supposed to bear any cryptographic properties if that’s your concern. Here’s more: https://github.com/python/cpython/blob/master/Objects/dictobject.c#L134

                    1. 5

                      It’s not the most sensible implementation, because simple key patterns cause collisions that never resolve, even when resizing the hashtable. The comment you linked specifically mentions this pathology, and the numerous ways it destroys performance.

                      The rest of the comment describes how CPython mitigates the issue by adding weak integer hashing to its collision probing. At first I thought integer keys were never hashed at any point, hence my surprise.

                      From the comments it sounds like sequential integer dict keys are somehow common in Python, which I don’t understand. But I don’t write much Python.

                      1. 6

                        From the comments it sounds like sequential integer dict keys are somehow common in Python, which I don’t understand. But I don’t write much Python.

                        While you can have a dict with keys of any hashable type – and a single dict may have keys of many types – the most common case, so overwhelmingly more common that it’s almost not even worth thinking about other cases, is a dict whose keys are all strings. This is because, sooner or later, basically everything in Python is backed by a dict. Every namespace is backed by a dict with string keys (the names defined in that namespace). Every object is backed by a dict with string keys (the names of the object’s attributes and methods). Keyword arguments to functions/methods? Yup, dict. In comparisons of languages by their “one big idea”, Python is sometimes described as having its big idea be “what if everything was a string-keyed hash table”?

                        Anyway. This is so common that Python goes out of its way to have special-case optimized implementations for the case of a dict whose keys are all strings (and for what it’s worth, in CPython as of Python 3.4, str is hashed using SipHash-2-4).

                        As to hashing of numeric types, it’s a bit more complicated than “ints hash to themselves”. Here’s what the Python documentation has to say. For the specific case of int, you can think of it as reducing to hash(n) == hash(n % sys.hash_info.modulus), where in CPython sys.hash_info.modulus is 2^61 - 1 on systems with 64-bit long and 2^31 - 1 on systems with 32-bit long.

                        While I don’t have a way of being certain, I suspect the linked comment’s note that the hashing of int is “important” has to do with the importance of real-world int key values being unlikely to collide with the hashes of other common real-world key types.

                        1. 1

                          In comparisons of languages by their “one big idea”, Python is sometimes described as having its big idea be “what if everything was a string-keyed hash table”?

                          I’ve always thought of PHP’s “one big idea” as “What if everything is an array” where array means PHP’s strange half-dict half-list interface that funnily enough Python is now one small step closer to.

                      2. 1

                        Avoiding collisions isn’t as important as using up a larger % of the spots before you need allocate and move things, I believe.

                        1. 3

                          Aren’t those the same thing? Less collisions implies you can go longer without expanding.

                          1. 1

                            It depends on the exact implementation, but in my understanding, not exactly; you also want a good distribution between your buckets, even if there are patterns / non-random distributions in the actual encountered keys. It might waste space rather than time, but it’s still not great.

                            1. 3

                              Python’s hash table isn’t implemented as an array-of-buckets. It’s a single contiguous array into which you insert a new element at the position determined by the hash of its key, and if that position is occupied you try the next one in a pseudo random order. Same with lookups: you try entries in succession until you find the one that equals (it’s usually the first one). And this is why the number of free spots and the probability of collisions are directly related.

                      3. 2

                        it is:

                        Python 3.7.6 (default, Dec 21 2019, 11:56:31)
                        [Clang 10.0.1 (clang-1001.0.46.4)] on darwin
                        Type "help", "copyright", "credits" or "license" for more information.
                        >>> hash(2)
                        >>> hash(37)
                        >>> hash(892474)
                        1. 8

                          Almost! hash(-1) returns -2.

                          Python 3.8.1 (default, Jan  8 2020, 23:09:20)
                          [GCC 9.2.0] on linux
                          Type "help", "copyright", "credits" or "license" for more information.
                          >>> hash(-1)
                          >>> hash(-2)
                          >>> hash(-3)
                          1. 4


                            do you happen to know why?

                            1. 7

                              Ah, it’s because the C API function uses -1 as an error code. It goes deeper than that too:

                              In [1]: class yolo:
                                 ...:     def __hash__(self):
                                 ...:         return -1
                              In [2]: y = yolo()
                              In [3]: hash(y)
                              Out[3]: -2
                          2. 2

                            I’ve heard that this is a somewhat common way to implement hashing for ints, but I don’t understand why it’s a good idea. Isn’t hash collisions terrible for hash tables? And isn’t a somewhat common key pattern “some number with some low bits masked”? And wouldn’t that be a pathological case for a hash table which grows with a factor of 2?

                            Are hash table implementations which does hash(x) = x somehow better at handling collisions than most hash tables, or do they just hope that the ints people want to put in their tables have high entropy in the lower bits?

                            1. 3

                              IIRC there is some sort of random salt added to it and it goes through some internal hash for the actual hash table, since there was a DoS attack by abusing worst case scenario over HTTP requests.

                          3. 2

                            Why would that be a problem?

                          4. 3

                            I just want to mention that this is exemplary Lobste.rs content. It’s clearly written, interesting, and the author engages in feedback on the page itself, which is good for historical reasons.

                            I was inspired to Google a bit for how Perl handles this stuff: https://www.perl.com/pub/2002/10/01/hashes.html/

                            1. 2

                              Thank you very much for saying thank you! I should write some more, it seems :-)

                            2. 2
                              1. 2

                                Ooh, this makes me uncomfortable. The basic rationale of structure-sharing is interesting, but that is deeply a performance decision. The use-cases people have talked about are also interesting, but I don’t think most use of dicts relies on or even benefits from this behavior.

                                My preference would have been to expose OrderedDict as a supported implementation (if it wasn’t already), but left it behind the Dict abstraction.

                                If a future version of Python wants a different behavior, this will be impossible to safely undo at scale, since there’s nothing static analysis can do to check for reliance on this behavior.

                                1. 3

                                  Yeah, the not-constraining-future-implementations way to go would have been to keep the order unspecified and to actually add randomness to the hash iterator at the same time as the storage was made deterministic, so as to render any code that depends on the ordering obviously broken.

                                  For a lot of years, Perl advertised hashes as being unordered, but the reality was that the retrieval order on iteration was deterministic in 99.99% of cases, and a few packages, entirely by accident, ended up with behavior reliant on this (mostly test cases relying on output being in a certain order, but also a few cases where e.g. an object depended on one of its slots being initialized before another, and that just happened to be true with the fixed “random” order). Then, in order to harden Perl against potential DoS attacks, a change came along that perturbed all hash values using a random seed chosen at process startup. And it got pushed through and released by vendors pretty quickly because it was considered a security fix. Needless to say, all of those buggy packages that were relying on the previous ordering got detected very quickly :)

                                2. 1

                                  Ruby also figured that out some time ago.


                                  Next Generation Shell

                                  (author here)

                                  Hash (a dictionary) was ordered from the beginning.

                                  $ ngs -pi '{"a": 1, "b":2, "c": 3}.keys()'
                                  Arr of size 3
                                    [0] = a
                                    [1] = b
                                    [2] = c

                                  Set is also ordered and in NGS it is thin wrapper around Hash (79 lines of NGS code, methods documented at https://ngs-lang.org/doc/latest/generated/Set.html )

                                  In both cases the comparison is order-sensitive. Thinking of it now, while it does make sense that Set comparison would ignore order, I didn’t have the use case yet. I guess I might need to change it sometime. Meanwhile less elegant workaround exists:

                                  $ ngs -pi 's1=Set([1,2]); s2=Set([2,1]); s1<=s2 and s2<=s1'
                                  Bool: true
                                  1. 2

                                    Thinking of it now, while it does make sense that Set comparison would ignore order, I didn’t have the use case yet.

                                    I often use it in tests where I need to compare a set of actual results with expected ones, and the function under test produces result in an undefined/changing order.

                                    1. 1

                                      Mmm.. I didn’t have this situation yet. Probably because both Hash and Set are ordered and other algorithms I have at the moment are deterministic. I guess I can have this situation in future, when dealing with external input.

                                      Interesting that meanwhile my mini test framework has assert_hash_keys_values() that does check that expected key-value pairs are in actual input but does not check that actual has no extra fields. I guess I made it for future-proof tests.

                                      expected.each(F(k, v) {
                                      	assert_base(actual, "key '$k' with value '$v'", {has(actual, k) and actual[k] == v}, "to have", title)

                                      Source: https://github.com/ngs-lang/ngs/blob/v0.2.7/lib/autoload/test.ngs#L145-L155

                                  2. 1

                                    Hurray! That was one gotcha that made switching from PHP to Python harder.