1. 18

  2. 10

    I’ve used maps in a load of programs. I sometimes want the behaviour of C++‘s std::map, where the object have a total ordering (that I define) and iteration should reflect that ordering. I can’t think of a single time when I’ve wanted iteration order to be insertion order. About the only use case I can think of is if you’re implementing some kind of event coalescing mechanism, where you insert things into a set (i.e. don’t insert them if equivalent events are there already) and remove them in FIFO order but that’s such a niche use case that I’d probably want a custom data structure for it rather than trying to use the standard one.

    1. 6

      The benefit of insertion order iteration is that it is deterministic. Years of experience in JS basically said you can’t do anything that is semi deterministic without suffering pain. So the options are similar to Go, etc that deliberately randomize enumeration order or going fully deterministic.

      Couple that with the desire to support mutation while iterating a map/set and insertion order is the only thing that really makes sense.

      The latter requirement makes sense as for whatever reason mutating structures while enumerating them is a common idiom in web content so having new structures that don’t support the idiom would just result in people continuing to use objects as their associative arrays and simply add Map and Set to the things that they complain about :)

      1. 4

        The benefit of insertion order iteration is that it is deterministic

        I think that’s actually a problem, not a benefit, because it means that you can easily depend on a property of a specific data structure implementation. LLVM, for example, has a build mode that reverses iteration order for all container types for which iteration order is not defined. This catches places where people have accidentally depended on iteration order for deterministic output (so does testing on multiple platforms, because the order often depends on memory layout and that will differ between memory allocators). If you require iteration order to be explicitly specified (undefined, insertion order, total order of elements) as part of the data structure’s definition then you never get into the case of someone replacing a map with a different implementation and removing deterministic behaviour that you depend on. It’s a dangerous default because you can always move from undefined to defined (and implement undefined as defined in different ways on different platforms) without breaking code, but you can’t move from defined to undefined.

        1. 2

          JS is a language for the web, so nothing can be undefined. The only bits that still have real variation allowed are those that for historical reasons could not be merged without breaking existing content, and even those are edge cases (adding and removing properties on an object’s prototype chain while iterating the object properties being the classic example).

          So start from “the behavior must be defined” - for consistently random enumeration between engines you’d need to specify that different iterators of the same object would need different enumeration order, that enumerating the same object (without mutation) multiple times would require different enumeration order each time, etc

          And then we get to the elephant in the room - consistent enumeration in the face of mutations while iterating. You would need all extant iterators to visit new elements, and skip deleted ones, without visiting existing entries again, unless a property was removed and re-added.

          Order of insertion is not particularly expensive, works for disparate types for keys, and has an easily explained semantic effect, and also works cleanly when mutating while enumerating the same object.

      2. 6

        Sometimes I want a consistent order, and I don’t care what it is. If insertion order is convenient, then it’s probably good enough. For instance, I’ve worked with APIs that dumped a hashmap as JSON where the key order changes between every call. It’s usually just a minor nuisance, but it would be nice if it were consistent.

        1. 6

          Why do you care what the order of keys in a json object is? It’s explicitly non specified.

          In your case it’s probably an indication that the API uses a proper seeded hashtable internally, to avoid DoS issues.

          1. 15

            Why do you care what the order of keys in a json object is? It’s explicitly non specified.

            If I read some JSON file, modify it, and then write it back again then it’s quite nice if the key order just stays the same. Especially if you have them in a git repo so diffs are actually useful. This is not an uncommon scenario. For example yarn add pkg changing the entire order of package.json and package.lock wouldn’t be especially great. The same applies to other file formats that don’t specify an order such as YAML or TOML; in many of those cases it’s even more important as often the file is also intended for manual editing.

            1. 4

              From the spec, it seems that json objects can have duplicate keys, so in theory you’d need a multimap :-). Goes to show how underspecified json is. In my little corner of the programming world, people parse json objects into association lists, so the order is preserver anyway, but thank you for the relevant use case.

            2. 6

              Usually, it’s just annoying, like when trying to diff the output between calls. Although it can cause real problems, such as when the hash sum of the response is used as a cache key (not that there aren’t other ways to fix that).

              1. 4

                If you are using some software that generates json as its on disk representation (e.g., PowerBI) and you need to version control it you can have really bad diffs if it changes every time it’s loaded and saved. Makes it hard to do proper pull requests. That’s an example of a bad serialization format pick by the software maker, but something you just have to live with as a user.

                Edit: I just said the same thing as arp242. This is why you read the child comments before reply…

            3. 4

              I can’t think of a single time when I’ve wanted iteration order to be insertion order.

              One example is where you’re getting data from some external source (file, command output), don’t care about certain duplicates, and use a map to filter those out. But you do want to preserve the original order in which the output was given. Basically, what the uniq command does, except without the requirement that lines need to be adjacent. I want this on a somewhat semi-regular basis, and have written more “map + array for order” loops than I can remember.

              1. 2

                I think if I need something like this I’d start with a set for detecting duplicates and an array / vector for the canonical version. The set wants very fast probing, so I’m happy with something that’s optimised for random access and I don’t care if it supports iteration at all. As the data comes in, I check if it’s in the set, if it isn’t I add it there and append to the array. In C++, I’d probably use a facade object in the set that actually looked in the array for the data, which then looks basically like the data structure described in the article. If I needed this more than once, I could wrap it up in a reusable template class (and provide both the map and array types as template parameters so I could tune it for a particular use case).

                1. 4

                  I don’t really know C++, but in Go a map is the “obvious” and easiest data structure to use. In Python you can use sets, but a lot of times the performance difference (if there even is any, it will probably use more memory but might run a few ns faster) will be negligible and using a single dict will be easier. In Ruby or PHP you’d probably use a hash or array, etc.

                  Either way, my point is merely that this is actually a somewhat common-ish use case. And besides, if you’re going to preserve order then you might as well default to insertion order. A lot of times order won’t matter at all and insertion order is a reasonable non-surprising default.

                  1. 1

                    I think I’d then phrase this slightly differently to the article. It’s not that the map hates you, it’s that Go and Python hate you by privileging one map type over all others. The most efficient map implementations do not provide a stable sort, so by privileging one map implementation you’re left in a world where people either hate you because using an efficient map for their use case is more clunky or they hate you because using a map that has the properties that they need for their use case is more clunky.

                    The general lesson is: one-size-fits-all data structures are a bad idea. This is a lot more obvious for built-in strings than for built-in maps because a lot of languages define a single concrete string representation and the use cases for strings are very different across application workloads.

                    1. 2

                      As I mentioned in another comment: in the case of Python it was an accidental property that fell out of making dicts more efficient, so clearly it’s not all that horribly inefficient.

                      Aside: in Go maps aren’t ordered (explicitly randomized actually). I just used this as an example where it would be convenient.

              2. 3

                I 100% agree. If I want this particular behavior (a queue without duplicates) I do it by combining a queue and a hashset (my main use case is graph traversal). In general that might involve keeping things in the set even after they’re processed, which this jack-of-all-trades, master-of-none data structure fails to accomplish.

                Why do people insist on having one single data structure that does everything?

                1. 5

                  this jack-of-all-trades, master-of-none data structure

                  In the case of Python it was more of an accidental side-effect that fell out of a new dict implementation. People liked it and it was added to the language specification. It’s more memory-efficient and either equal or faster in runtime performance. It was inspired by the more performance-focused PyPy.

                  If it was slower or if there was some weird confusion between data structures like PHP’s “arrays” – which can be either an an “array” or “hashmap” with some surprising “magic” conversions and fairly high performance impact – then, sure, I’d agree. But nothing at all was sacrificed here, and it’s convenient in quite a few cases. There is no trade-off, no “mastery” was lost. It’s just a win.

                  1. 3

                    There’s a loss in that it’s now unnecessarily constraining future implementation options. Every promise is a loss of flexibility. Even if an order-preserving hash table was the best available implementation now, who is to say that it’s still the best one in two years?

                    1. 5

                      Given that Python has managed quite well without this more efficient implementation for almost 30 years, that this has been researched to some extent and spectacular breakthroughs in the foreseeable future are unlikely (though not impossible of course), and that insertion order is considered a desirable feature by many, I think the trade-off is a pretty good one.

                      1. 6

                        Python has managed quite well without efficiency in all areas. I wouldn’t look there if you’re looking for arguments about data structures where efficiency matters.

                        I also agree that insertion order is something that I’ve wanted incredibly rarely.

                  2. 1

                    “It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures.” - Alan Perlis

                    If you have one data structure you can have many types of transformations. It’s harder to get that level of depth with many data structures.

                    That said you do probably want more orthogonal data structures. Otherwise, you can have bad performance problems. E.g., using a linked list for everything in functional languages can have nasty performance issues.

                  3. 1

                    PHP preserves insertion order and that’s used for all kinds of things. It can be used to deduplicate arrays. It can be used to round-trip tree structures (e.g. edit a config file without sorting the keys). It’s even used for DSLs.

                    It can be thought of as a vector that comes with a fast lookup index. Of course you could achieve the same (and more flexible) with a vec and a separate hashmap, when it’s a single data structure it’s just easy, and there’s no need to worry about keeping consistency between vec and its index.

                    1. 1

                      In the implementations that I’m aware of (Python and Rust’s indexmap), this sort of hashmap tends to be faster in macro-benchmarks. This is likely due to iteration being faster over the dense array of KV pairs, and iteration is pretty common. Certainly that is worth caring about.

                      I definitely reach for indexmap all the time. Insertion order is useful sometimes, but sorting the map is also frequently useful. Also, just being able to quickly access KV pairs via indexing is a great way to avoid hashing in many cases.

                      1. 2

                        This is likely due to iteration being faster over the dense array of KV pairs, and iteration is pretty common.

                        That’s very workload dependent. A lot of the places I’ve used maps, I never iterate: I want associative lookup and that’s why I chose a map as the interface to the underlying data structure. In some uses, I really care about fast insertion and removal, in some I care about lookup performance at the expense of everything else. In some I care about a broader balance. If I care about iteration performance then I might want to use a map with guaranteed iteration order but for the use cases where I don’t care about iteration performance (I’d guess that most of the code I’ve written that iterates over a map does so purely for debugging / asserting invariants on the map contents) I don’t want a set of additional constraints.

                        Last time I benchmarked a map on anything I cared about, Hopscotch and Robin Hood maps had the best overall performance (one was slightly lower memory consumption, the other slightly faster throughput). The maps with guaranteed iteration order that I tried were measurably worse in both metrics.

                    2. 1

                      Go and Perl decided to fix this by making hash ordering more random.

                      I like the sound of this–if I can’t rely on the order, I’d rather know right away. Anyone with Go or Perl experience: do you find this helpful? Or harmful?

                      1. 2

                        I’ve always been aware that calling keys on a Perl hash will return the keys in an unpredictable order. Apparently “better” randomness was added explicitly because the actual ordering wasn’t entirely random, but could be predicted to cause denial-of-service in, for example, CGI calls.

                        Typing for my $key ( sort {$a <=> $b} keys %hash) {say $key} [1] is almost second nature to me now :D

                        [1] assuming the keys are numeric and I want them numerically sorted.

                        1. 2

                          LLVM doesn’t go that far, but does provide a build option to reverse iteration order of containers that don’t have a defined iteration order. All of the LLVM tests expect some deterministic output and all are tested by CI in configurations with both orders defined. It’s really helpful for catching places where folks accidentally depend on iteration order. Random would make it easier to catch these in local runs (just run the test twice and you’ll likely see failures) but unless you have a way of seeding the random number generator (and guaranteed ordering) it may be tricky to reproduce intermittent failures.

                        2. 1

                          The other feature you quickly find yourself needing is the ability to use arbitrary objects as keys in a hash map. In languages that don’t support that (Perl, Python, Go, JavaScript until recently) you can use strings as keys, generating a unique string for each key object.

                          Go does support arbitrary values as keys.

                          1. 2

                            With the caveat that the type must be comparable. For example, the type may not be a map, slice (array is ok) or a struct containing a map or a slice.