For anyone confused: this seems to be a Python binding for this Rust library https://egraphs-good.github.io/
E-graphs don’t seem to have been written about in a very accessible way; the closest things are the Wikipedia article and this notebook:
https://en.m.wikipedia.org/wiki/E-graph
https://colab.research.google.com/drive/1tNOQijJqe5tw-Pk9iqd6HHb2abC5aRid?usp=sharing
This project provides bindings for a newer variant of that egg library called egglog that was recently discussed on lobsters.
There is a nearly perfect solution, namely equality saturation.
From the tutorial:
You have to build your own equivalence relations which can be a lot of work. While nothing is perfect, this beats the heck out of doing this by hand.
In the article it is briefly mentioned that the Cranelift compiler backend is considering/planning to adopt equality saturation! It’s a very cool approach.
Sounds like a great way to replace your problem with “come up with a useful set of rewrites that can be explored before the end of the universe”. Sounds like a fun paper to read! Lemme give it a skim…
Upon reaching a fixed point (saturation), 𝐸 will represent all equivalent ways to express 𝑝 with respect to the given rewrites. After saturation (or timeout), a final extraction procedure analyzes 𝐸 and selects the optimal program according to a user-provided cost function.
So it searches until some termination condition is reached or until it times out. On the one hand it sounds like a potentially really good approach, and on the other it seriously pings my “all you need to do is tune your parameters right! Now you have two problems!” senses. I admit that being able to say “spend <= 5 seconds optimizing this program” or “spend <= 4 hours optimizing this program since this is running overnight in CI anyway” sounds pretty appealing. So, I’ll reserve judgement until I see it in practice more.
Author of that paper here. You are right that many sets of rewrites can spin out until the end of time. But, in practice, we’ve found that saturation is rarely necessary: you typically find what you’re looking for pretty early!
Obligatory ad for the Zulip if you’re into this kind of thing: https://egraphs.zulipchat.com/
How do you know you have what you’re looking for?
When you extract a term, you know it’s the best thing you know about so far. Unless you saturate, the next best thing might be “right around the corner”. But that’s true for a lot of optimization processes.
Would be interesting to see if you could hook up a SAT solver somehow, to prove that no better solution exists. I’ve thought about it in the more general case of ISAs, but that’s very hard to model for SAT; maybe it could be easier with egraphs.
I think you’re describing a variant of a superoptimiser?
I guess? There might be a difference from traditional ones though in that with e-graphs you can already get pretty close, and they might be easier and faster to model in solvers, since apparently they already use them for representation.
Thanks! I mostly spend my time in the /r/programminglanguages Discord for better or worse. People there mentioned that egraphs still have a hard time dealing with control flow equivalences though, have any good sources for discussion on the problems or solutions involved with that?
Yes, dealing with control flow is difficult. Ultimately, while the “put all your rewrites in and push go” story is easy to sell, I don’t think it’s realistic. I think where we need to go (and this is stuff we are currently working on) is figuring out how to combine egraphs with more conventional techniques. Being able to represent many programs “superimposed” on one another is just too good to give up. We have seen success in many domains with the simpler plug-and-chug approach, but it won’t scale to general-purpose languages. The Cranelift folks realized this and are still carefully selecting what things to do in the egraph.
Thanks, it’s nice to know about the gotchas as well. At worst it sounds like there’s potential for good and flexible transformations within basic blocks, which is still pretty useful. Is the problem that the number of valid transformations/combinations explodes when you start considering control flow and it’s just hard to deal with, or that it’s hard to measure which program variant is best, or something else?
…oh, this just occurred to me. If I can bug you more… do you know if anyone has tried applying egraphs to SIMD auto-vectorization, and how well it works if so? Seems like a place it could shine, given how hard vectorization is to represent by other methods. …I might just have to start digging through Google Scholar and compiler conference talks…
E-graphs very naturally represent expressions; representing statements and control flow (or anything else that requires context) gets awkward. It’s definitely possible, just needs more work!
Re: vectorization https://www.cs.cornell.edu/~avh/diospyros-asplos-2021-preprint.pdf
In practice, we just put rewrite rules into egg and let its scheduler handle rule selection and timeouts.
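For readers who haven’t seen egg before, here is a minimal sketch in the spirit of egg’s own documentation: a few rewrites are handed to the Runner (whose scheduler decides which rules to apply and when to stop, per the comment above), and the cheapest equivalent term is then extracted with a simple cost function. Assumes the egg crate as a dependency.

```rust
use egg::{rewrite as rw, *};

fn main() {
    // A handful of rewrites over egg's built-in SymbolLang (plain s-expressions).
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("commute-add"; "(+ ?a ?b)" => "(+ ?b ?a)"),
        rw!("commute-mul"; "(* ?a ?b)" => "(* ?b ?a)"),
        rw!("add-0"; "(+ ?a 0)" => "?a"),
        rw!("mul-0"; "(* ?a 0)" => "0"),
        rw!("mul-1"; "(* ?a 1)" => "?a"),
    ];

    // Seed the e-graph and run equality saturation; the default Runner stops
    // at saturation or when it hits its iteration/node/time limits.
    let start = "(+ 0 (* 1 a))".parse().unwrap();
    let runner = Runner::default().with_expr(&start).run(rules);

    // Extract the best term seen so far under an AST-size cost function.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (best_cost, best) = extractor.find_best(runner.roots[0]);
    assert_eq!(best, "a".parse().unwrap());
    assert_eq!(best_cost, 1);
}
```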
I’ve somehow not heard of e-graphs, interesting! The end of the POPL 2021 presentation mentioned projects using egg that involved “Java testing” and “educational problems”, does anyone know what these are?
Author here: those projects are a little old; I should update the list. The Java testing was a course project, and the educational-problems one was a project that I don’t think ultimately went anywhere (I was not involved in either directly).
Is anybody using this to implement the optimizer for a more-or-less general purpose programming language? (Instead of the standard approach of multiple different passes that must be ordered by hand.)
The cranelift folks are exploring using an e-graph (based on egg) in their compiler: https://github.com/bytecodealliance/rfcs/pull/27
Thanks for that reference. I’m looking forward to finding out how well this works in practice, for both code simplicity, and for performance.
I hadn’t seen that; very exciting!
I use egg in my implementation of Cammy. It is a short single-file optimizer with little magic.
The library and approach are fine, but the language needs to be intentionally designed for it.
Thanks, I’ll take a look. In what sense does the language need to be intentionally designed for it?
Cammy has trivial alpha- and eta-equivalence, and no lambda-abstractions. In fact, my jelly optimizer doesn’t know whether an atom refers to a builtin function or a composite. This was essential to getting equational reasoning which doesn’t need to manage scopes or shadowing. Also, Cammy is total and pure, so optimizations don’t have to speculate or otherwise be conditional on a context. Finally, Cammy’s textual syntax is a flavor of S-expressions which is parseable by Scheme, OCaml, and egg’s builtin SymbolLang, so I did not have to write a new AST type in Rust.
I’ve used maps in a load of programs. I sometimes want the behaviour of C++’s std::map, where the objects have a total ordering (that I define) and iteration should reflect that ordering. I can’t think of a single time when I’ve wanted iteration order to be insertion order. About the only use case I can think of is if you’re implementing some kind of event coalescing mechanism, where you insert things into a set (i.e. don’t insert them if equivalent events are there already) and remove them in FIFO order, but that’s such a niche use case that I’d probably want a custom data structure for it rather than trying to use the standard one.
The benefit of insertion order iteration is that it is deterministic. Years of experience in JS have basically shown that you can’t do anything that is semi-deterministic without suffering pain. So the options are either to deliberately randomize enumeration order, as Go and others do, or to go fully deterministic.
Couple that with the desire to support mutation while iterating a map/set and insertion order is the only thing that really makes sense.
The latter requirement makes sense because, for whatever reason, mutating structures while enumerating them is a common idiom in web content, so having new structures that don’t support the idiom would just result in people continuing to use objects as their associative arrays and simply adding Map and Set to the things that they complain about :)
The benefit of insertion order iteration is that it is deterministic
I think that’s actually a problem, not a benefit, because it means that you can easily depend on a property of a specific data structure implementation. LLVM, for example, has a build mode that reverses iteration order for all container types for which iteration order is not defined. This catches places where people have accidentally depended on iteration order for deterministic output (so does testing on multiple platforms, because the order often depends on memory layout and that will differ between memory allocators). If you require iteration order to be explicitly specified (undefined, insertion order, total order of elements) as part of the data structure’s definition then you never get into the case of someone replacing a map with a different implementation and removing deterministic behaviour that you depend on. It’s a dangerous default because you can always move from undefined to defined (and implement undefined as defined in different ways on different platforms) without breaking code, but you can’t move from defined to undefined.
JS is a language for the web, so nothing can be undefined. The only bits that still have real variation allowed are those that for historical reasons could not be merged without breaking existing content, and even those are edge cases (adding and removing properties on an object’s prototype chain while iterating the object properties being the classic example).
So start from “the behavior must be defined”. For consistently random enumeration between engines you’d need to specify that different iterators of the same object would need different enumeration orders, that enumerating the same object (without mutation) multiple times would require a different enumeration order each time, etc.
And then we get to the elephant in the room - consistent enumeration in the face of mutations while iterating. You would need all extant iterators to visit new elements, and skip deleted ones, without visiting existing entries again, unless a property was removed and re-added.
Order of insertion is not particularly expensive, works for disparate types for keys, and has an easily explained semantic effect, and also works cleanly when mutating while enumerating the same object.
Sometimes I want a consistent order, and I don’t care what it is. If insertion order is convenient, then it’s probably good enough. For instance, I’ve worked with APIs that dumped a hashmap as JSON where the key order changes between every call. It’s usually just a minor nuisance, but it would be nice if it were consistent.
Why do you care what the order of keys in a json object is? It’s explicitly unspecified.
In your case it’s probably an indication that the API uses a proper seeded hashtable internally, to avoid DoS issues.
If I read some JSON file, modify it, and then write it back again, then it’s quite nice if the key order just stays the same. Especially if you have them in a git repo, so diffs are actually useful. This is not an uncommon scenario. For example, yarn add pkg changing the entire order of package.json and package.lock wouldn’t be especially great. The same applies to other file formats that don’t specify an order, such as YAML or TOML; in many of those cases it’s even more important, as often the file is also intended for manual editing.
From the spec, it seems that json objects can have duplicate keys, so in theory you’d need a multimap :-). Goes to show how underspecified json is. In my little corner of the programming world, people parse json objects into association lists, so the order is preserved anyway, but thank you for the relevant use case.
Usually, it’s just annoying, like when trying to diff the output between calls. Although it can cause real problems, such as when the hash sum of the response is used as a cache key (not that there aren’t other ways to fix that).
If you are using some software that generates json as its on-disk representation (e.g., PowerBI) and you need to version control it, you can have really bad diffs if it changes every time it’s loaded and saved. That makes it hard to do proper pull requests. It’s an example of a bad serialization format pick by the software maker, but something you just have to live with as a user.
Edit: I just said the same thing as arp242. This is why you read the child comments before replying…
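As a Rust-flavoured aside, the round-tripping use case above is exactly where insertion-ordered maps come in handy under the hood. A small sketch, assuming serde_json with its preserve_order feature enabled (serde_json = { version = "1", features = ["preserve_order"] }), which backs JSON objects with an insertion-ordered map:

```rust
use serde_json::{json, Value};

fn main() -> serde_json::Result<()> {
    // Pretend this came from package.json on disk.
    let input = r#"{ "name": "demo", "version": "1.0.0", "private": true }"#;
    let mut doc: Value = serde_json::from_str(input)?;

    // Modify one field and append another.
    doc["version"] = json!("1.0.1");
    doc["license"] = json!("MIT");

    // With preserve_order, keys are written back in their original order
    // (new keys at the end), so the diff stays small.
    println!("{}", serde_json::to_string_pretty(&doc)?);
    Ok(())
}
```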
I can’t think of a single time when I’ve wanted iteration order to be insertion order.
One example is where you’re getting data from some external source (file, command output), don’t care about certain duplicates, and use a map to filter those out. But you do want to preserve the original order in which the output was given. Basically, what the uniq command does, except without the requirement that lines need to be adjacent. I want this on a somewhat semi-regular basis, and have written more “map + array for order” loops than I can remember.
I think if I need something like this I’d start with a set for detecting duplicates and an array / vector for the canonical version. The set wants very fast probing, so I’m happy with something that’s optimised for random access, and I don’t care if it supports iteration at all. As the data comes in, I check if it’s in the set; if it isn’t, I add it there and append to the array. In C++, I’d probably use a facade object in the set that actually looks in the array for the data, which then looks basically like the data structure described in the article. If I needed this more than once, I could wrap it up in a reusable template class (and provide both the map and array types as template parameters so I could tune it for a particular use case).
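For concreteness, a minimal Rust sketch of the “set for duplicate detection, array for order” approach just described, using only the standard library (the function name is mine):

```rust
use std::collections::HashSet;

/// Like `uniq`, but without requiring duplicates to be adjacent:
/// keeps the first occurrence of each line, in input order.
fn dedup_preserving_order(lines: impl IntoIterator<Item = String>) -> Vec<String> {
    let mut seen = HashSet::new(); // fast membership probing
    let mut out = Vec::new();      // canonical, insertion-ordered result
    for line in lines {
        // `insert` returns false if the value was already present.
        if seen.insert(line.clone()) {
            out.push(line);
        }
    }
    out
}

fn main() {
    let input = ["b", "a", "b", "c", "a"].map(String::from);
    assert_eq!(dedup_preserving_order(input), ["b", "a", "c"].map(String::from));
}
```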
I don’t really know C++, but in Go a map is the “obvious” and easiest data structure to use. In Python you can use sets, but a lot of times the performance difference (if there even is any, it will probably use more memory but might run a few ns faster) will be negligible and using a single dict will be easier. In Ruby or PHP you’d probably use a hash or array, etc.
Either way, my point is merely that this is actually a somewhat common-ish use case. And besides, if you’re going to preserve order then you might as well default to insertion order. A lot of times order won’t matter at all and insertion order is a reasonable non-surprising default.
I think I’d then phrase this slightly differently to the article. It’s not that the map hates you, it’s that Go and Python hate you by privileging one map type over all others. The most efficient map implementations do not provide a stable sort, so by privileging one map implementation you’re left in a world where people either hate you because using an efficient map for their use case is more clunky or they hate you because using a map that has the properties that they need for their use case is more clunky.
The general lesson is: one-size-fits-all data structures are a bad idea. This is a lot more obvious for built-in strings than for built-in maps because a lot of languages define a single concrete string representation and the use cases for strings are very different across application workloads.
As I mentioned in another comment: in the case of Python it was an accidental property that fell out of making dicts more efficient, so clearly it’s not all that horribly inefficient.
Aside: in Go maps aren’t ordered (explicitly randomized actually). I just used this as an example where it would be convenient.
I 100% agree. If I want this particular behavior (a queue without duplicates) I do it by combining a queue and a hashset (my main use case is graph traversal). In general that might involve keeping things in the set even after they’re processed, which this jack-of-all-trades, master-of-none data structure fails to accomplish.
Why do people insist on having one single data structure that does everything?
this jack-of-all-trades, master-of-none data structure
In the case of Python it was more of an accidental side-effect that fell out of a new dict implementation. People liked it and it was added to the language specification. It’s more memory-efficient and either equal or faster in runtime performance. It was inspired by the more performance-focused PyPy.
If it was slower, or if there was some weird confusion between data structures like PHP’s “arrays” – which can be either an “array” or a “hashmap”, with some surprising “magic” conversions and a fairly high performance impact – then, sure, I’d agree. But nothing at all was sacrificed here, and it’s convenient in quite a few cases. There is no trade-off, no “mastery” was lost. It’s just a win.
There’s a loss in that it’s now unnecessarily constraining future implementation options. Every promise is a loss of flexibility. Even if an order-preserving hash table was the best available implementation now, who is to say that it’s still the best one in two years?
Given that Python has managed quite well without this more efficient implementation for almost 30 years, that this has been researched to some extent and spectacular breakthroughs in the foreseeable future are unlikely (though not impossible of course), and that insertion order is considered a desirable feature by many, I think the trade-off is a pretty good one.
Python has managed quite well without efficiency in all areas. I wouldn’t look there if you’re looking for arguments about data structures where efficiency matters.
I also agree that insertion order is something that I’ve wanted incredibly rarely.
“It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures.” - Alan Perlis
If you have one data structure you can have many types of transformations. It’s harder to get that level of depth with many data structures.
That said you do probably want more orthogonal data structures. Otherwise, you can have bad performance problems. E.g., using a linked list for everything in functional languages can have nasty performance issues.
PHP preserves insertion order and that’s used for all kinds of things. It can be used to deduplicate arrays. It can be used to round-trip tree structures (e.g. edit a config file without sorting the keys). It’s even used for DSLs.
It can be thought of as a vector that comes with a fast lookup index. Of course you could achieve the same (and more flexibly) with a vec and a separate hashmap, but when it’s a single data structure it’s just easy, and there’s no need to worry about keeping consistency between the vec and its index.
In the implementations that I’m aware of (Python and Rust’s indexmap), this sort of hashmap tends to be faster in macro-benchmarks. This is likely due to iteration being faster over the dense array of KV pairs, and iteration is pretty common. Certainly that is worth caring about.
I definitely reach for indexmap all the time. Insertion order is useful sometimes, but sorting the map is also frequently useful. Also, just being able to quickly access KV pairs via indexing is a great way to avoid hashing in many cases.
This is likely due to iteration being faster over the dense array of KV pairs, and iteration is pretty common.
That’s very workload dependent. A lot of the places I’ve used maps, I never iterate: I want associative lookup and that’s why I chose a map as the interface to the underlying data structure. In some uses, I really care about fast insertion and removal; in some I care about lookup performance at the expense of everything else. In some I care about a broader balance. If I care about iteration performance then I might want to use a map with guaranteed iteration order, but for the use cases where I don’t care about iteration performance (I’d guess that most of the code I’ve written that iterates over a map does so purely for debugging / asserting invariants on the map contents) I don’t want a set of additional constraints.
Last time I benchmarked a map on anything I cared about, Hopscotch and Robin Hood maps had the best overall performance (one was slightly lower memory consumption, the other slightly faster throughput). The maps with guaranteed iteration order that I tried were measurably worse in both metrics.
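For readers who haven’t used it, here is the indexmap pattern mentioned a few comments up (insertion-order iteration, positional access, in-place sorting). A small sketch assuming the indexmap crate as a dependency:

```rust
use indexmap::IndexMap;

fn main() {
    let mut scores: IndexMap<&str, u32> = IndexMap::new();
    scores.insert("banana", 3);
    scores.insert("apple", 5);
    scores.insert("cherry", 1);

    // Iteration follows insertion order, not hash order.
    assert_eq!(scores.keys().copied().collect::<Vec<_>>(), ["banana", "apple", "cherry"]);

    // Positional access into the dense key-value array, no re-hashing needed.
    assert_eq!(scores.get_index(1), Some((&"apple", &5)));

    // The map can also be sorted in place when a total order is wanted instead.
    scores.sort_keys();
    assert_eq!(scores.keys().copied().collect::<Vec<_>>(), ["apple", "banana", "cherry"]);
}
```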
I agree the String thing is confusing; in fact the author didn’t list quite a few of the different string types that exist, and listed nowhere near the number of string conversions or ways to get string-y arguments. However, it’s one of those cases where the underlying problem Rust solved with this confusion actually exists in (nearly?) all languages. Rust, in typical Rust fashion, just makes you aware of all the footguns up front. Once you wrap your head around when to use each type, or what their various tradeoffs are, it makes perfect sense. So much so that I’ll get frustrated with other languages that paper over these details, leading to bugs. Bottom line: strings in general are hard, really hard.
I think the distinctions that Rust makes are useful and necessary. However, I think one of the problems is that the types are confusingly named. I think String should have been called StringBuf, and OsString OsStringBuf, just as you have Path and PathBuf.
I think an additional problem that makes slices ([T]) and string slices (str) difficult to understand is that they are unsized, built-in types. So, people have to understand the difference between e.g. &str and str, and why you cannot just put str in e.g. a struct. I know that there are good reasons for why string references are as they are, but I think from a learning-curve perspective it would have been easier if string slices were something along the lines of a simple Copyable type:
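The snippet that comment refers to is missing from this copy of the thread; judging from the reply below (a buf pointer plus a range, one word larger than &str), it was presumably something along these lines. Field and type names here are my guesses, not the original code:

```rust
/// Stand-in for what the thread calls `StringBuf` (i.e. today's `String`).
struct StringBuf {
    data: Vec<u8>, // UTF-8 bytes
}

/// A hypothetical `Copy`able string slice: a pointer to the owning buffer
/// plus a byte range into it (three words, one more than `&str`).
#[derive(Clone, Copy)]
struct StringSlice<'a> {
    buf: &'a StringBuf,
    start: usize,
    end: usize,
}

impl<'a> StringSlice<'a> {
    fn as_str(&self) -> &'a str {
        // Extra indirection: go through `buf` to reach the bytes.
        std::str::from_utf8(&self.buf.data[self.start..self.end]).unwrap()
    }
}

fn main() {
    let owned = StringBuf { data: b"hello world".to_vec() };
    let slice = StringSlice { buf: &owned, start: 6, end: 11 };
    assert_eq!(slice.as_str(), "world");
}
```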
Having references to unsized slices is necessary to avoid a performance hit. The StringSlice type above is 1 word larger than &str (which is just a pointer and a length). More importantly, it has an additional layer of indirection: buf points to the StringBuf, which points to the data, while &str points directly at the relevant data.
You don’t have to convince me. Like I said, there are good reasons for the current representation. It’s just that it makes (string) slices more opaque.
This is also quite the opposite of many other types in Rust, which are transparent and can be understood by just reading the standard library.
One of my favorite examples is the BinaryHeap::peek_mut method, which would be completely unsafe in another language (since you can modify the tip of the heap, which invalidates the heap property), but in Rust it can be done without any magic. The borrow system ensures that you can only have one mutable reference (so no one else can have a view of the heap while the heap property is temporarily broken), and the Drop implementation of PeekMut takes care of restoring the heap property when necessary.
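For readers who haven’t seen it, a tiny runnable illustration of that pattern, using only the standard library:

```rust
use std::collections::BinaryHeap;

fn main() {
    let mut heap = BinaryHeap::from(vec![1, 5, 3]);

    // `peek_mut` hands out a mutable view of the greatest element.
    if let Some(mut top) = heap.peek_mut() {
        // This temporarily breaks the heap property...
        *top = 0;
    } // ...which `PeekMut`'s Drop implementation restores right here.

    // The heap is valid again: 3 is now the greatest element.
    assert_eq!(heap.pop(), Some(3));
    assert_eq!(heap.into_sorted_vec(), vec![0, 1]);
}
```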
Tom7 (of SIGBOVIK fame) has a similar project on “reverse-emulating” the NES: https://youtu.be/ar9WRwCiSr0
There is some really cool research out there that combines programming and direct manipulation: https://ravichugh.github.io/sketch-n-sketch/
Thanks a lot for the link, I’d never heard of it and it’s super relevant (especially the list of related papers). How did you hear about that project? I’m surprised I missed it!
There was an article posted to lobste.rs in March that links to sketch-n-sketch, and a few other systems. That’s where I saw it. https://lobste.rs/s/8tkitt/end_user_programming
Ha, I just skimmed through it at the time and missed the reference. I’m glad it came back!
I’m a PhD student, and we’ve read that paper in our reading group. I’ve also seen the talk for this work. It’s a pretty impressive demo!
I’m excited by this, but a little turned off by having to run a watch instance for each project. Thankfully, it looks like people are already on it!
As someone who’s been toying with making a game for fun for a while, I’ve been excited by ggez. It’s very easy to get started with something on the screen (just like LÖVE, as advertised). But you can also write tight loops and such and not worry about performance as much.
Glad to see the author prioritizing their own needs and taking a break. Hopefully the community picks this one up (I’d like to as well!).
Please note this is from April; a lot has happened.
This week, the networking WG is supposed to be making some posts overviewing the state of play as it is today. We have landed futures in the core library as well as the first implementation of async/await in the compiler. Still more work to do, though!
Exciting! What’s the best way to stay up to date on all this?
This Week in Rust collects the latest news, upcoming events and a week-by-week account of changes in the Rust language and libraries.
The Rust Blog is where the Rust team makes announcements about major developments.
And nearly everything happening in Rust is discussed on the unofficial subreddit, /r/rust.
Probably following the network working group newsletter; one should be out this week with a summary of where we’re at with everything, and what is left to be done. I’ll make sure it gets posted to lobste.rs.
Here’s an update on this: https://lobste.rs/s/anlitu/futures_0_3_0_alpha_1
So what’s the current state regarding implicit vs. explicit execution? Last time I checked there were both explicit executors and poll.
I think you’re asking about Tokio; the standard library doesn’t provide an executor. The executor is implicit. You still call tokio::run, but that’s it. See here: https://tokio.rs/blog/2018-03-tokio-runtime/
poll is still the core of how Futures works. https://doc.rust-lang.org/nightly/std/future/trait.Future.html
So future.map(...).filter(...) won’t start executing until it is polled explicitly? I found the documentation to be somewhat silent on that.
Yep!
Thanks, good to hear.
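This exchange predates stable async/await, but the laziness being discussed still holds in today’s futures 0.3 world: a future does nothing until something polls it. A minimal sketch, assuming the futures crate as a dependency:

```rust
use futures::executor::block_on;

fn main() {
    // Building a future runs none of its body yet; it's just an inert value.
    let fut = async {
        println!("now running");
        21 * 2
    };

    // Driving it with an executor is what actually polls it to completion.
    let answer = block_on(fut);
    assert_eq!(answer, 42);
}
```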
What’s the benefit of this as opposed to a more “modern” protocol like matrix?
Generally the newfangled stuff has to be justified and not the other way around.
The newfangled stuff is justified by the loss of mindshare IRC has experienced in favour of apps like Slack or WhatsApp.
Personally, I just use Keychain on iOS and OS X. It takes very little setup, it’s definitely secure enough for my personal use case, and it Just Works. Generating and storing random passwords across all my iDevices couldn’t really be easier.
What exactly is Pony, again?
From their site: “Pony is an open-source, object-oriented, actor-model, capabilities-secure, high performance programming language.”
Why don’t you filter out every tag instead of the ones you want to see, temporarily? I do not think it is often the case that people want to see a particular combination of tags… but just for that day.
Consider the compilers and compsci example. If there’s just a few tags I want to look at, filtering out all the rest (and then unfiltering when I’m done) is rather cumbersome.
I finished The Martian last night. It’s an amazing book. It’s sci-fi, but with enough realism in it to make it sound plausible. The stranded man is an engineer, so there are many detailed descriptions of how he fits things together to make new things he needs.
Read this last week; it only took a few hours as it pulled me right in. Fantastic book, excited for the movie; though I’m a bit weirded out that Dr. Mann is going to be stranded on Mars…
I’m excited to start it; it’s next on my (ever-growing) list after I finish Infinite Jest.
How is the writing itself? I’ve heard great things about the realism in the sci-fi, but not much about the story or prose outside of the technical details.
The writing is good. It’s compelling. The author explains things clearly and pretty much every chapter ends on a cliffhanger.
There’s for sure sections of the book that read like a really interesting spreadsheet…but it doesn’t fall into that pit of genre fiction where they spend too much time on genre and not enough time on solid writing. It’s good.
I’d recommend Rocket Girls for something vaguely similar. Also the nonfiction Moondust: In Search of the Men Who Fell to Earth.
Well, today I learned that set -o vi is going into all my .bashrc’s from now on.
One thing I’ve found useful is ctrl+x ctrl+e to edit the current command in $EDITOR. That way I can leave the emacs-style keybindings at the shell, as you’ll find those everywhere, but you can still edit unwieldy commands in real vim.
Best part: Bash has it by default.
Or, in your /etc/inputrc or ~/.inputrc:
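The snippet that presumably followed is missing here; readline’s equivalent of bash’s set -o vi is a line like this (my reconstruction):

```
# in /etc/inputrc or ~/.inputrc: use vi editing mode in anything using readline
set editing-mode vi
```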
I am (and have been for the past few weeks) still working through Infinite Jest. I’ve been enjoying it for the most part, though there have definitely been some parts I’ve had to slog through.
I’ve wanted to start reading some more educational material, particularly around PL and compilers, but I’ve been a little daunted by the idea of reading it in parallel with IJ. Suggestions on what and how to read would be appreciated!
I’ve heard only good things about Let’s Build a Compiler. Anyone with further good things about it to share with us?
Thanks for the reference, although I’m looking for something a couple steps beyond that as I’m already familiar with type theory and basic compiler implementation.
And good things about IJ? It’s an unprecedented and fascinating look into addiction, depression, and American views on these topics. Those are just the central themes (I’d say, anyway), and the author has much more to say along the way.
I read the Dragon Book (Compilers: Principles, Techniques, and Tools, Aho et al.) at the time. It’s quite good (it’s considered the seminal text on these matters), though if you’re already familiar with the basics, I don’t know if it’ll help you much.
Is anyone out there using Atom as their primary editor? I’d love to hear something from that point of view.