University finished, so I have a bunch of free time to work on personal projects. This week I’m continuing work on a container runtime built from scratch. So far it’s been fun playing with the various low-level Linux APIs (namespaces, cgroups, overlayfs, etc.). Right now it can pull and ‘boot’ Docker images, but work needs to be done around populating /dev and using cgroups properly.
Quick demo of pulling images: https://asciinema.org/a/08p212uouswkc8h7j1p9z46bj
Yes, Ruby uses quicksort and Haskell uses mergesort. It’s really easy to find out, and important since different sort algorithms have very different properties.
Knowing the algorithm tells you the asymptotic behaviour, but that can be surprisingly useless: the constant factors from e.g. cache usage patterns can easily amount to multiple orders of magnitude (and you simply can’t understand those without understanding the details of the hardware; reasoning about the abstract Haskell/Ruby machine won’t tell you anything).
Have you ever switched to a custom sorting function because the one in your standard library was inappropriate?
Not personally. I would imagine that database implementations (including key-value stores) are the only software that needs to do this, these days. Knuth has an entire chapter on external sorts, optimized for different situations involving data that doesn’t fit in memory, but in the modern era we no longer write one-off storage backends with that level of complexity. I guess it’s nice to remember that computing can make real advancements sometimes…
If I know a sort is constrained somehow (e.g. a known interval of integers) I’ll use a radix sort or similar.
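For the bounded-integer case, a counting sort (the single-bucket special case of radix sort) is only a few lines. A minimal Python sketch, assuming the values are known to lie in an inclusive range `[lo, hi]`:

```python
def counting_sort(values, lo, hi):
    """Sort integers known to lie in [lo, hi] in O(n + k) time,
    where k = hi - lo + 1 is the size of the interval."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    out = []
    for offset, count in enumerate(counts):
        out.extend([lo + offset] * count)
    return out

print(counting_sort([3, 1, 4, 1, 5, 2], 1, 5))  # → [1, 1, 2, 3, 4, 5]
```

The trade-off is the O(k) memory for the count array, which is why it only pays off when the interval is known and small relative to n.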
Radix sort is kinda amazing when one can get away with it. :)
O(n) sorting is so sweet but I’ve never hit the case where my data was simultaneously that constrained and sorting it was a bottleneck.
I’ve occasionally had that case (constrained data, sorting is a bottleneck), though not often. The last time I remember it was when I was doing some stuff with the Netflix Prize dataset, which is a giant set of movie ratings, each of them an integer 1-5. For some machine learning algorithms (e.g. decision trees), sorting the data by an axis is a bottleneck. Although even in that case, careful fiddling with data representation to be more cache-friendly made a much bigger difference to my runtime than radix sort did.
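For data like that, one O(n) trick is to bucket the *indices* by rating rather than sorting the records themselves (a hypothetical sketch, not the actual Netflix Prize code; assumes ratings are integers 1–5):

```python
def argsort_ratings(ratings):
    """Return row indices ordered by rating (1-5) in O(n) time.
    Stable: rows with equal ratings keep their original order."""
    buckets = [[] for _ in range(5)]
    for idx, r in enumerate(ratings):
        buckets[r - 1].append(idx)
    return [i for bucket in buckets for i in bucket]

print(argsort_ratings([5, 1, 3, 1, 5]))  # → [1, 3, 2, 0, 4]
```

Producing a permutation of indices instead of shuffling the full records around is also one of those cache-friendliness tricks: the heavy rows are only touched once, in output order.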
Yes. For suffix array construction, or when I require a stable sort and the standard library uses a non-stable sort.
Yes, quicksort is often not the best choice because of its O(n^2) worst-case complexity.
The quadratic case can be easily avoided by using a non-dumb process to pick a pivot: random, median-of-three, etc.
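Median-of-three is cheap to sketch. A toy (not in-place) Python version, just to show the pivot choice that dodges the quadratic case on already-sorted input:

```python
def quicksort(xs):
    """Quicksort with a median-of-three pivot (first, middle, last).
    Already-sorted input no longer degrades to O(n^2), since the
    middle element keeps the partition balanced."""
    if len(xs) <= 1:
        return list(xs)
    pivot = sorted([xs[0], xs[len(xs) // 2], xs[-1]])[1]
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
```

An adversary who knows the pivot rule can still construct bad inputs, which is why random pivots (or introsort-style fallbacks to heapsort) are the more robust fixes.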
The bigger problem with quicksort is usually that it’s not stable.
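For anyone unsure what stability buys you: equal keys keep their input order, so you can sort by one key without scrambling a previous ordering. Python's `sorted` (timsort) guarantees this:

```python
records = [("carol", 2), ("alice", 1), ("bob", 2), ("dave", 1)]

# Sort by the numeric key only; records with equal keys keep
# their original relative order because timsort is stable.
by_key = sorted(records, key=lambda r: r[1])
print(by_key)  # → [('alice', 1), ('dave', 1), ('carol', 2), ('bob', 2)]
```

With an unstable sort (like plain quicksort) the order of `('carol', 2)` and `('bob', 2)` would be unspecified, which matters whenever you chain sorts on multiple keys.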
You should probably test this before throwing out the stdlib sort, though.
These days most, if not all, standard libraries use smarter versions of quicksort that eliminate the O(n^2) case, or at least make it much harder to hit. Many don’t even use quicksort any more, but other algorithms (like timsort) with similar performance and a guaranteed worst-case O(n log n) bound.
I know Python, my go-to language, uses timsort. Timsort is complex enough that I don’t really know what it is doing under the covers at any point in time, however.
If you’re more into reStructuredText, there’s also presentty: