I think there are three interesting ideas in this kind of system:
The first is the observation that underlies bloom filters, skip lists, and a lot of other data structures: If you have a very cheap way of getting a half-arsed almost answer, then you can often do an O(n) thing to get from your approximation to the right answer and still outperform a formally correct algorithm because your guess reduces n to a much smaller value. That’s not a particularly new observation but it is often overlooked.
The second is that most data is not a uniform random distribution. If you know something about the distribution of the data then you can often do better than an algorithm that is optimal for a uniform distribution. Again, this is not particularly novel, and is often used in real-world sorting algorithms in systems software, to special-case common values. Outperforming a radix sort is nice, but it’s quite common in systems software to do a bucket sort, which is O(n), to get the most common values out of the way and then do something slower on the values that end up in the slow-path unusual-case bucket.
The third, which was actually the conclusion of some of my PhD work so one that I’m quite biased towards, is that you can often apply some fairly simple and cheap machine learning to figure out enough of the characteristics of the data that you can apply the first to observations to get a rough guess that’s mostly good-enough and fall back to something slower for when the guess is wrong. If your data has characteristics that you can look at and figure out how to design an optimised data structure then you may well be able to have a machine infer these characteristics.
The thing I really like about some of the recent work in this area is that they’re starting to separate these three aspects.
Do you have any suggested materials or papers for number three? I’ve seen simple techniques applied in my problem domain very successfully a handful of times, but trying to develop intuition around what set of techniques to consider and when is difficult since so much of the literature is devoted to much grander forms of ML.
Not sure if they fit your need, but here are 2 articles
I don’t, unfortunately, other than the ones cited in the article.