1. 62
  1. 19

    I think this has missed the point that is to central to modern databases - the memory hierarchy means that random access is expensive once your data-structures grow larger than available cache. This is especially true for writing to disk, where to maximize bandwidth you want to be writing to the disk in blocks at a time, but the same concerns apply to in-memory computation too.

    In memory code that processes large amounts of data non-linearly (eg joins) is typically best written either using sorting, or using hashing on cache-sized partitions of the data. A lot has been written in the database field about the duality of hashing and sorting (eg https://www.researchgate.net/profile/Ingo_Mueller/publication/274952743_Cache-Efficient_Aggregation_Hashing_Is_Sorting/links/5583ee1e08ae89172b86228c/Cache-Efficient-Aggregation-Hashing-Is-Sorting.pdf) and the preference often comes down to details of the hardware.

    One can think of btrees and hashtables as being methods of incrementally sorting or hashing data respectively. For large data, even if you use hashes you still need some way to group writes rather than spreading them randomly across memory. Hybrid methods with tree-like structure and per-node hashing are often viable eg https://db.in.tum.de/~leis/papers/ART.pdf or http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf.

    1. 3

      Indexes on things that’re correlated with time (ID, timestamp) tend to concentrate recently added data towards the “end” of the table/index, which is a big help when you have years of data but mostly people are asking about the last few weeks.

      It gets a couple sentences in the post, but prefix searches are key (ha!): a sorted index on (foo, bar, baz) helps not only searches on all three attributes but also foo=x and bar=y or just foo=x. It can also be useful when you want to look things up by a prefix of an individual column: filter to a given day or month using a timestamp index, or filter to errors reported from client version 3.2.x. As a “covering index” it can somewhat help with queries that limit to foo=x and do something with bar (filter on it, return it, sum it, etc.). Finally, it helps with multiple scenarios for aggregation, not only searching: a timestamp index can let you get a count of orders by day, hour, or second without needing to make another temporary table/index.

      Often the crucial thing for keeping a DB humming is not getting the last couple percent out of already-fast operations but making sure you don’t have any disastrous queries, so an index being versatile for many kinds of search is especially valuable.

      Related to that last point, databases are trying to prepare for queries that haven’t even been written yet, while when you’re coding some algorithm that involves a hashtable, you already know exactly what lookup you’re going to do a few lines later. That on its own is going to push you towards more flexible types of indexing.

      1. 2

        This sort of complexity trade-off makes me wonder if there is some way to embed database query optimization while working with collections …

        That is the kind of things that is allowed by SRFI-167 That is you can work with an in-memory implementation (the sample implementation). At the end of the day it is only a tree, but the procedure that are exposed (inspired from FoundationDB) are much more explicit about how to compose those procedure to build bigger / higher abstractions.

        Also, I plan to use SRFI-167 and SRFI-168 as the store in a ReactJS application, that is a purely in memory use-case.

        1. 1

          I think in the case of C the answer is more mundane: it’s easier to write a basic hash table than a b-tree map. Without access to generics and/or easy dependencies you end up writing your own containers, and you won’t write a more complex one if you don’t have to.

          But memory in modern systems is not really random access any more, and b-tree may be a better choice: http://dtrace.org/blogs/bmc/2018/09/28/the-relative-performance-of-c-and-rust/