Having just read the recent SQLite paper about past, present and future, as well as the SQLite/HE paper, I read this post with my mind primed. It seems like modern Databases are reaching a point of efficiency where the only way to improve is by having specific functionality and algorithms for specific data access patterns.
For instance, SQLite/HE is a separate SQLite query execution engine and data store specifically for queries it identifies as OLAP queries, which brings impressive performance gains to that specific subset of potential queries. And with roaring bitmaps, which I haven’t checked, but I imagine SQLite is using as well, it’s the same concept. Specific containers and join-algorithms depending on the data present, for impressive space efficiency.
I don’t really have much of a point to any of this, other than to be consistently amazed by all of the technology behind databases. There is so much at work to make my poor, inefficient SQL statements return quickly :)
I think that’s been one reason for the success of SQL: as a declarative language, it leaves the execution strategy up to the implementation. That allows for a wide variety of strategies focused on different scales and domains, which can all be hidden away from the SQL programmer.
These are interestingly … simple. I was expecting some super clever data structure or algorithm, but the original design is just “lets break the domain into small chunks, and have two different ways to store the integers in a chunk, one sparse and one dense, using the obvious data structure for each one.”
The later addition of run-length encoding makes it a bit more complex, but not too much. I have an IntSet class of my own that uses purely RLE; it’s based on a C++ std::map<int,int>, i.e. a sorted set of {start,length} pairs. This is a pretty fun programming puzzle to implement, with some interesting edge cases — it’s conceptually straightforward to implement methods like adding a range of ints, but you have to be careful with the coding.
And then of course there’s optimization. The C implementation of Roaring uses SIMD, which must get pretty involved.
Having just read the recent SQLite paper about past, present and future, as well as the SQLite/HE paper, I read this post with my mind primed. It seems like modern Databases are reaching a point of efficiency where the only way to improve is by having specific functionality and algorithms for specific data access patterns.
For instance, SQLite/HE is a separate SQLite query execution engine and data store specifically for queries it identifies as OLAP queries, which brings impressive performance gains to that specific subset of potential queries. And with roaring bitmaps, which I haven’t checked, but I imagine SQLite is using as well, it’s the same concept. Specific containers and join-algorithms depending on the data present, for impressive space efficiency.
I don’t really have much of a point to any of this, other than to be consistently amazed by all of the technology behind databases. There is so much at work to make my poor, inefficient SQL statements return quickly :)
I think that’s been one reason for the success of SQL: as a declarative language, it leaves the execution strategy up to the implementation. That allows for a wide variety of strategies focused on different scales and domains, which can all be hidden away from the SQL programmer.
These are interestingly … simple. I was expecting some super clever data structure or algorithm, but the original design is just “lets break the domain into small chunks, and have two different ways to store the integers in a chunk, one sparse and one dense, using the obvious data structure for each one.”
The later addition of run-length encoding makes it a bit more complex, but not too much. I have an IntSet class of my own that uses purely RLE; it’s based on a C++
std::map<int,int>
, i.e. a sorted set of {start,length} pairs. This is a pretty fun programming puzzle to implement, with some interesting edge cases — it’s conceptually straightforward to implement methods like adding a range of ints, but you have to be careful with the coding.And then of course there’s optimization. The C implementation of Roaring uses SIMD, which must get pretty involved.