First, Julia ought to get some MacArthur-style grant that includes boxes with many gigs of RAM and copious credits for zine printing. I think the Internet would benefit.
Second, on her experience of SQLite being too big a jump, I have often been bugged by the sparsely-populated space in between data structures in RAM and a database. Maybe I need a bigger-than-RAM lookup table but not a query planner or transactions. Maybe I just want to join two possibly-big datasets or sort one of them. I am not sure why–maybe efficient serde is too often language-specific or absent?–but there’s tons of battle-tested in-heap stuff, the chaotic In-Between, then tons of battle-tested full-fledged DBs. Does the world need a toolkit of lots of deconstructed database pieces?
To be fair:
She notes it might’ve been possible to coax a better execution plan out of SQLite. In a way that still meshes with the whole “give me a plain lookup table” point–with a lookup table, you do the query planning!–but it’s still a slightly different story if SQLite could have done it faster than it did.
One thought–in the kind of table she describes, a condition like ip between start_ip and end_ip could lead to scanning all rows where start_ip <= ip. If so, a non-scan-y alternative would be to look for the row (if any) with the highest start_ip <= ip (so, select … where start_ip <= ip order by start_ip desc limit 1). Then have your code (not SQL) check end_ip to catch if the queried IP is actually in unallocated space. (And you’d get no row at all if passed an “IP” in the unallocated and possibly invalid land of, like, 0.1.2.3.) Looking for whichever range ‘should’ contain the IP then becomes picking one item out of a sorted collection, just like the binary search was.
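For concreteness, here’s a sketch of that query shape in Python with an in-memory SQLite table. The schema, column names, and sample values are made up for illustration (the post doesn’t show the real table); the point is just the “highest start_ip <= ip, then check end_ip in code” pattern:

```python
import sqlite3

# Hypothetical table of IP ranges stored as integers (not the post's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ranges (start_ip INTEGER, end_ip INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO ranges VALUES (?, ?, ?)",
    [
        (16777216, 16777471, "net-a"),      # 1.0.0.0 - 1.0.0.255
        (3232235520, 3232301055, "net-b"),  # 192.168.0.0 - 192.168.255.255
    ],
)
conn.execute("CREATE INDEX idx_start ON ranges(start_ip)")

def lookup(ip: int):
    # Candidate row with the highest start_ip <= ip; with the index this is
    # one descent down the b-tree, not a scan of every matching row.
    row = conn.execute(
        "SELECT start_ip, end_ip, name FROM ranges "
        "WHERE start_ip <= ? ORDER BY start_ip DESC LIMIT 1",
        (ip,),
    ).fetchone()
    # end_ip is checked here, in application code: the IP may sit in a gap
    # between ranges, in which case the candidate row doesn't actually cover it.
    if row is not None and ip <= row[1]:
        return row[2]
    return None
```

Whether SQLite’s planner picks the index for the real table is the part you’d have to verify with EXPLAIN QUERY PLAN.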
By “sparsely-populated” above I don’t mean nothing exists between in-RAM structures and DBs, just that arrays and structures in RAM are everywhere and full-on databases are everywhere, but you have to dig a little more for the rest. If someone has good projects to mention here, awesome.
It’s likely that more database-pieces wouldn’t have won here. The on-disk option lost by a lot to a binsearch of an array in RAM, and probably even a Really Good lookup, cached, etc. would lose badly in a test where it ever has to hit disk. Fast disks are still a lot slower than RAM!
I’m curious how SQLite in in-memory mode would have worked (it seems unfair to compare a disk-based store to RAM), and whether https://www.sqlite.org/rtree.html would have helped at this scale.
Edit: my memory of R trees seems to have gotten mixed up. They’re for when data is points and queries are ranges. This post is about a situation where queries are points and data is ranges.
Great article, I hope there’s a follow-up article in which we get to know if the OOM problems definitely went away!
I now also really want to read the guide that she links in the article on how to use pprof.
PS: another possible optimization might be to use a very lean OS in the VM hosting the webservice (one of the BSDs?) so that in theory you could allocate even more memory to the webservice.