I used to work in the geospatial space and around 10 years ago we build a spatial indexing system using z-order curves a.k.a. Morton Curves (https://en.wikipedia.org/wiki/Z-order_curve) on top of HBase. We took advantage of the sorted nature of HBase and the way one can do scans through the data. The keys were using base-4 encoding, which allowed to “zoom-in” and “zoom-out” by simply changing the key prefix lenght of the scan we were doing.
Oh I can totally see how that would work too. Its easy to see the benefit of the Morton Curve in terms of how the algorithm to generate the keys and query ranges is so simple by comparison.
It has the same problem that the Hilbert Curve does where there are some areas that are close in terms of [x,y] but far away in terms of how far along the curve they are. So if you have a query rectangle which spans those areas, its possible it would be better to split it up into two or three rectangles.
For the visually inclined here’s a picture showing how the base-4-numbered morton curve quadrants would be laid out:
And just for comparison here is the same, but with the hilbert curve and base-10 numbers:
Can you develop the idea of base-4 encoding and zooming?
The z-order curve is really expressing a grid system using squares. Squares are further subdivided into squares etc. etc. You can take that as fine as you want, but typical is 7 or 8 digits (at least when we did it). The curve is calculated by interleaving the digitis (as integers) of coordinates. See Wikipedia for pictures.
That means base-4 tells you the sub-grid quite easily. All things that are in the same sub-grid have the same prefix in base-4 encoding and that logic works on any level. Imagine the world projected flat and you have one big square. The square can be divided into 4 more squares. Each of these can be further subdivided. Everything that starts with 0, is however in the first sub-square of our first level. If the interleaved value of the curve starts with a 1, we know that it can not be in the first sub-square. This goes on and on. So each digit in the number really becomes a zoom level.
Now if i want to do a spatial query i can calculate a prefix to scan for for a given bounding box and run the scan against a system like hbase. You basically tell Hbase where to start reading the data. The client can calculate the prefix by itself since the algorithm is so simple and static.
Does that make sense?
I implemented something like this circa 2015 and used the existing geohash system, which turns geographic coordinates into ASCII strings. According to that article it uses a Z-order curve.
Using geohash was my first thought when the author wrote about key/values.
It works fairly well in practice.
tangentially related, I love the aesthetics of this oberon OS screenshot with a hilbert curve
Are there any benefits in terms of data locality for this method vs. sequential file storage/access? It would be nice to see some benchmarks if so.
Yeah, it would be fun to try some benchmarks. My intuition tells me that the method I’ve developed is mostly useful for large universes with lots of data and queries of different sizes and dimensions, tall, wide, small, large, etc. Or if you want to begin indexing data without knowing what the queries will look like in the future.
If you only have to support one kind of query, like a square of a certain size, indexing the data as [x,y] or [y,x] with the first, most significant dimension rounded to approximately the size of your query rectangle would probably be just as performant for reads. You would do multiple range queries over those “rows” or “columns” like I described in my “naive approach” example. But as soon as the size of the queried area can vary by orders of magnitude this approach starts to break down, and you start seeing an unnecessary trade-off between wasted IO bandwidth and number of range queries.
Ok, I couldn’t help myself, I did do the benchmarks: https://sequentialread.com/building-a-spatial-index-supporting-range-query-using-space-filling-hilbert-curve/#benchmarks
Nice, thanks, that’s awesome!
Can you give a little interpretation on this, though? The “number of range queries per rectangle” is high while the query time is small for small to tiny queries. This means that the Hilbert curve method makes many range queries in a rectangle compared to the sliced method but still outperforms it by 2x-7x?
I think it’s because bandwidth is the limiting factor in my test in that case. Look at the “wasted bandwidth” metric for the tiny queries, the query time graph looks a lot like the wasted bandwidth graph. The CPU also has to do work to throw out all the entries outside the rectangle. A wasted bandwidth metric of 2 means that there were twice as many entries found outside of the query rectangle than there were entries found inside. And it goes up to 18 in the worst case!! in other words, for the 32-pixels-wide slice (universe is 16 slices tall) only about 5% of the data that the query was hitting was actually inside the rectangle.
I will say I did this benchmark in 1 day and I didn’t try to optimize it much so its possible that there are other bottlenecks getting in the way of the accuracy of the results. So take it with a grain of salt.
The main thing I wanted to show was how the space filling curve method works no matter what the scale of the query is. Even though they look different because the graphs have different scales, the amount of wasted bandwidth and range-queries-per-rectangle stays constant across all the different query sizes for the hilbert index.
Also, you can tune its performance characteristics a little bit at query time – in other words different kinds of queries can be tuned individually with no extra cost. While the sliced method outperforms the space filling curve method when the slice size is tuned for the query, the problem is you have to re-index the data every time you want to tune it to a different slice size.
What do you mean by sequential file storage/access? Do you mean scanning through the whole spatial index data?