This is great timing, I’ve been doing a lot of research on wide timeseries events. The Honeycomb StrangeLoop talks are quite good.
The one piece I don’t understand here is the “globally sorted” aspect. Does that mean all of the actual datapoints are globally sorted? If so, doesn’t that mean that queries for a time range will be quite inefficient?
(One of the creators of arcticDB here)
Great question! The answer is: it depends, but overall it’s helpful, because the layout is optimized for scanning large amounts of data. It’s not as optimal for scanning small amounts of data, but that’s OK too, because the data involved is small, so we can still answer those queries with low latency.
Because the data is sorted, we can essentially skip through the b-tree of blocks of data using something that closely resembles binary search (maybe there’s a name for it that I don’t know).
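To make that concrete, here is a rough sketch (in Go, with made-up block and timestamp fields; not arcticDB’s actual code) of how sorted, non-overlapping blocks let a time-range query binary-search to the first relevant block and skip everything before and after it:

    package main

    import (
        "fmt"
        "sort"
    )

    // block describes the min/max timestamps covered by one block of rows.
    // (Hypothetical struct, purely for illustration.)
    type block struct {
        minTS, maxTS int64
    }

    // blocksForRange returns the contiguous run of blocks that may overlap
    // [start, end]. Every other block is never read.
    func blocksForRange(blocks []block, start, end int64) []block {
        // Binary search: first block whose max timestamp reaches the query start.
        lo := sort.Search(len(blocks), func(i int) bool {
            return blocks[i].maxTS >= start
        })
        // Blocks are sorted, so stop as soon as a block begins after the query end.
        hi := lo
        for hi < len(blocks) && blocks[hi].minTS <= end {
            hi++
        }
        return blocks[lo:hi]
    }

    func main() {
        blocks := []block{{0, 99}, {100, 199}, {200, 299}, {300, 399}}
        // A query for t in [150, 250] only touches 2 of the 4 blocks.
        fmt.Println(blocksForRange(blocks, 150, 250))
    }

The point is that a small time-range query only pays for the blocks it actually overlaps, while a large scan benefits from the sequential, sorted layout.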
Hmm it’s still not clear to me exactly why this is advantageous. I think adding an example query in the Immutable & Sorted section would help illustrate why it’s helpful.
Another question I have is the handling of labels. I’m not very familiar with Parquet, does this dict contain all of the labels for a particular datapoint?
Agreed. I think this deserves some visualizations!
Regarding the column definition: we built the dynamic parquet package on top of Parquet; it’s not part of standard Parquet, but it uses standard Parquet underneath. What this column definition does is allow any column that is prefixed with labels.* and has the same type and encoding to be accepted, and during compaction it fills buffers that did not have that specific column with nulls. It’s different from a dict in that each “dict” of labels is expanded into a column per key.
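For anyone else following along, here is a tiny illustration (hypothetical Go, not the real dynamic parquet API) of what “a column per label key” plus null-filling looks like when batches written with different label sets are merged:

    package main

    import (
        "fmt"
        "sort"
    )

    // row maps column names (e.g. "labels.region") to values.
    type row map[string]string

    // unionColumns collects every labels.* column seen across the rows,
    // sorted so the merged schema is deterministic.
    func unionColumns(rows []row) []string {
        seen := map[string]bool{}
        for _, r := range rows {
            for col := range r {
                seen[col] = true
            }
        }
        cols := make([]string, 0, len(seen))
        for col := range seen {
            cols = append(cols, col)
        }
        sort.Strings(cols)
        return cols
    }

    func main() {
        // Two "batches" written with different label sets.
        rows := []row{
            {"labels.namespace": "prod", "labels.pod": "api-0"},
            {"labels.namespace": "dev", "labels.container": "sidecar"},
        }
        cols := unionColumns(rows)
        fmt.Println("merged schema:", cols)
        // After compaction every row has a value for every column; columns a
        // row never had are null (shown here as "<null>").
        for _, r := range rows {
            out := make([]string, len(cols))
            for i, col := range cols {
                if v, ok := r[col]; ok {
                    out[i] = v
                } else {
                    out[i] = "<null>"
                }
            }
            fmt.Println(out)
        }
    }

So instead of one dictionary column holding all labels, each label key gets its own labels.<key> column, and rows that never had that key simply carry nulls there.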
Awesome, thanks for the info! I’ll dig into the code a bit to better understand the dynamic parquet. I’d love to keep in touch as I’m working on something in a similar space (structured log search).
Could you please share the talks you mentioned?
Here they are:
https://youtu.be/tr2KcekX2kk
https://youtu.be/EfL1Fs9PF2Y