What’s the use case for this? If you have the file, why not just read it directly?
To present a unified query interface across different data sources. The core of the query logic is actually factored out as an independent crate, so it can be used in embedded mode where downstream applications want to handle small files directly at runtime.
But for larger datasets that require a lot of memory to hold, you would want to load them once behind an API so downstream services don’t each need to do that. Query results are usually much smaller than the dataset itself. There is also the case where the dataset owner wants to manage dataset updates in a single place instead of embedding the files (or URIs) into downstream services; serving those datasets through an API is a good way to do that.
Once this starts up, will it reflect changes to the underlying data sources, i.e. if a separate process updates the backing CSV/JSON/Google Doc? I’m wondering whether this would be useful for an event-sourced read API.
Currently it assumes the dataset is static, so a redeploy is required to pick up changes to the dataset.
I do want to add support for consuming updates from the dataset in a streaming fashion, effectively supporting the event-sourced read API use case you proposed. It should be pretty straightforward to implement: we just need to subscribe to the stream and append new data as Arrow record batches to the in-memory table.