The demo is in Ruby, but there’s an easy way to do this in Python too. The built-in zipfile library accepts “file-like objects” in addition to file paths — anything that exposes a simple read/seek/tell interface. With a class that issues range requests behind the scenes, you can grab just the portion of the archive you need, with none of the zip-specific code (like the directory header bytes) in the linked demo.
Here is a complete example I have used in production.
I’ve implemented something similar, just to add seek support for file objects backed by HTTP. Maybe urllib itself should support mapping seek to range requests, so Python developers don’t need to reinvent the wheel on their own.
That is actually exceptionally clean:
collection = zipfile.ZipFile(remote.RemoteFileObject(collection_url))
All it needs is read/seek/tell, as if it were a file, and it uses range requests underneath. Clever!
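For the curious, a minimal sketch of what such a wrapper might look like. This is not the linked RemoteFileObject implementation — the class name and details are mine — and it assumes the server honors Range headers and reports Content-Length. It does one HEAD up front to learn the size (which, as noted further down the thread, suffix ranges can avoid):

```python
import io
import urllib.request


class HTTPRangeFile(io.RawIOBase):
    """File-like object that maps read/seek/tell onto HTTP range requests.
    Illustrative sketch only; assumes the server supports Range headers."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # One HEAD request to learn the total size (needed for SEEK_END).
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            self.length = int(resp.headers["Content-Length"])

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.length + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.length - self.pos
        if size == 0 or self.pos >= self.length:
            return b""
        end = min(self.pos + size, self.length) - 1  # Range end is inclusive
        req = urllib.request.Request(
            self.url, headers={"Range": f"bytes={self.pos}-{end}"}
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.pos += len(data)
        return data
```

With that in place, `zipfile.ZipFile(HTTPRangeFile(url))` should fetch only the central directory plus whichever members you actually extract.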
Related: GDAL has a really neat virtual filesystem feature that allows you to prepend paths with one or more of /vsi*/ strings to read from remote URLs, peer inside archives, and talk to blob stores on various public cloud providers. Very handy in combination with new designs for cloud-optimized data formats that provide DB-like performance over dumb pipes.
HTTP range requests are underused! They have a lot of untapped potential.
FYI, you can express a range relative to EOF, so you don’t have to do a HEAD first to find the content-length.
I actually thought it might be, but assumed, since it wasn’t an example in the Mozilla docs, that it wasn’t supported. Pretty relevant example to leave out!
Should have read the RFC instead, which clearly specifies the format of the range:
byte-range-spec = first-byte-pos "-" [ last-byte-pos ]
For curl it would just be -H "Range: bytes=<number>-" then.
I feel like updating the post now!
I just realised I was doubly wrong trying to figure this out. If I’d used the format above, I would still need to know the initial content-length in order to count backwards from the EOF.
Instead this is the one I was looking for:
A client can request the last N bytes of the selected representation
using a suffix-byte-range-spec.
suffix-byte-range-spec = "-" suffix-length
suffix-length = 1*DIGIT
And a corresponding curl header for only the last 100 bytes would be -H "Range: bytes=-100".
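The same suffix range works from Python with urllib — no HEAD request needed. A quick sketch (the function name is mine); a server that honors the range answers 206 Partial Content, while a plain 200 means it ignored the header and sent the whole body:

```python
import urllib.request


def fetch_last_bytes(url, n):
    """Fetch only the last n bytes of a resource using a
    suffix-byte-range-spec, i.e. "Range: bytes=-n"."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{n}"})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 200 + 6:  # 206 Partial Content
            raise RuntimeError("server ignored the Range header")
        return resp.read()
```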
Hm, that format sounds pretty iffy. If there’s no way to find out where the central directory starts, because the comment at the end is variable-length, it sounds to me like it should be relatively easy to attack zip file readers by embedding an end-of-central-directory record in the comment itself. You can create confusion between different readers, because it seems to me a correct reader would start from the beginning, read the entry headers, skipping over entries until reaching the end, and only then read the directory record.
I guess they decided to put the directory at the end because that makes it easy to add new files to an existing ZIP file without rewriting the entire file. But the variable-length comment is just asking for trouble, IMO.
Yeah, that sounds unfortunate…
In general, I’m not opposed to putting the index at the end of the file. It makes modification much easier as you say, plus it makes it possible to create the whole file in a streaming fashion. But if you’re putting the index at the bottom, you really ought to make sure it’s actually at the bottom, and make putting anything after it a hard error. It should be possible to seek to the directory without reading through the file.
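To make the ambiguity concrete, here is a sketch of the backwards scan that readers typically do to locate the end-of-central-directory record (my own illustrative code, not any particular library’s): search the last 22 + 65535 bytes for the signature, and check that the record’s own 16-bit comment-length field reaches exactly to EOF. A comment that embeds a plausible-looking record satisfying that check can steer such a scanner to the wrong directory:

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_MIN = 22            # size of the fixed part of the EOCD record
MAX_COMMENT = 0xFFFF     # the comment length field is 16 bits

def find_eocd(data):
    """Return the offset of the end-of-central-directory record,
    scanning backwards from EOF, or -1 if none is found."""
    tail = data[-(EOCD_MIN + MAX_COMMENT):]
    pos = tail.rfind(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_MIN <= len(tail):
            # Candidate is only valid if its comment-length field
            # (2 bytes at offset 20) reaches exactly to end-of-file.
            (comment_len,) = struct.unpack("<H", tail[pos + 20:pos + 22])
            if pos + EOCD_MIN + comment_len == len(tail):
                return len(data) - len(tail) + pos
        # Signature bytes were inside a comment; keep scanning backwards.
        pos = tail.rfind(EOCD_SIG, 0, pos)
    return -1
```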
Wow! Different approach than the one I took for the Gemini protocol. If you go to gemini://gemini.conman.org/test/UCSD-Pascal-source.zip you get the ZIP file, but if you add a ‘/’ to the end of that—gemini://gemini.conman.org/test/UCSD-Pascal-source.zip/ you get a file listing that you can drill down into.
And a web-based proxy to that final link: https://portal.mozz.us/gemini/gemini.conman.org/test/UCSD-Pascal-source.zip/