I find that it’s better to split files into blocks based on their content (cutting wherever a rolling checksum hits a certain value) rather than at a fixed size, though you can easily add min/max block sizes. This way there’s no need to scan both source and destination with the rolling checksum: simply send strong hashes of the individual blocks, then send any blocks the destination doesn’t have. Changing an individual byte, or even adding or removing a few of them, will only change two or three blocks around the edit; the rest remain the same.
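To make the idea concrete, here’s a toy sketch of content-defined chunking in Python. The window width, divisor, and min/max sizes are made-up illustrative parameters, and the additive window sum is just the simplest possible rolling checksum — not what any particular implementation uses:

```python
import hashlib
import random

# Toy parameters for illustration; a real system would tune all of these.
WINDOW = 16              # rolling-window width in bytes
DIVISOR = 64             # cut when the window sum % DIVISOR == 0 (~64-byte avg)
MIN_SIZE, MAX_SIZE = 24, 256   # clamp chunk sizes

def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries via a sliding window sum."""
    chunks, start, acc = [], 0, 0
    for i in range(len(data)):
        acc += data[i]
        if i - start >= WINDOW:
            acc -= data[i - WINDOW]        # slide the window forward
        size = i - start + 1
        if (size >= MIN_SIZE and acc % DIVISOR == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, acc = i + 1, 0          # begin a fresh chunk
    if start < len(data):
        chunks.append(data[start:])        # trailing partial chunk
    return chunks

def strong_hashes(data: bytes) -> set[str]:
    """Strong (SHA-256) hashes of each block, as would be sent to the peer."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}

rng = random.Random(42)
original = bytes(rng.randrange(256) for _ in range(20000))
edited = bytearray(original)
edited[5000] ^= 0xFF                       # flip one byte mid-file
shared = strong_hashes(original) & strong_hashes(bytes(edited))
```

Because boundaries depend only on the bytes inside the sliding window, the chunker resynchronizes shortly after the edit, so nearly all block hashes in `shared` match and only the couple of blocks around the change would cross the wire.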
If you then take your file index (a list of hashes for file X) and apply the same process iteratively, you effectively create something akin to a Merkle tree, and updates to even large files become very small over the wire, with the receiver only needing to fetch the updated blocks of the file and/or index.
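The iterated step can be sketched in a few lines. This is a generic hash-tree illustration, not how TarSnap or tangram actually lay out their indexes; the fan-out of 4 and the leaf values are arbitrary:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(leaves: list[bytes], fanout: int = 4) -> list[list[bytes]]:
    """Repeatedly hash groups of hashes until a single root remains."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(b"".join(prev[i:i + fanout]))
                       for i in range(0, len(prev), fanout)])
    return levels

# Leaf hashes stand in for the per-block strong hashes in the file index.
leaves = [h(bytes([i])) for i in range(64)]
tree_a = merkle_levels(leaves)

edited_leaves = leaves.copy()
edited_leaves[10] = h(b"edited block")     # one block of the file changed
tree_b = merkle_levels(edited_leaves)
```

Comparing `tree_a` and `tree_b` level by level, only the path from the edited leaf up to the root differs; every sibling subtree hashes identically, which is why the receiver only has to pull the changed index nodes and blocks.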
Edit: While I did think this up independently, I believe it’s what TarSnap does, so I don’t claim it to be original :)
I have a C library that does this, based on the rsync algorithm: hashchop
It’s a core part of a content-addressable store / distributed de-duplicating filesystem project I’ve been working on for a while. I have a rough prototype in Lua, called tangram, though I plan on rewriting it as a C library soon.
Mine also uses “something akin to a Merkle tree” (which I call a Jumprope). It was described, somewhat, in a talk I did at Strange Loop, but the Lua code in tangram is the reference implementation.
Very interesting, cheers!
I watched the talk, and I definitely would like to see more details about the jump-rope structure. My ideas about splitting the blocklists aren’t as well thought-out as yours appear to be :)