It's interesting that you are looking at this topic. At $dayjob, we have been talking about this recently too. Together with erasure coding, I think these are very important features to have in a storage system.
Very excited to read this! Rsync has always been sort of black magic to me. I’ve implemented file-based deduplication a couple of times, but would like to extend it to work better with similar files or multiple edits of a file.
This seems like something that would make a good reusable library: you give it a file, it chunks it up, adds the chunks to a (sqlite?) database, and gives you an identifier, and then you can use the ID to recover the file. Even better, you can get the list of chunk digests of two files and use that to do delta compression a la rsync, roughly as sketched below.
(Damn, now I want to write this…)
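A minimal sketch of that library idea, assuming a sqlite backend; every name here is hypothetical, and naive fixed-size chunks stand in for a real CDC chunker:

    # Hypothetical "chunk store" sketch -- fixed-size chunks for brevity; a CDC
    # chunker would slot in where the fixed-size read happens.
    import hashlib, sqlite3

    CHUNK_SIZE = 64 * 1024

    def open_store(path="chunks.db"):
        db = sqlite3.connect(path)
        db.execute("CREATE TABLE IF NOT EXISTS chunks (digest TEXT PRIMARY KEY, data BLOB)")
        db.execute("CREATE TABLE IF NOT EXISTS files (file_id TEXT PRIMARY KEY, manifest TEXT)")
        return db

    def put_file(db, path):
        digests = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                d = hashlib.sha256(chunk).hexdigest()
                db.execute("INSERT OR IGNORE INTO chunks VALUES (?, ?)", (d, chunk))  # dedup
                digests.append(d)
        manifest = "\n".join(digests)
        file_id = hashlib.sha256(manifest.encode()).hexdigest()
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (file_id, manifest))
        db.commit()
        return file_id

    def get_file(db, file_id, out_path):
        (manifest,) = db.execute("SELECT manifest FROM files WHERE file_id = ?", (file_id,)).fetchone()
        with open(out_path, "wb") as out:
            for d in manifest.splitlines():
                (data,) = db.execute("SELECT data FROM chunks WHERE digest = ?", (d,)).fetchone()
                out.write(data)

Delta compression then falls out of the manifests: diff the two digest lists and only transfer chunks the other side doesn't already have.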
So I stumbled across this Google FastCDC implementation a few weeks ago, and it says it's both different from and more efficient than rsync.
That was news to me, as I always thought “rsync rolling checksum” == “content-defined chunking”, but apparently that is NOT the case. I think maybe I was confused because I read about both at roughly the same time (more than 10 years ago).
I guess with a rolling checksum the blocks are still fixed length, while content-defined chunks are variable length? (edit 5 minutes later: yes, now this seems obvious to me. I can see rsync still works to some extent in the presence of deletions and insertions, but CDC should be faster.)
https://news.ycombinator.com/item?id=34303497
Hacker News comments seem to agree … i.e. nobody took issue with their claims. Also, the FastCDC technique comes from a 2016 paper, so rsync could not have used it. (I seem to remember rsync was the author's Ph.D. thesis back in the '90s.)
https://github.com/google/cdc-file-transfer
The remote diffing algorithm is based on CDC. In our tests, it is up to 30x faster than the one used in rsync (1500 MB/s vs 50 MB/s).
What makes rsync relatively slow is the “no match” situation where the rolling hash does not match any remote hash, and the algorithm has to roll the hash forward and perform a hash map lookup for each byte. rsync goes to great lengths optimizing lookups.
cdc_rsync does not use fixed-size chunks, but instead variable-size, content-defined chunks. That means chunk boundaries are determined by the local content of the file, in practice a 64 byte sliding window. For more details, see the FastCDC paper or take a look at our implementation.
In retrospect, that Github README is more informative to me than this post … (and also has good diagrams). This post does not mention rsync at all; it only talks abstractly about “fixed size chunking”.
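To make the README's "boundaries from local content" idea concrete, here is a toy content-defined chunker (my own sketch, not Google's or FastCDC's code; the gear-style table, mask, and size limits are all illustrative choices). Because the 64-bit hash is updated with a shift, only roughly the last 64 bytes influence it, so cut points depend only on a small sliding window and survive insertions or deletions elsewhere in the file:

    # Toy content-defined chunker: a gear-style rolling hash is updated per byte,
    # and a boundary is declared whenever its low bits hit zero.
    import random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random table
    MASK = (1 << 13) - 1                                  # ~8 KiB average chunk
    MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024

    def cdc_chunks(data):
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & ((1 << 64) - 1)
            size = i - start + 1
            if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]                            # trailing partial chunk

Hashing each chunk (e.g. with SHA-256) then gives the digests the dedup and delta machinery works with; identical runs of content produce identical chunks even when their byte offsets differ.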
A rolling checksum is a great way to implement CDC, but that’s not how rsync uses it.
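For contrast, how rsync does use it, as a toy sketch (simplified from the scheme in the rsync paper, with an illustrative block size and with the strong-hash verification omitted; not rsync's actual code): the receiver checksums fixed-size blocks of its old file, and the sender slides a window over the new file one byte at a time, rolling the weak checksum and probing the table at every offset.

    M = 1 << 16
    BLOCK = 4096

    def weak_checksum(data):
        a = sum(data) % M
        b = sum((len(data) - i) * x for i, x in enumerate(data)) % M
        return a, b

    def find_matches(old, new, block=BLOCK):
        # Receiver side: checksum every fixed-size block of the old file.
        table = {}
        for off in range(0, len(old) - block + 1, block):
            table.setdefault(weak_checksum(old[off:off + block]), off)
        # Sender side: slide over the new file one byte at a time.
        i = 0
        a, b = weak_checksum(new[:block])
        while i + block <= len(new):
            if (a, b) in table:
                yield i, table[(a, b)]   # candidate match (rsync then verifies with a strong hash)
                i += block               # jump past the matched block
                if i + block <= len(new):
                    a, b = weak_checksum(new[i:i + block])
                continue
            if i + block == len(new):    # reached the end, nothing left to roll in
                break
            # The "no match" path: roll forward one byte and probe the table again.
            # This per-byte loop is what the cdc_rsync README calls the slow case.
            out, inn = new[i], new[i + block]
            a = (a - out + inn) % M
            b = (b - block * out + a) % M
            i += 1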
@snej - you might find this (incomplete) package interesting: https://github.com/c-blake/ndup
I’m not quite sure why buzhash isn’t mentioned more often in the context of CDC/rsync algorithms. It’s quite a bit simpler to both explain and implement than the Rabin fingerprinting mentioned in the above article.
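The buzhash idea really is compact: a fixed table of random words, with the window hash being the XOR of each byte's table entry rotated by its age, so sliding the window costs one rotate and two XORs. A rough sketch (illustrative window size and table, not any particular implementation):

    import random

    random.seed(1)
    T = [random.getrandbits(32) for _ in range(256)]   # fixed random table
    WIN, BITS = 48, 32

    def rol(x, n):
        n %= BITS
        return ((x << n) | (x >> (BITS - n))) & ((1 << BITS) - 1)

    def buzhash_init(window):
        h = 0
        for k, byte in enumerate(window):              # oldest byte rotated the most
            h ^= rol(T[byte], len(window) - 1 - k)
        return h

    def buzhash_roll(h, oldest, newest, win=WIN):
        # O(1) per byte: rotate, XOR out the leaving byte, XOR in the entering byte.
        return rol(h, 1) ^ rol(T[oldest], win) ^ T[newest]

    # Self-check: rolling matches recomputing from scratch.
    data = bytes(random.getrandbits(8) for _ in range(1000))
    h = buzhash_init(data[:WIN])
    for i in range(WIN, len(data)):
        h = buzhash_roll(h, data[i - WIN], data[i])
        assert h == buzhash_init(data[i - WIN + 1:i + 1])

A CDC chunker can use this in place of the gear hash sketched earlier: cut a chunk whenever the rolled hash meets the boundary mask.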
Bitbottle also uses a variant of buzhash, and it seems like a great system. I’m not sure why it’s not mentioned often in the new surge of interest in rolling hashes. :(
Hm, this kinda sounds like Plan 9’s Venti on steroids.
Yeah, I remember reading about that. It probably influenced my decision to store blobs/attachments the same way in Couchbase Mobile. (Of course the drawback is you have to GC periodically. Or else keep buying more disks.)
Looking forward to more pieces. I only recently discovered CDC, and it seems like a really interesting topic.
In 2014 I read how rsync worked and the idea of CDC occurred to me after a few days of thinking about rolling hashes. I cooked up a tiny Java example, which I just checked and still have. I’ll post it somewhere if there’s any interest and you all promise not to laugh at my ten-year-old, written-for-an-audience-of-one, throwaway PoC Java :)