I really enjoyed the previous entries in this series, and I’m hoping to someday get around to trying (at least) FastCDC for delta-compression.
I found it more difficult to understand what RapidCDC and QuickCDC were doing in this post. Maybe I’m not reading closely enough, but I can see talk of storing previously-seen chunk sizes, and presumably they’re used to predict future chunk sizes, though I can’t quite imagine exactly how that works. The previous entries had nice diagrams and source-code snippets that really helped.
That said, it sounds like RapidCDC and QuickCDC don’t offer much benefit if you have to read the entire file anyway to verify that chunks do not contain inner modifications. Maybe I should just stick to FastCDC and ignore the others.
I found it more difficult to understand what RapidCDC and QuickCDC were doing in this post
Honestly, I found the same thing when reading the papers. They seem to build on all this previous knowledge of how chunking works (unlike the DDelta and FastCDC papers), and so (at least in my reading of them) they come across more as just “here are some optimizations to try to skip over a bunch of bytes”.
I can see talk of storing previously-seen chunk sizes, and presumably they’re used to predict future chunk sizes, though I can’t quite imagine exactly how that works
Yeah, that’s right. If, when we identify a chunk boundary, we have access to a list of the sizes of chunks previously seen to follow the current chunk, then we can use those sizes as hints for where the next chunk boundary is. The papers are quite good at spelling out how this works, but not so good at considering how to determine whether that chunk is a duplicate or unique. They provide a bunch of options, but none that really seem to work in all cases without also reading every byte.
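To make that a bit more concrete, here is a rough sketch (mine, not lifted from either paper) of the size-hint idea in Rust. `is_boundary` is a toy stand-in for the real rolling-hash test, and `size_hints` stands in for whatever table records the sizes of chunks previously seen to follow a given chunk:

```rust
use std::collections::HashMap;

/// Toy stand-in for the chunker's real boundary test (e.g. a Gear/FastCDC
/// rolling-hash condition evaluated over the window ending at `pos`).
fn is_boundary(data: &[u8], pos: usize) -> bool {
    pos > 0 && data[pos - 1] % 64 == 0
}

/// Find the next chunk boundary after `start`. We first try each recorded
/// "size of the chunk that followed this one last time" as a jump target,
/// and only fall back to scanning byte by byte if no hint pans out.
fn next_boundary(
    data: &[u8],
    start: usize,
    prev_fingerprint: u64,
    size_hints: &HashMap<u64, Vec<usize>>,
) -> usize {
    if let Some(hints) = size_hints.get(&prev_fingerprint) {
        for &size in hints {
            let candidate = start + size;
            // Accept a hint only if the boundary condition also holds there.
            // Deciding whether the resulting chunk is truly a duplicate is
            // exactly the part the papers leave fuzzy.
            if candidate <= data.len() && is_boundary(data, candidate) {
                return candidate;
            }
        }
    }
    // No usable hint: scan like an ordinary CDC pass.
    (start + 1..=data.len())
        .find(|&pos| is_boundary(data, pos))
        .unwrap_or(data.len())
}
```

When a hint hits, you skip straight over the bytes in between, which is where the speedup comes from; when it misses, you pay a small extra check and degrade to normal chunking.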
That said, it sounds like RapidCDC and QuickCDC don’t offer much benefit if you have to read the entire file anyway to verify that chunks do not contain inner modifications.
This was the realization I came to. And I’m worried I’m just missing something because this seems like a pretty big flaw in these two CDC algorithms. Then again, maybe there are some chunking use cases that don’t care so much about chunks being treated faithfully as duplicates 100% of the time, and 99.98% of the time is acceptable.
Maybe I should just stick to FastCDC
This would be my recommendation (and I nearly wrote that in the post). The nice thing about the optimizations in these papers is that they are all pretty orthogonal. Personally, I would start out with FastCDC and then investigate optimizing with techniques from these (and other) papers later down the line if it becomes necessary.
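And if it helps to see how small a starting point that is: assuming the Rust `fastcdc` crate’s `v2020` module (the field names below follow its documented example), the core chunking loop is only a few lines.

```rust
use fastcdc::v2020::FastCDC;

fn main() {
    // Any byte buffer works; a memory-mapped file is the usual next step.
    let contents = std::fs::read("some-file.bin").expect("read input");

    // Minimum, average, and maximum chunk sizes in bytes.
    let chunker = FastCDC::new(&contents, 16_384, 32_768, 65_536);

    for chunk in chunker {
        // Each chunk reports where it starts and how long it is; hashing
        // and deduplication are layered on top of this.
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
```

The RapidCDC/QuickCDC-style tricks would then slot in as an optimization layer over this loop rather than replacing it.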
Hi! I recently did a benchmark of various deduplicating solutions, i.e. programs that maintain some kind of “repo” for storing and deduplicating data, such as borg, casync, etc. Most of them are based on some kind of CDC.
Results are here: https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 (see the whole thread for context).
The results are very unfortunate. It turns out FOSS deduplicating solutions miss many optimizations: all of them are slower than they should be. So I had to develop my own fast solution called azwyon (it is also included in the benchmark above). However, azwyon uses fixed-size chunking as opposed to CDC, because that is what my problem needed.
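To show what I mean by fixed chunking, here is a minimal sketch of the approach (not azwyon’s actual code), assuming the `rayon` and `blake3` crates: split the input into equal-sized pieces, hash them in parallel, and count how many are unique.

```rust
use rayon::prelude::*;
use std::collections::HashSet;

/// Fixed-size chunking: cut the input into equal-sized pieces, hash each
/// piece in parallel, and report (total, unique) chunk counts. This is an
/// illustration of the approach only, not azwyon itself.
fn dedup_stats(data: &[u8], chunk_size: usize) -> (usize, usize) {
    // Hash every chunk in parallel; blake3 is one of the faster options
    // mentioned elsewhere in this thread.
    let hashes: Vec<[u8; 32]> = data
        .par_chunks(chunk_size)
        .map(|chunk| *blake3::hash(chunk).as_bytes())
        .collect();

    // Unique hashes are the chunks a deduper would actually have to store.
    let total = hashes.len();
    let unique: HashSet<[u8; 32]> = hashes.into_iter().collect();
    (total, unique.len())
}
```

The trade-off is that with fixed chunking an insertion shifts every later boundary, which is exactly what CDC avoids; for my workload that is acceptable.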
Okay, so here is a list of problems with existing CDC-based solutions such as borg and casync. I will use “extract” and “store” as the names for the basic operations of such solutions.
Borg. Not parallel. Uses slow SHA-256 as its hash. It also supports blake2, but that is not the default and is undiscoverable, i.e. the documentation doesn’t advertise that you can speed up borg by switching to blake2. blake3 is even faster, but it is not supported. Also, extracting always verifies hashes, and this cannot be disabled.
UPD: as of 2023-08-10 the borg2 docs don’t say that blake2 mode is faster than no encryption. The docs for borg1 do say this.
UPD: SHA-256 is slower than blake2 on my machine. Some machines have special instructions for SHA-256, so SHA-256 is faster there.
Casync. Not parallel. For unknown reasons, storing is something like 5x slower than in borg.
Rdedup. For unknown reasons, extracting is slower than with desync. Also, as you can see in the benchmark, storing with FastCDC at a 64K block size turns out to be very slow, about 10x slower than with a 4096K block size. It may be a bug in rdedup or a problem with the FastCDC algorithm itself.
Zpaq. Despite being a very old, trusted solution, it turns out to be very slow in its default mode compared to the alternatives, while offering only a marginally better compression ratio.
So it seems the current FOSS situation with dedupers is quite bad. All of them can easily be improved in terms of speed, or a new fast solution can be created that will outperform the existing ones.
In fact, my azwyon is 200 lines of Rust code (not counting my thread-pool library). It is fast, simple, and parallel. With minimal edits it could gain CDC support, which will make it outperform all other FOSS dedupers.
I used FastCDC as the basis for deltification in git9. RapidCDC and QuickCDC look like significant improvements.
Looks like it’s time to do a pass over that code and speed it up.
For context: because git9 doesn’t do delta reuse, deltification speed matters more than it does in Torvalds’ git. It’s not measurable for most operations on most repos, but on fresh clones of very large repos it can add some extra minutes.
FastCDC made it practical to avoid the mess and complexity around delta reuse. These updates may make it unnoticeable in all cases.
Oh this is a very cool concept. I’ve heard people talk about wanting this kind of thing a time or two, for chunking and deduplicating files in some way better than just doing fixed-size pieces. But nobody involved ever really knew what to even look for.