I used to do that for ImageOptim app tarballs. I’ve even written a genetic algorithm that searched for optimal similarity criteria for creating chains of similar files.
The 80/20 version of this is:
find . -type f | rev | sort | rev | tar c --files-from=-
That sorts by file extension/name, which tends to group similar files together.
Parsing ls is dangerous, but it does have a handy -X option to sort files by extension (might be GNU-specific).
Related: Gwern has written up his real-world experience with file sorting to improve compression.
Archivers with a solid-archive option, like RAR, have grouped files by extension for a long time.
git does similar ordering inside packfiles before compression: https://www.git-scm.com/docs/pack-heuristics https://git-scm.com/docs/git-repack (search for sort)
Great article! Reminds me of 7-Zip a long time ago releasing a feature to sort files by file extension before compression.
By the way, your method 7 where you swap files and keep the swap if there’s improvement is similar to a discrete optimization strategy called “local search”: https://en.wikipedia.org/wiki/Local_search_(optimization). It is a strategy that tries to find feasible (correct) and optimal (lowest cost) solutions to problems with very large search spaces with no efficient solution, for example the Traveling Salesperson Problem, the Knapsack problem, etc.
Local search was by far my favorite part of a Discrete Optimization course on Coursera: https://www.coursera.org/learn/discrete-optimization/. There is a surprising range of “metaheuristics” you can apply to do better; hill climbing, tabu search, and restarts are my favorites.
Applying these metaheuristics to your method 7 would be an interesting experiment!
Several of the solutions to this fun CodeGolf.SE challenge use that swap-and-test strategy, with large numbers of iterations, to permute the pixels of one image in a way that gives the best visual match to another image. (Two of the solutions there are mine).
I don’t think there’s any benefit to the locality constraint of only swapping adjacent files — if a file’s optimal position is far from its current position, the constraint will at best slow the convergence down, and maybe even completely forbid it because there’s an intermediate state with a worse score. Might as well just choose two distinct random indices and try swapping them; a good stopping condition is when you’ve failed to make an improvement on N successive tries.
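For concreteness, a minimal sketch of that random-swap search in Python, assuming the cost being minimized is the deflate-compressed size of the members concatenated in the candidate order (a stand-in for rebuilding and gzipping the tar); the function names and parameter values are made up for illustration:

import random
import zlib

def compressed_size(order, blobs):
    """Cost: deflate-compressed size of the members joined in this order.
    `blobs` is a hypothetical list of the raw bytes of each file."""
    return len(zlib.compress(b"".join(blobs[i] for i in order), 9))

def random_swap_search(blobs, patience=1000, seed=0):
    """Swap two random positions; keep the swap only if it shrinks the
    compressed size, and stop after `patience` consecutive failures."""
    rng = random.Random(seed)
    order = list(range(len(blobs)))
    best = compressed_size(order, blobs)
    failures = 0
    while failures < patience:
        i, j = rng.sample(range(len(order)), 2)      # two distinct indices
        order[i], order[j] = order[j], order[i]
        cost = compressed_size(order, blobs)
        if cost < best:
            best, failures = cost, 0                 # keep the improvement
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
            failures += 1
    return order, best

A real run would measure the rebuilt .tar.gz rather than a bare concatenation, but the loop is the same.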
Thanks for the CodeGolf link! Looks fun.
Swapping any pair of files is still a local search. A local search just means the next solution you explore is “close” to your current one. It does not constrain you to adjacent pairs of files in a tar file.
Again referencing the excellent Coursera course, here’s an explanation with respect to the car sequencing problem: https://www.coursera.org/lecture/discrete-optimization/ls-2-swap-neighborhood-car-sequencing-magic-square-2vaaM
A non-local search would be shuffling the full list of files. But this idea of throwing up your hands and needing a fresh start is a metaheuristic called “restarts”.
Stopping when you’ve failed to make an improvement on N successive tries is in itself a metaheuristic called “hill climbing”. Modifying N depending on how long you’ve spent on the hill, or on other factors, is “simulated annealing” (increase or decrease N over time depending on your goal).
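For reference, the more common formulation of simulated annealing keeps the same swap neighborhood but occasionally accepts a worsening swap, with a probability that falls as a temperature parameter is lowered. A sketch under that formulation, with a caller-supplied cost function and arbitrary default parameters:

import math
import random

def anneal_order(n_items, cost, t_start=256.0, t_end=1.0, steps=20000, seed=0):
    """Simulated annealing over orderings: always accept an improving swap,
    and accept a worsening swap with probability exp(-delta / T).  The
    temperature decays geometrically and is in the same units as the cost
    (bytes, if the cost is a compressed size)."""
    rng = random.Random(seed)
    order = list(range(n_items))
    current = best_cost = cost(order)
    best = list(order)
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)
        i, j = rng.sample(range(n_items), 2)
        order[i], order[j] = order[j], order[i]
        candidate = cost(order)
        delta = candidate - current
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate                       # accept, possibly worse
            if candidate < best_cost:
                best_cost, best = candidate, list(order)
        else:
            order[i], order[j] = order[j], order[i]   # reject: undo the swap
    return best, best_cost

It could be driven by the same compressed-size cost as the earlier sketch, e.g. anneal_order(len(blobs), lambda o: compressed_size(o, blobs)).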
Yeah, sorry, I do understand that, I was just discussing OP’s implementation :)
Absolutely. I started reading a bit of Essentials of Metaheuristics some time ago, and would love to spend some more time on it and try some of the options out.
It seems to me that the option where you permute then run swaps is a form of iterated local search, though perhaps there’s a subtlety that I’m missing?
I’d start with the Coursera course videos on local search and if you’re inclined try out the graph coloring problem using LS.
https://www.coursera.org/lecture/discrete-optimization/ls-1-intuition-n-queens-ZRJeF
Swapping pairs of files in the list is a form of local search, I agree. There is no such thing as a violating configuration here; all orderings of files are valid, so that part is much easier, and you are only optimizing the cost.
Also, I’m not suggesting local search will be the best solution, just that your article’s method 7 reminded me of it, and that you should especially try tabu search, hill climbing, and simulated annealing. When I did the graph coloring problem in the course these methods really helped, but graph coloring is a different problem.
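For completeness, a tabu-search sketch over the same swap neighborhood, again with a caller-supplied cost function; the tenure, sample count, and step count are arbitrary placeholders:

import random

def tabu_swap_search(n_items, cost, steps=2000, tenure=50, samples=20, seed=0):
    """Tabu search over orderings: at each step sample a few candidate swaps,
    apply the best one that is not tabu (or that beats the best-so-far, the
    usual aspiration rule), and forbid undoing that swap for `tenure` steps."""
    rng = random.Random(seed)
    order = list(range(n_items))
    best_cost = cost(order)
    best = list(order)
    tabu = {}                       # (i, j) -> step until which this swap is tabu
    for step in range(steps):
        candidates = []
        for _ in range(samples):
            i, j = sorted(rng.sample(range(n_items), 2))
            order[i], order[j] = order[j], order[i]
            candidates.append((cost(order), i, j))
            order[i], order[j] = order[j], order[i]
        for c, i, j in sorted(candidates):
            if tabu.get((i, j), -1) < step or c < best_cost:
                order[i], order[j] = order[j], order[i]   # move, even if worse
                tabu[(i, j)] = step + tenure
                if c < best_cost:
                    best_cost, best = c, list(order)
                break
    return best, best_cost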
This is an interesting idea, thanks for posting it.
Have you tried combining the reordered results with running Zopfli with a large number of trials?
Edited to change “comparing” to “combining”; I had meant the latter but written the former.
The child posts by @Boojum, @jibsen, and @hyperpape precede this edit.
Edited again to add another child post in the preceding paragraph.
Given what was done here and what I know of Zopfli, I wouldn’t expect them to be alternative optimizations to each other. Rather, I’d expect they’d be very complementary and that each would make the other more effective. I suspect you’ll get better results from combining them than each approach can give individually.
Indeed, Gzip and Zopfli use the deflate compressed data format, which (along with entropy coding) looks for repeated sequences (matches) in a 32k sliding window.
So imagine two identical small files (< 16k) inside your tar archive: if they are more than 32k apart, both Gzip and Zopfli have to compress the data twice. If, on the other hand, the two files are right next to each other, the entire second file can be encoded very compactly using matches into the first file.
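This is easy to demonstrate with Python’s zlib (zlib framing rather than the gzip container, but the same deflate algorithm and 32 KiB window); the numbers it prints are illustrative, not taken from the article:

import os
import zlib

member = os.urandom(8 * 1024)          # one 8 KiB "file", present twice
filler = os.urandom(48 * 1024)         # incompressible padding between copies

adjacent  = member + member            # both copies inside the 32 KiB window
far_apart = member + filler + member   # copies more than 32 KiB apart

for name, data in (("adjacent", adjacent), ("far apart", far_apart)):
    # Marginal cost of the second copy: size with it minus size without it.
    with_copy    = len(zlib.compress(data, 9))
    without_copy = len(zlib.compress(data[:-len(member)], 9))
    print(f"{name:9}: second copy adds ~{with_copy - without_copy} bytes")

With the copies adjacent, the second one costs only a handful of bytes of match tokens; pushed outside the window, it costs roughly the full 8 KiB again.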
I didn’t. According to Wikipedia, Zopfli gives 3-8% better compression than gzip, so better than what I got here.
This was fun, and I think I might keep playing around with the idea, so I might do that comparison in the future.
Don’t underrate your results: compression is a mature area, and Zopfli is a state-of-the-art algorithm.
The results of running Zopfli are heavily dependent on the input file and the number of random trials.
The file is tiny so you can easily run 500,000+ trials without a long delay.
I used Zopfli for image optimization years ago and have seen larger improvements, e.g. 10% or more.
This is a very interesting idea. I wonder how this would work with JSON data. I have been trying to gzip-compress JSON data, and I found that using escaping instead of UTF-8 gave me a small compression boost (maybe because every byte then has its most significant bit set to 0?), but I have been wondering if I could squeeze out more by sorting arrays and dictionaries in a specific way.
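A quick way to test that with the standard library: json.dumps with ensure_ascii=True produces the \uXXXX-escaped form and sort_keys=True fixes the key order. The sample records below are invented, so only the relative sizes are meaningful:

import gzip
import json

# Invented sample: non-ASCII strings and keys in no particular order.
records = [{"città": "Ürümqi", "naïve": "café", "id": i} for i in range(1000)]

variants = {
    "utf-8, unsorted":   json.dumps(records, ensure_ascii=False),
    "escaped, unsorted": json.dumps(records, ensure_ascii=True),   # \uXXXX escapes
    "utf-8, sorted":     json.dumps(records, ensure_ascii=False, sort_keys=True),
    "escaped, sorted":   json.dumps(records, ensure_ascii=True, sort_keys=True),
}

for name, text in variants.items():
    size = len(gzip.compress(text.encode("utf-8"), compresslevel=9))
    print(f"{name}: {size} bytes gzipped")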