I know everyone loves to write stuff in rust, but why not use unix tools you already have?
find . -type f | xargs -n 1 -P 4 shasum -a 512 > files
some time later
find . -type f | xargs -n 1 -P 4 shasum -a 512 > files2
and finally you can use
diff
to find differences. Probably best to throw in asort
after thefind
for better diffing.It is all already there, right on your system.
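To make the recipe concrete, here is a minimal sketch of the whole loop, with a few tweaks: sha256sum stands in for shasum -a 512, -print0/-0 guard against odd filenames, and the snapshots are written outside the scanned tree (file names are illustrative):

```shell
#!/bin/sh
# Snapshot 1: hash every file, sort by path so diff lines up later.
# The snapshot goes OUTSIDE the tree so find does not pick it up.
find . -type f -print0 | xargs -0 -n 1 sha256sum | sort -k 2 > /tmp/files

# ... some time later, after files have changed ...
find . -type f -print0 | xargs -0 -n 1 sha256sum | sort -k 2 > /tmp/files2

# Any output means a file changed, vanished, or appeared.
diff /tmp/files /tmp/files2
```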
You need a sort, then you need a diff that doesn’t show additions, only deletions and changes, then you need the file2→file shuffling at the end, and pretty soon you have yourself a chunky script—and you haven’t even gotten to pretty-printing a progress bar. I’m a shell scripting junkie, but this comment is not fair to what the tool in this post is actually doing.
Okay, this was a 5 minute hack. I can spend 10 more minutes on it and it does everything the tool does (minus the progress bar). The point is that the unix philosophy tells us that you should combine the tools you have to build higher level tools, and this is perfectly doable in no time for the problem at hand.
OK, but maybe one can interpret this as the “tools you have” being Rust packages, and you combine them in Rust? Then you get the benefit of using a modern language with data types and stuff, instead of a gnarly shell with WTF syntax* whose only data type is byte-streams.
* I know there are shell enthusiasts here, but really, if it didn’t exist and anyone announced it as a new programming language they’d be laughed out of town IMO.
With Relational pipes:

find -type f -print0 | relpipe-in-filesystem --file path --streamlet hash --relation files_1 > ../files_1.rp
# do some changes
find -type f -print0 | relpipe-in-filesystem --file path --streamlet hash --relation files_2 > ../files_2.rp
cat ../files_*.rp | relpipe-tr-sql --relation 'diff' "SELECT f1.path, f1.sha256 AS old_hash, f2.sha256 AS new_hash FROM files_1 AS f1 LEFT JOIN files_2 AS f2 ON (f1.path = f2.path) WHERE f1.sha256 <> f2.sha256 OR f2.path IS NULL" | relpipe-out-tabular

And get a result like this (I removed bash-completion.sh, modified CLIParser.h and added a new file that is not listed):

I use this to catalog some removable/remote media and search them offline.
I love myself a good progress bar! :D
git diff --no-index --diff-filter=MD. I use git diff --no-index quite a lot (as an alias) as I like it over the plain diff command.

For me personally it was a few different things.
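For anyone unfamiliar with it, a small sketch of what --no-index with --diff-filter=MD does on two directory trees (the directories and files here are made up for illustration):

```shell
#!/bin/sh
# Two hypothetical snapshots of a tree, as plain directories.
mkdir -p old new
echo same   > old/a.txt; cp old/a.txt new/a.txt   # unchanged
echo before > old/b.txt; echo after > new/b.txt   # modified
echo gone   > old/c.txt                           # deleted in new
echo extra  > new/d.txt                           # added in new

# --diff-filter=MD keeps Modified and Deleted entries and hides
# additions; --name-status prints just a status letter and a path.
# git exits 1 when the trees differ, hence the || true.
git diff --no-index --diff-filter=MD --name-status old new || true
```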
find piped to xargs is a great example of something I just never use anymore, now that I have a Rewrite It In Rust™ version (fd). Nice and simple:

fd -t f -x shasum -a 512

Let’s see how it compares to your version . . . Whoops! Your version doesn’t work at all! It breaks with lots of perfectly valid filenames. To fix that, I have to add -0 and -print0, so it’s even more awkward to use, and it’s still slower than fd.

I don’t use too many of the Rewrite It In Rust™ tools myself, but fd and rg are really nice.

As an aside, you should also probably get out of the habit of running find . and redirecting it to a file in the current directory, since find is likely feeding its output back into itself. When I run your find, I end up with a checksum for files which isn’t right anymore by the time find finishes running. It’s probably not what you wanted.

Yeah. I also avoid xargs, and use find -exec instead. I definitely like the single-character flags in fd, though.

I think the difference is find -exec alone isn’t parallel; you have to add something like xargs to get parallelism. Whereas fd -x is parallel by itself.

Because these tools straight up don’t work properly, unless you heavily invest in learning them. Your solution, for example, breaks if there are spaces in filenames.

$ echo 1 > '1 2'
$ echo 3 > '3 4'
$ find . -type f | xargs -n 1 -P 4 shasum -a 512 > files
shasum: ./1: No such file or directory
shasum: 2: No such file or directory
shasum: ./3: No such file or directory
shasum: 4: No such file or directory
This one will work:

find . -type f -print0 | xargs -0 -n 1 -P 4 shasum -a 512
It’s definitely faster to write this down in bash than... Java. But I’d definitely be scared of messing up quoting, special file names, and whatever else shellcheck tries to catch. And then there is the issue of remembering what this actually did in a few weeks.
I’d suggest aide, sort (case-insensitive), then comm.
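The sort-then-comm part of that suggestion, sketched on stand-in snapshot data (real snapshots would come from the hashing pipelines elsewhere in the thread):

```shell
#!/bin/sh
# Two fake checksum snapshots: "hash  path" per line.
printf '%s\n' 'aaa  ./f1' 'bbb  ./f2'             > files
printf '%s\n' 'aaa  ./f1' 'ccc  ./f2' 'ddd  ./f3' > files2

# comm needs both inputs sorted the same way.
sort files  > files.sorted
sort files2 > files2.sorted

# -3 suppresses lines common to both snapshots; what is left is
# old entries (column 1) and new entries (indented, column 2),
# i.e. changed, removed, or added files.
comm -3 files.sorted files2.sorted
```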
Why not use ZFS instead?
It already has hashes for everything you put on it, it’s free, and with redundancy it will even heal your broken chunks.
… and you will even save some/a lot of space with built-in zstd compression.
Because you already have nGB of files on your Mac or Windows box that would be Quite Traumatic to move over to another box specifically running ZFS? Whereas I can compile this tool up on my Macbook for both macOS and Windows without any faff.
ZFS is awesome but does come with costs. E.g. because of the licensing situation it’s a bit of a pain to use on Linux (both because you have to build zfs.ko somehow and because it’s just not deeply integrated in the kernel, so you get weirdness like having two competing I/O elevators), defragging it requires a send/recv round-trip, it can’t do cp --reflink… etc. etc.

That being said: I’m a VERY happy ZFS user. I switched from btrfs years ago after btrfs ate my data for the third(!) time, and I’ve never looked back.
If you really want durable storage for a vault of important files, check out git annex. It handles storing large files in Git repositories in a long-term manageable way—keeping distributed records of other checkouts and which checkouts have which files, maintaining at least N copies of files on other checkouts if you want to lighten up a filesystem by deleting some files from a checkout, etc. It also handles faulty data quite gracefully and helps you replace corrupted blobs with copies from whichever other repositories have them.

This is cool! How common is bit-rot on typical storage media? It’s the kind of thing I usually pretend doesn’t exist 😬
On filesystems with extended attributes, you could store the digest as an attribute on the file itself. Then the tool can operate without an external database. If you also store a digest-of-digests in each directory, you can detect files being deleted or added or swapped out.
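The digest-of-digests idea can be sketched without xattrs. This hypothetical dir_digest helper hashes the sorted list of per-file digests, so any add, remove, or modification changes the directory’s digest (the comment above would store these as extended attributes instead of recomputing them):

```shell
#!/bin/sh
# Hash every file under a directory, then hash the sorted digest list
# itself to get one digest for the whole directory.
dir_digest() (
  cd "$1" || exit 1
  find . -type f -print0 | xargs -0 sha256sum | sort -k 2 | sha256sum | cut -d ' ' -f 1
)

mkdir -p demo && echo hello > demo/a.txt
before=$(dir_digest demo)
echo tampered > demo/a.txt
after=$(dir_digest demo)
[ "$before" != "$after" ] && echo "change detected"
```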
Then you could allow the tool to create signatures, (or just sign the root dir’s digest) and now you’ve got an archive no one else can alter undetected even if they re-run the tool afterward…
I love hash trees, and I cannot lie.
[More common than you would think](https://youtu.be/fE2KDzZaxvE?t=28m17s).
You’re describing fs-verity! https://www.kernel.org/doc/html/latest/filesystems/fsverity.html
Interesting! Apple’s filesystems may have something like this too, since they make extensive use of code-signing.
I also need to hash a large directory structure in Rust to track changes, so I was interested to check out the code to see if it can be done more efficiently than I’m currently doing, but I get a 401 when I go to the repo.
Ah, definitely not my intention, fixed the permissions.
hg clone https://hg.sr.ht/~cyplo/legdur

should work for everyone now.

Thanks!
I think that’s why the blog post says “Let me know if you’d like to hack on this by contacting me”.
I also haven’t figured out how much of collab infra I need/want, it’s mostly just a mercurial repo for now :)
Seems a bit like snapraid https://www.snapraid.it/
This tool might be directly useful for me and I’m trying it out. I’m pretty sure I’ve run into bitrot in files I’ve had for a very long time, e.g. random sound glitches in mp3 files I downloaded in middle or high school that I don’t think were always there. I’ve been using ZFS or btrfs file systems to store most of my files for the past several years, so hopefully that is less of a concern now.