1. 21
  1.  

  2. 17

    I know everyone loves to write stuff in rust, but why not use unix tools you already have?

    find . -type f | xargs -n 1 -P 4 shasum -a 512 > files

    some time later

    find . -type f | xargs -n 1 -P 4 shasum -a 512 > files2

    and finally you can use diff to find differences. Probably best to throw in a sort after the find for better diffing.

    It is all already there, right on your system.
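
    A minimal sketch of the sorted variant, for what it is worth. It assumes either shasum or sha512sum is on the PATH and that the manifests are written outside the tree being hashed. The snapshot helper and the TREE and OUT names are made up for illustration, and filenames containing newlines are not handled.

    ```shell
    # Hypothetical helper: hash every file under $1 and write a path-sorted manifest to $2.
    # Assumes shasum (or sha512sum as a fallback) is installed and that $2 is outside $1.
    snapshot() {
      if command -v shasum >/dev/null 2>&1; then hasher="shasum -a 512"; else hasher="sha512sum"; fi
      # -print0 / -0 keep filenames with spaces intact; sort -k 2 orders the lines by path.
      (cd "$1" && find . -type f -print0 | xargs -0 -n 1 -P 4 $hasher) | LC_ALL=C sort -k 2 > "$2"
    }

    # Demo on a throwaway tree.
    TREE=$(mktemp -d)
    OUT=$(mktemp -d)
    echo one > "$TREE/a.txt"
    echo two > "$TREE/b c.txt"
    snapshot "$TREE" "$OUT/files"
    echo changed > "$TREE/a.txt"          # simulate a modification some time later
    snapshot "$TREE" "$OUT/files2"
    # diff exits with status 1 when the manifests differ, so capture its output instead.
    changes=$(diff "$OUT/files" "$OUT/files2" || true)
    printf '%s\n' "$changes"
    ```

    Sorting by path keeps the diff stable even though xargs -P finishes the hashes in arbitrary order.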

    1. 13

      You need a sort, then you need a diff that doesn’t show additions only deletions and changes, then you need the file2→file shuffling at the end, and pretty soon you have yourself a chunky script—and you haven’t even gotten to pretty-printing a progress bar. I’m a shell scripting junkie but this comment is not fair to what the tool in this post is actually doing.
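
      To give an idea of the "deletions and changes only" step, here is a rough sketch using awk over two manifests in the "hash  path" form that shasum prints. The manifest contents, the OLD and NEW names and the report format are invented for the example; the progress bar and the file2→file shuffling are left out.

      ```shell
      # Hypothetical manifests in "hash  path" form (two spaces between hash and path).
      OLD=$(mktemp)
      NEW=$(mktemp)
      printf '%s\n' 'aaa  ./kept.txt' 'bbb  ./changed.txt' 'ccc  ./deleted.txt' > "$OLD"
      printf '%s\n' 'aaa  ./kept.txt' 'ddd  ./changed.txt' 'eee  ./added.txt' > "$NEW"

      # First pass (NR == FNR) loads the new manifest; the second pass walks the old one.
      # Additions never show up because only paths from the old manifest are looked up.
      # Caveat: the NR == FNR idiom assumes the new manifest is not empty.
      report=$(awk '
        { hash = $1; path = substr($0, length($1) + 3) }   # path may contain spaces
        NR == FNR { newhash[path] = hash; next }
        !(path in newhash) { print "deleted: " path; next }
        newhash[path] != hash { print "changed: " path }
      ' "$NEW" "$OLD")
      printf '%s\n' "$report"
      ```

      On the sample manifests this reports ./changed.txt as changed and ./deleted.txt as deleted, and stays silent about ./added.txt.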

      1. 3

        Okay, this was a 5-minute hack. I can spend 10 more minutes on it and it does everything the tool does (minus the progress bar). The point is that the unix philosophy tells us that you should combine the tools you have to build higher level tools and this is perfectly doable in no time for the problem at hand.

        1. 18

          OK, but maybe one can interpret this as the “tools you have” being Rust packages, and you combine them in Rust? Then you get the benefit of using a modern language with data types and stuff, instead of a gnarly shell with WTF syntax* whose only data type is byte-streams.

          * I know there are shell enthusiasts here, but really, if it didn’t exist and anyone announced it as a new programming language they’d be laughed out of town IMO.

        2. 2

          With Relational pipes:

          find -type f -print0 | relpipe-in-filesystem --file path --streamlet hash --relation files_1 > ../files_1.rp
          # do some changes
          find -type f -print0 | relpipe-in-filesystem --file path --streamlet hash --relation files_2 > ../files_2.rp
          cat ../files_*.rp | relpipe-tr-sql --relation 'diff' "SELECT f1.path, f1.sha256 AS old_hash, f2.sha256 AS new_hash FROM files_1 AS f1 LEFT JOIN files_2 AS f2 ON (f1.path = f2.path) WHERE f1.sha256 <> f2.sha256 OR f2.path IS NULL" | relpipe-out-tabular
          

          And get a result like this (I removed bash-completion.sh, modified CLIParser.h and added a new file that is not listed):

          diff:
           ╭──────────────────────┬──────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────╮
           │ path        (string) │ old_hash                                                (string) │ new_hash                                                (string) │
           ├──────────────────────┼──────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────┤
           │ ./bash-completion.sh │ 3f8c20eb917302c01a029711e3868665185c314d4a8d32cc2dfe1091521402c8 │                                                                  │
           │ ./src/CLIParser.h    │ ba75414ced3163ce2b4a31f070a601d27258750806843bb18464e5efb5bc71fd │ 3f64b56674587f3b4130d531d46b547a44c18523abf0c4a3c09696277a4de6f0 │
           ╰──────────────────────┴──────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────╯
          Record count: 2
          

          I use this to catalog some removable/remote media and search them offline.

          1. 1

            git diff --no-index --diff-filter=MD. I use git diff --no-index quite a lot (as an alias) as I like it over the plain diff command.

            1. 1

              I love myself a good progress bar! :D

            2. 11

              I know everyone loves to write stuff in rust, but why not use unix tools you already have?

              For me personally it was a few different things.

              • I like Rust, but we don’t use it much at work, so I try not to forget how to use it by writing small programs like this one.
              • Speed: the computations are heavily parallelised, which was easier for me to write in Rust than with e.g. GNU parallel.
              • I cannot stress enough how much I love progress bars, and writing a decent one in a shell scripting language exceeds my allocated brain points :D
              1. 7

                find piped to xargs is a great example of something I just never use anymore, now that I have a Rewrite It In Rust™ version (fd). Nice and simple:

                    fd -t f -x shasum -a 512
                

                Let’s see how it compares to your version . . . Whoops! Your version doesn’t work at all! It breaks with lots of perfectly valid filenames. To fix that, I have to add -0 and -print0, so it’s even more awkward to use, and it’s still slower than fd.

                I don’t use too many of the Rewrite It In Rust™ tools myself, but fd and rg are really nice.

                As an aside, you should also probably get out of the habit of running find . and redirecting it to a file in the current directory, since find is likely feeding its output back into itself. When I run your find, I end up with a checksum for files which isn’t right anymore by the time find finishes running. It’s probably not what you wanted.

                1. 1

                  Yeah. I also avoid xargs, and use find -exec instead. I definitely like the single character flags in fd, though.

                  1. 1

                    I think the difference is that find -exec alone isn’t parallel; you have to add something like xargs to get parallelism, whereas fd -x is parallel by itself.

                2. 4

                  Because these tools straight up don’t work properly, unless you heavily invest in learning them. Your solution, for example, breaks if there are spaces in filenames.

                  $ echo 1 > '1 2'
                  $ echo 3 > '3 4'
                  $ find . -type f | xargs -n 1 -P 4 shasum -a 512 > files
                  shasum: ./1: No such file or directory
                  shasum: 2: No such file or directory
                  shasum: ./3: No such file or directory
                  shasum: 4: No such file or directory
                  

                  This one will work:

                  find . -type f -print0 | xargs -0 -n 1 -P 4 shasum -a 512
                  
                  1. 1

                    It’s definitely faster to write this down in bash than in, say, Java. But I’d definitely be scared of messing up quoting, special file names, and whatever else shellcheck tries to catch. And then there is the issue of remembering what this actually did a few weeks later.

                    1. 1

                      I’d suggest aide, sort (case-insensitive), then comm.

                    2. 5

                      Why not use ZFS instead?

                      It already has hashes for everything you put on it, it’s free, and with redundancy it will even heal your broken chunks.

                      … and you will even save some/a lot of space with builtin zstd compression.

                      1. 1

                        Why not use ZFS instead?

                        Because you already have nGB of files on your Mac or Windows box that would be Quite Traumatic to move over to another box specifically running ZFS? Whereas I can compile this tool up on my Macbook for both macOS and Windows without any faff.

                        1. 1

                          ZFS is awesome but does come with costs. E.g. because of the licensing situation it’s a bit of a pain to use on Linux (both because you have to build zfs.ko somehow and because it’s just not deeply integrated in the kernel so you get weirdness like having two competing I/O elevators), defragging it requires a send/recv round-trip, it can’t do cp --reflink… etc. etc.

                          That being said: I’m a VERY happy ZFS user. I switched from btrfs years ago after btrfs ate my data for the third(!) time, and I’ve never looked back.

                        2. 4

                          If you really want durable storage for a vault of important files, check out git annex. It handles storing large files in Git repositories in a long term manageable way—keeping distributed records of other checkouts and what checkouts have what files, maintaining at least N copies of files on other checkouts if you want to lighten up a filesystem by deleting some files from a checkout, etc. It also handles faulty data quite gracefully and helps you copy in corrupted blobs from whichever other repositories have copies.

                          1. 3

                            This is cool! How common is bit-rot on typical storage media? It’s the kind of thing I usually pretend doesn’t exist 😬

                            On filesystems with extended attributes, you could store the digest as an attribute on the file itself. Then the tool can operate without an external database. If you also store a digest-of-digests in each directory, you can detect files being deleted or added or swapped out.

                            Then you could allow the tool to create signatures, (or just sign the root dir’s digest) and now you’ve got an archive no one else can alter undetected even if they re-run the tool afterward…

                            I love hash trees, and I cannot lie.
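
                            A very rough sketch of the digest-of-digests idea, flattened to a single root digest and kept in a variable instead of extended attributes (so not a real per-directory hash tree). The tree_digest helper and the demo files are made up for illustration, and it assumes shasum or sha256sum is installed.

                            ```shell
                            # Hypothetical helper: one digest for a whole tree, computed over the
                            # path-sorted list of per-file digests.
                            tree_digest() {
                              if command -v shasum >/dev/null 2>&1; then hasher="shasum -a 256"; else hasher="sha256sum"; fi
                              (cd "$1" && find . -type f -exec $hasher {} + | LC_ALL=C sort -k 2 | $hasher | cut -d ' ' -f 1)
                            }

                            TREE=$(mktemp -d)
                            echo one > "$TREE/a"
                            echo two > "$TREE/b"
                            before=$(tree_digest "$TREE")
                            mv "$TREE/b" "$TREE/c"              # same content under a different name
                            after_rename=$(tree_digest "$TREE")
                            mv "$TREE/c" "$TREE/b"              # put it back
                            restored=$(tree_digest "$TREE")
                            echo "$before"
                            echo "$after_rename"
                            ```

                            Because the paths are part of what gets hashed, a rename, deletion or addition changes the root digest even when every individual file digest is unchanged.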

                            1. 1

                              This is cool! How common is bit-rot on typical storage media?

                              [More common than you would think](https://youtu.be/fE2KDzZaxvE?t=28m17s).

                              Then you could allow the tool to create signatures, (or just sign the root dir’s digest) and now you’ve got an archive no one else can alter undetected even if they re-run the tool afterward…

                              You’re describing fs-verity! https://www.kernel.org/doc/html/latest/filesystems/fsverity.html

                              1. 1

                                Interesting! Apple’s filesystems may have something like this too, since they make extensive use of code-signing.

                            2. 1

                              I also need to hash a large directory structure in Rust to track changes, so I was interested in checking out the code to see whether it can be done more efficiently than I’m currently doing it, but I get a 401 when I go to the repo.

                              1. 2

                                Ah, definitely not my intention, fixed the permissions. hg clone https://hg.sr.ht/~cyplo/legdur should work for everyone now.

                                1. 1

                                  Thanks!

                                2. 1

                                  I think that’s why the blog post says “Let me know if you’d like to hack on this by contacting me”.

                                  1. 1

                                    I also haven’t figured out how much collab infra I need/want; it’s mostly just a Mercurial repo for now :)

                                3. 1

                                  Seems a bit like snapraid https://www.snapraid.it/

                                  1. 1

                                    This tool might be directly useful for me and I’m trying it out. I’m pretty sure I’ve run into bitrot in files I’ve had for a very long time, e.g. random sound glitches in mp3 files I downloaded in middle or high school that I don’t think were always there. I’ve been using ZFS or btrfs file systems to store most of my files for the past several years, so hopefully that is less of a concern now.