1. 22
  1.  

  2. 12

    I don’t like tools that make using awk or cut harder.

    The output could be improved by not using the parentheses around the bytes, or by having a flag.

    1. 5

      Tools should have optional JSON output so I can use jq queries instead of awk or cut :P

      1. 4

        I really like jq too :)

        1. 2

          https://github.com/Juniper/libxo

          This is integrated in most FreeBSD utilities.

          1. 3

            We should all switch to Powershell, where values have types and structure, instead of switching to a different text format.

        2. 10

          This claims to be 5x faster than a non-parallel version, which strongly implies that du is somehow(?) limited by computation speed instead of by disk access time, which is making me question the very nature of the universe. Can anyone explain that?

          1. 19

            No, it’s limited by disk access time, but all the disk access is done sequentially in du. stat(), stat(), stat(), … Every call must complete before the next, so your queue depth to read disk is only 1. If a dozen threads issue stat() at the same time, the disk can issue multiple reads. An SSD has a latency somewhere around 10ms, but it can complete more than one request per 10ms given the opportunity.
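
            To make the queue-depth point concrete, here is a minimal Rust sketch (not taken from du or from the tool under discussion; `collect_files` and `total_size` are hypothetical names). It assumes the paths were already gathered by a plain sequential walk, and it sums apparent sizes rather than the block counts du reports. The only point is that several threads can have a metadata lookup in flight at the same time:

            ```rust
            use std::fs;
            use std::path::{Path, PathBuf};
            use std::thread;

            /// Sequentially collect every non-directory entry below `root`.
            /// Symlinks are listed but never followed.
            fn collect_files(root: &Path) -> Vec<PathBuf> {
                let mut files = Vec::new();
                let mut pending = vec![root.to_path_buf()];
                while let Some(dir) = pending.pop() {
                    let entries = match fs::read_dir(&dir) {
                        Ok(entries) => entries,
                        Err(_) => continue, // unreadable directory: skip it
                    };
                    for entry in entries.flatten() {
                        match entry.file_type() {
                            Ok(t) if t.is_dir() => pending.push(entry.path()),
                            Ok(_) => files.push(entry.path()),
                            Err(_) => {}
                        }
                    }
                }
                files
            }

            /// Sum apparent file sizes using at most `workers` threads, so that
            /// several metadata lookups can be outstanding at once.
            fn total_size(files: &[PathBuf], workers: usize) -> u64 {
                let workers = workers.max(1);
                let chunk_len = ((files.len() + workers - 1) / workers).max(1);
                thread::scope(|scope| {
                    // Spawn every thread first, then join, so they run concurrently.
                    let handles: Vec<_> = files
                        .chunks(chunk_len)
                        .map(|chunk| {
                            scope.spawn(move || {
                                chunk
                                    .iter()
                                    .filter_map(|p| fs::symlink_metadata(p).ok())
                                    .map(|m| m.len())
                                    .sum::<u64>()
                            })
                        })
                        .collect();
                    handles
                        .into_iter()
                        .map(|h| h.join().unwrap_or(0))
                        .sum::<u64>()
                })
            }
            ```

            Calling `total_size(&collect_files(Path::new(".")), 12)` from `main` would keep up to a dozen lookups in flight. Whether that actually helps depends on the storage and on how much metadata is already cached.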

            1. 5

              Isn’t the solution then “make du parallel” instead of “rewrite du in a new language and introduce a second utility”?

              1. 9

                I predict people might ask if the performance of du is actually such a bottleneck to justify the complexity. If somebody posted a patch to pthread openbsd du, I wouldn’t immediately jump on board.

                You can sidestep such questions by dropping the code on github. Everybody says, wow, cool, that’s great, but nobody is required to make a decision to use or maintain such a tool.

                Maybe “it’s good enough, leave it alone” holds us back. Maybe “let’s change everything and hope it’s better” keeps us going in circles.

                cc @caleb

                1. 2

                  Relatively difficult call; the performance of du has definitely bothered me at times, but parallel C code is several kinds of hard-to-get-right.

              2. 0

                any reason du shouldn’t be modified to use this approach?

              3. 3

                It’s probably multi-faceted (as with any benchmark data). I believe du (and dup probably) look at metadata to find file sizes, so I don’t think there is much bandwidth required. On top of that, if the benchmark is using NAND flash storage, the firmware driver may be able to parallelize read requests far better than an HDD. Even then, issuing multiple read requests to the HDD driver can allow it to optimize its read pattern to minimize seek distance for the head. I would comment on multi-platter HDDs, but I honestly know very little about the implementation of HDDs and SSDs, so someone please correct me if I’ve said anything too far from the truth.

                You can even see speed-up in multi-threaded writing, for example in asynchronous logging libraries: https://github.com/gabime/spdlog

                EDIT: in short, the speedup is probably from parallelizing the overhead of file system access

                EDIT2: The benchmarks on the spdlog readme are a bad example because it uses multiple threads to write to the same file, not multiple threads writing to separate independent files

              4. 9

                I like the tendency by Go and now Rust of writing small utilities that are improvements of existing utilities.

                Looks very cool!

                But the name… dup? Aside from that being a built-in in many versions of Forth-like languages, it should be mnemonically associated with something that involves two of something.

                1. 4

                  I agree. I didn’t really think long before naming this dup. The project has now been renamed to diskus (short for “disk usage”, also: a German word for the disc in Discus throw).

                  1. 2

                    Bless you!

                    I made the github issue thinking you might have had strong feelings, but you listened and came up with a much more appropriate name. If only all my interactions with other developers were so reasonable.

                  2. 3

                    maybe author didn’t care for the sound of dush

                    1. 3

                      i don’t get the point. if you want du to be faster, make it faster and upstream the changes. why add more baggage that i have to drag around whenever i’m at a new computer?

                      is this just people padding their githubs to impress employers?

                      1. 2

                        I guess duparallel? But yes, it sounds like something that would find duplicates instead of calculating directory sizes. Hardest thing in computer science?… :)

                      2. 18

                        Depends on 36 crates, 1.9M output binary (after stripping it manually)

                        For some value of “minimal”, I guess.

                        1. 11

                          I feel like this is a bit of cheap shot, considering that this follows defaults and the defaults of rustc might differ from your expectations (e.g. debug symbols, stdlib, jemalloc, which is easily 2 megs). The crates situation is also debatable - it does use a concurrency framework under the hood that is well-factored into pieces.

                          Most of the size comes from just one pair of crates, btw: regex and regex_syntax. Most of the crates are indeed lightweight, providing such things as a reusable parking lot reimplementation.

                          The tool is rather small (a bit over 100 lines). If it were implemented in Go, for example, you’d pull all that in through a runtime, if it were in C, I’d be interested to see how much code you need to pull in (or write yourself) to get to an equivalent level.

                          If you’d want to trim this, there’s certainly ways, but yes, I think you can definitely say this is rather minimal.

                          1. 18

                            A stubborn focus on the absolute number of crates is indeed a bit weird from the perspective of “minimalism.” It would be “easy” to reduce that count if only everybody built monolithic crates and never factored things out into reusable components. For example, in regex’s case, the parser, aho-corasick, thread locals and UTF-8 automata construction tools could all legitimately be rolled directly into the regex crate itself, thereby decreasing the absolute number of crates and trivially satisfying “minimalists” such as @pcy everywhere. But that’s basically where the benefits stop, because now nobody can reuse any of those crates. The same goes for parking_lot or crossbeam. You could reduce the number of absolute crates by building more monolithic crates, and forcing anyone with a more refined use case to either copy and paste or depend on more than they need.

                            With that said, one might question why regex is being depended on at all in the first place, and I think that might be a valid criticism. e.g., It does look like dup is not using the ignore crate for its filtering support, but rather, for its parallel directory iterator. The only reason why these two things are coupled in the same crate is because I haven’t devoted any resources to de-coupling them. Ironically, decoupling them will (probably) just increase the absolute number of crates that one brings in when depending on ignore, but also simultaneously enabling tools like this to depend on fewer crates.

                            To play the devil’s advocate, an absolute number of crates can increase maintenance burden for folks. I tend to like to keep an eye on every dependency I use, transitively or otherwise, to make sure I’m up to date on what’s going on there. This becomes intractable as the number of crates grows. However, when the count balloons because one logical crate starts splitting itself apart internally, that tends to be OK because I’m only dealing with the higher level crate.

                            Lobsters does love their pithy one-line zingers, even when they lack substance, just like reddit. This is one of the reasons why I’m steadily growing to hate this place.

                            1. 5

                              The one-liners seem to be a consequence of showing karma, which I plan to avoid in my upcoming site.

                              Also, howdy. I was that fool on HN who challenged you to the xsv coding duel. It was a pleasant surprise to see you here, as well as minimaxer and a few other HN hats.

                              Would you write up a few thoughts on what you’d like to see out of a new site? (Or, alternatively, a few reasons why you are steadily growing to hate the present one.)

                              It will be a few years before things are significantly different, but the plan is basically to bring HN’s mod tools to the masses. Everyone can make their own HN front page (called a lambda) and moderate/curate in the ways that have proven effective on HN: changing titles, hiding karma, re-ranking stories regardless of upvotes, muzzling individual users, and so on. I’m particularly motivated to avoid the mistakes that plague the current batch of sites (including some of HN’s), so your perspective would help.

                              1. 6

                                Hiya. :-)

                                This is kind of a bit of a tangent for this thread, so I’ll just keep it brief. Basically, I would like to see a technical forum that is heavily moderated. I don’t necessarily mean moderation from the perspective of “let’s have a strong CoC,” but rather, moderation in the sense that “discourse should be high quality.” It’s an explicit intent to increase the barrier of contribution by demanding higher quality discussion.

                                The forum that comes to mind is r/AskHistorians. They have a very heavy handed approach to moderation, and as a result, I can go read discussions in that forum with virtually zero noise. It’s excellent.

                                There are serious downsides to this. For one, it requires someone willing to do the moderation work. Speaking from experience, this is freaking hard. Secondly, I don’t know how many people would be willing to participate in such a forum. There would be really hard questions about who gets to judge quality. r/AskHistorians generally gauges quality, from what I can tell, based on citations/sources and, to some extent, pedigree. I don’t know whether that would carry over nicely to a tech-focused forum, but it feels possible. Basically, stop blabbing shitty/low-effort opinions and start grounding them in experience (or others’ experiences) that others can learn from.

                                1. 10

                                  Hi,

                                  I would personally encourage anyone who wants to build a community like what burntsushi describes to try it. I can’t speak for pushcx or alynpost in this, but I don’t think Lobsters should regard new communities as a threat to its own prominence; rather, I think every community should understand that different places serve different purposes and every community benefits from the presence of the others.

                                  I do, also, call on crustaceans to continue working towards the ideal of high-quality conversation. I don’t think this has to mean there are never jokes; we shouldn’t ask people to stop being people just because the subject matter is technical. I think you all do an astonishing job of it, when we consider what technical forums which don’t prioritize depth and nuance look like.

                                  I wish I could volunteer to be part of burntsushi’s proposed effort; I do think it would be an interesting experiment. Unfortunately, in all honesty, the attention I’ve been able to give to Lobsters lately has already been suffering due to me prioritizing support and activism communities that I’m also a part of. I don’t think it would be fair to anyone to further divide my attention.

                                  Good luck!

                                  1. 2

                                    <3

                                  2. 1

                                    Personally I’m not completely against this kind of comment because they can (not always) spark interesting discussion (e.g. @pcy’s latest response). I also think discussion risks becoming very artificial/unnatural without them and I imagine an environment like you’re suggesting would be intimidating to a number of people who might otherwise have valuable input but lack self confidence.

                                    1. 3

                                      I mean, yeah. Clearly a lot of people like those sort of low substance pithy one liners. They get upvoted all the time here. That pattern is what I don’t like.

                                      Lobsters is what it is. I’m not out to change it. I was asked my thoughts on what a different community might look like, so I answered. And yes, I explicitly acknowledged that it raises the barrier to contribution. There is no free lunch. I’m sure there are ways to inspire confidence, but at the end of the day, you still wind up excluding low quality content.

                                      r/AskHistorians proves it’s possible. The experiment is carrying it over to a tech focused forum.

                                      1. 1

                                        I would say that the experiment is really whether your moderation team have the good judgement, time, and persistence to be the ones to do it. I’ve no doubt it’s possible, for the right people.

                                  3. 1

                                    [edit: Oops, replied to wrong post. LOL, confused by nesting!]

                                    1. 1

                                      minimaxer and a few other HN hats

                                      Hi shawn. Welcome to Lobsters!

                                      I have never seen that HN user, don’t use HN at all myself, and haven’t for years. Even back then, I didn’t have an account. Just to be clear. Clever name, though :-)

                                      Your new site sounds interesting. I’ll look out for it. I wish I had time to help out. I think that if you give everybody effective mod tools and ownership of their personal spaces, you can probably do without “karma” or similar gamification schemes altogether. Lobsters’ public invite tree does a great job at mitigating spammers and bots, and it’s such a simple thing. Public mod log is a very good feature as well.

                                    2. 4

                                      Thank you for writing this insightful and informative reply to the pithy one-liner.

                                    3. 6

                                      I feel like this is a bit of cheap shot

                                      Yes, mea culpa, sorry. (Although I’m currently writing something for a system that has 4 megabytes of RAM, and seeing that that amount of code is needed to do something relatively simple feels a bit weird.)

                                      and the defaults of rustc might differ from your expectations (e.g. debug symbols, stdlib, jemalloc, which is easily 2 megs).

                                      I actually compiled with the --release flag, and with LTO turned on (which I’d expect to eliminate unused code).

                                      I am aware of the fact that some people do use Rust and emit small binaries. For example, the tools made by the demogroup Logicoma are all written in Rust (except for their synthesizer), and yet the output is around 64 kilobytes.

                                      if it were in C, I’d be interested to see how much code you need to pull in (or write yourself) to get to an equivalent level.

                                      Hold my beer. :)

                                      1. 4

                                        Thanks for the reply :).

                                        Well, I deployed full tokio/serde applications on 8MB, so this is definitely feasible even with the large frameworks. libstd and alloc do have an undeniable base cost. It’s interesting how the growth of a Rust application is very much “steep, and then quickly getting smaller”.

                                        Hold my beer. :)

                                        I’ll also buy you the next one, feel welcome!

                                        1. 7

                                          I delivered: mdu.c. 221 lines (as counted by cloc). I used only libc, and even went as far as trying to use syscalls only. The result isn’t very “industry-grade”, but that wasn’t really my goal to begin with.

                                          It spawns new ‘threads’ using fork(2), and communication is done using pipes (read(2), write(2) etc. are atomic). The main thread tells a worker thread which directory to (non-recursively) look at, these worker threads send newly discovered directories back to the main thread, which puts them in a queue. The worker thread signals it has done counting the file sizes of a single directory by letting the main thread know how large that dir is. The main thread then looks at the queue, and assigns one new directory to each idle thread. (Or just read the source.)
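
                                          For anyone who would rather see the shape of that loop than read the C, here is a rough Rust sketch of the same dispatcher idea. It is not mdu.c: it assumes threads and `std::sync::mpsc` channels instead of fork(2) and pipes, it sums apparent file sizes, and `Report`, `scan_dir` and `disk_usage` are made-up names.

                                          ```rust
                                          use std::collections::VecDeque;
                                          use std::fs;
                                          use std::path::{Path, PathBuf};
                                          use std::sync::mpsc;
                                          use std::thread;

                                          /// What a worker sends back after scanning one directory.
                                          struct Report {
                                              worker: usize,
                                              subdirs: Vec<PathBuf>,
                                              bytes: u64,
                                          }

                                          /// Non-recursive scan: list subdirectories, sum the sizes of everything else.
                                          fn scan_dir(dir: &Path) -> (Vec<PathBuf>, u64) {
                                              let mut subdirs = Vec::new();
                                              let mut bytes = 0;
                                              if let Ok(entries) = fs::read_dir(dir) {
                                                  for entry in entries.flatten() {
                                                      match entry.file_type() {
                                                          Ok(t) if t.is_dir() => subdirs.push(entry.path()),
                                                          Ok(_) => bytes += entry.metadata().map(|m| m.len()).unwrap_or(0),
                                                          Err(_) => {}
                                                      }
                                                  }
                                              }
                                              (subdirs, bytes)
                                          }

                                          /// Dispatcher: keeps a queue of directories and gives one to each idle worker.
                                          fn disk_usage(root: &Path, workers: usize) -> u64 {
                                              let workers = workers.max(1);
                                              let (report_tx, report_rx) = mpsc::channel::<Report>();
                                              let mut job_txs = Vec::new();
                                              let mut handles = Vec::new();
                                              for id in 0..workers {
                                                  let (job_tx, job_rx) = mpsc::channel::<PathBuf>();
                                                  let report_tx = report_tx.clone();
                                                  job_txs.push(job_tx);
                                                  handles.push(thread::spawn(move || {
                                                      // Ends once the dispatcher drops this worker's job sender.
                                                      for dir in job_rx {
                                                          let (subdirs, bytes) = scan_dir(&dir);
                                                          if report_tx.send(Report { worker: id, subdirs, bytes }).is_err() {
                                                              break;
                                                          }
                                                      }
                                                  }));
                                              }
                                              drop(report_tx);

                                              let mut queue = VecDeque::from([root.to_path_buf()]);
                                              let mut idle: Vec<usize> = (0..workers).collect();
                                              let mut busy = 0;
                                              let mut total = 0;
                                              loop {
                                                  // Hand queued directories to idle workers, one each.
                                                  while !queue.is_empty() && !idle.is_empty() {
                                                      let id = idle.pop().unwrap();
                                                      let dir = queue.pop_front().unwrap();
                                                      job_txs[id].send(dir).expect("worker exited early");
                                                      busy += 1;
                                                  }
                                                  if busy == 0 {
                                                      break; // nothing queued and nobody working: done
                                                  }
                                                  let report = report_rx.recv().expect("all workers exited early");
                                                  busy -= 1;
                                                  idle.push(report.worker);
                                                  total += report.bytes;
                                                  queue.extend(report.subdirs);
                                              }
                                              drop(job_txs); // closes the job channels so the workers exit
                                              for handle in handles {
                                                  let _ = handle.join();
                                              }
                                              total
                                          }
                                          ```

                                          Something like `disk_usage(Path::new("."), 8)` called from `main` would run it. Threads share an address space, so this skips the serialisation that the pipe-based version has to do.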

                                          But how big is it, and does it work?

                                          $ ll . # Including all the comments, of course.
                                          -rw-r--r-- 1 9946  mdu.c
                                          $ gcc -O3 -omdu mdu.c && wc -c mdu
                                          17744 mdu
                                          $ strip -s mdu && wc -c mdu
                                          14536 mdu
                                          $ git clone https://github.com/sharkdp/diskus # Let's download some testing material (looks like they changed the name)
                                          $ ./mdu
                                          28578
                                          

                                          14k is still a bit much, don’t you think? Let’s try cheating a little:

                                          $ sstrip -z mdu && wc -c mdu
                                          12533
                                          

                                          (sstrip)

                                          Hmm, not bad. But let’s try something better/cheatier: executable compression. I’m using my own unpacker, as UPX etc. wouldn’t do much good, as the input binary is already quite small.

                                          $ xz -9 --extreme --keep --stdout mdu > mdu.xz && cat $foo/vondehi mdu.xz > mdu.vndh && chmod +x mdu.vndh && wc -c mdu.vndh
                                          3590 mdu.vndh
                                          $ ./mdu.vndh # Including the .xz file etc.
                                          32136
                                          

                                          And now it’s less than a single page :).

                                          In the large comment at the top of the file, I explained how to make it even smaller (it’s probably not that hard to cut the size in half), but I’m too lazy to do it.

                                          EDIT: Of course, it’s linking dynamically to libc. The actual thing must be multiple megabytes, right?

                                          $ musl-gcc -static -O3 -omdu.musl mdu.c && wc -c mdu.musl
                                          34320 mdu.musl
                                          

                                          And if you get rid of all the syscalls, you can make a much smaller static binary anyway.

                                          EDIT2:

                                          I’ll also buy you the next one, feel welcome!

                                          I actually don’t drink anything containing alcohol, but thanks anyway.

                                          About the “trolly oneliner” conversation: I’m actually surprised that silly comment got that many votes in the first place.

                                          1. 2

                                            I must admit that I didn’t have time to read through it yet (was sick over the weekend), but thanks for writing it already. I’ll have a check :)

                                            I actually don’t drink anything containing alcohol, but thanks anyway.

                                            Any other beverage will do ;).

                                            About the “trolly oneliner” conversation: I’m actually surprised that silly comment got that many votes in the first place.

                                            I’m not, sadly. That thing flies so well on the internet, because it triggers the “cool, agree” emotion, even without looking at the subject of the comment in detail.

                                  4. 2

                                    You can pry ncdu from my cold, dead hands

                                    1. 1

                                      And there’s already a du replacement called ‘dup’: https://github.com/ritze/dup

                                      1. 1

                                        That’s a wrapper that hasn’t been touched in years.

                                      2. 1

                                        See also this that made it to this site five or six months ago, but I can’t find the submission.

                                        https://github.com/bootandy/dust

                                        1. [Comment removed by author]

                                          1. [Comment removed by author]

                                            1. 2

                                              Who are you talking to? I see nothing to indicate that the submitter of this post is the author of the tool. Go file a github issue, or better yet a pull request.

                                              Sorry, I haven’t checked if the OP is the author, will create an issue on github.