1. 68
  1.  

  2. 17

    This is so cool! I really like the structure of this post: recognizing something one person has done well (and therefore other people have failed) and then explaining it

    1. 10

      “done well”

      https://github.com/openbsd/src/blob/master/usr.bin/yes/yes.c https://github.com/coreutils/coreutils/blob/master/src/yes.c

      Optimizing to the extreme for fun is kind of interesting, but doing it at the expense of clarity with nothing really to gain seems like a loss.

      1. 9

        I really don’t like GNU’s implementation. NetBSD and COHERENT seem to have the most readable yes out of all the yesses I looked over (BusyBox had the worst). It may be possible to apply this to other utilities like dd and cat, which I plan to look into soon (unless someone else beats me to it!).

        1. 4

          Who on Earth thinks that BusyBox thing is a good idea? I’d hate to see anything even remotely complicated from whoever wrote that.

          1. 5

            It’s super compact both in code size and resource consumption (one stack variable!!), and it’s still relatively easy to understand. I’d say it’s doing its job marvellously.

            1. 3

              Haven’t had time to look at the code, but Alpine Linux uses it by default. And it’s targeted mostly at embedded Linux, so I’m guessing ultra optimization is more important to them than readability in this case.

              1. 1

                Yeah, that isn’t cool. I thought they were just trying to avoid reusing a variable, then I realised they were reusing a variable, and/or moving on to argv[1] :(

            2. 8

              with nothing really to gain

              One poster on Hacker News suggested this: https://news.ycombinator.com/item?id=14543640

              1. 4

                Classic HN. Always reject the mundane explanation that the program is fast because somebody wanted it to go fast in favor of a narrative involving an epic struggle against corporate overlords.

                1. 5

                  Check the thread again, GNU explicitly asks people to do this: https://www.gnu.org/prep/standards/standards.html#Reading-Non_002dFree-Code

                  1. 1

                    So why did they wait so long to make this change?

                    1. 5

                      I’m rejecting your characterization of that HN comment, because this is a common method for GNU programs. I am not rejecting your assessment of why it changed though.

              2. 5

                This wasn’t done “with nothing really to gain” (although the gain might be subjective). It was performed as a reaction to a filed bug: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20029

                1. 2

                  Interesting, I wonder what the backstory to that is. The example is oddly specific (involving a pipeline of yes, echo, a shell range expansion, head, and md5sum), enough that it looks like an unexpected slowdown someone actually ran into in practice, vs. just a bored person benchmarking yes.

                2. 1

                  If “yes” was written once, decades ago, and someone spent all of one entire week validating, I’m ok with getting a 10x performance increase on every *nix system in existence ongoing.

                  I love it when pipelines/shell scripts can scale vertically for a long time before having to rewrite in some native language.

              3. 11

                For me the most interesting part is not the program yes or why GNU yes is faster than the others, but the bottleneck. Why 10 GB/s? kjensenxz speculates that the process is limited by memory bandwidth because a pipe must use memory to pass data through it. That answer is unsatisfactory, because the whole point of caches is to avoid hitting memory, and caches do not care whether the memory is used to back I/O buffers for a pipe or something entirely different. It’s just memory.

                But if you’re writing 8kB buffers at 10GB/s, you’re doing north of a million syscalls per second. That would leave a few thousand cycles between each call on a CPU running at a few GHz. At the other end of the pipe, it’s worse: you’re doing both a write (to /dev/null) and a read from stdin (EDIT: pv apparently uses splice so there’s no round trip through userspace). Of course, all 8kB of the buffer still need to be copied sometime during these cycles.
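
                To make that arithmetic concrete, here is a rough sketch in C. The function names are made up for illustration, and the 3 GHz clock used in the note below is an assumption standing in for "a few GHz"; 8kB is taken as 8192 bytes.

                ```c
                #include <stdint.h>

                /* Number of write() calls per second needed to sustain
                   `bytes_per_sec` when each call moves `buf_bytes` bytes. */
                uint64_t syscalls_per_sec(uint64_t bytes_per_sec, uint64_t buf_bytes) {
                    return bytes_per_sec / buf_bytes;
                }

                /* CPU cycles available per call at a clock rate of `hz`
                   (hypothetical value; pick whatever matches your machine). */
                uint64_t cycles_per_syscall(uint64_t hz, uint64_t calls_per_sec) {
                    return hz / calls_per_sec;
                }
                ```

                With 10,000,000,000 bytes/s and 8192-byte buffers this gives about 1.22 million calls per second, and at an assumed 3 GHz roughly 2,450 cycles per call, which has to cover the syscall itself plus the copy of the buffer.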

                It is unclear to me to what extent these processes must run in lockstep, without looking into kernel internals. If the kernel isn’t juggling any spare buffers around, then the processes must run synchronously; there is no useful work for the writing process to do until its output has been passed on and it can start to write again. So how much SMP affects the result is a bit of a mystery. But I can see a considerable slowdown if I force the processes to run on one core. This could come mainly from context switching overhead?

                It really might just be a coincidence that the speed you’re seeing comes close to the theoretical speed of your RAM.

                For another data point, both GNU yes and the fourth iteration of the code piped to pv produce a little shy of 5GB/s on my system running dual channel DDR4-2666.

                I can get similar speeds with dd reading from /dev/zero and piping to pv. A block size between 16k and 64k seems to be optimal; at 128k, the speed is reduced considerably. That sounds like an effect of moving to a higher level cache. Or worse utilization of SMP; maybe a single big copy will block the processes for a longer time.
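
                For anyone who wants to repeat the block-size experiment without dd, a minimal sketch of a writer with a configurable block size might look like this. `write_blocks` is a hypothetical helper, not taken from any of the yes implementations discussed here, and it writes zero bytes the way /dev/zero would supply them.

                ```c
                #include <errno.h>
                #include <stdlib.h>
                #include <unistd.h>

                /* Write `count` zero-filled blocks of `block_size` bytes to `fd`.
                   Partial writes are retried until the whole block is out.
                   Returns the total number of bytes written, or -1 on error. */
                long long write_blocks(int fd, size_t block_size, size_t count) {
                    char *buf = calloc(1, block_size);
                    if (buf == NULL)
                        return -1;

                    long long total = 0;
                    for (size_t i = 0; i < count; i++) {
                        size_t off = 0;
                        while (off < block_size) {
                            ssize_t n = write(fd, buf + off, block_size - off);
                            if (n < 0) {
                                if (errno == EINTR)
                                    continue;   /* interrupted, try again */
                                free(buf);
                                return -1;
                            }
                            off += (size_t)n;
                        }
                        total += (long long)block_size;
                    }
                    free(buf);
                    return total;
                }
                ```

                Calling it from a small main with STDOUT_FILENO, piping the result into pv, and varying the block size from 16k to 128k should show whether the same drop-off appears outside dd.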

                1. 3

                  From the HN discussion on why the design is so different from the BSDs’: it is deliberate, to avoid Unix copyright issues: https://www.gnu.org/prep/standards/standards.html#Reading-Non_002dFree-Code

                  1. 15

                    Except of course it wasn’t like this before 2015. Kind of strange they managed to make it all that time without copyright issues, and then suddenly it became a concern.

                    https://github.com/coreutils/coreutils/commit/35217221c211f3116f374f305654462195aa634a

                    1. 3

                      That same discussion links to the initial revision of the code, which does not give this explanation any credibility at all, as far as I am concerned. The code is short and to-the-point and there is no hint of deliberately odd design.

                      https://github.com/coreutils/coreutils/blob/ccbd1d7dc5189f4637468a8136f672e60ee0e531/src/yes.c

                      I can’t be bothered to follow the version history but you’ll probably see cruft accumulated for a variety of reasons, none of which sound like “let’s make it deliberately different to avoid copyright concerns.”

                    2. 1

                      Why does it have to be fast? I’m also kind of surprised that none of the implementations have a sleep in there, does it have to aim for 100% CPU usage? Would it be the worst thing in the world if yes only output 1000x ‘y’ per second and an application blocked for 1ms waiting for the 1001st one?

                      1. 5

                        Why does it have to be fast?

                        Maybe not, but why not? Perhaps someone needed or will need it to be fast for some particular application.

                        I’m also kind of surprised that none of the implementations have a sleep in there, does it have to aim for 100% CPU usage?

                        Why would you aim for less than maximum performance? What problem would you solve by deliberately making the program slower with sleep calls?

                        For most programs, there is no good reason to attempt to control CPU usage from within. As long as there is work to do, do it. That’s why the program was executed, so do the job. When you’re done, you either exit, or you block on I/O (including nonblocking I/O, using libevent or equivalent possibly with a timeout) waiting for the next batch of work. Most tools fit in this form, and it is the OS scheduler’s responsibility to manage the program’s CPU utilization beyond that.
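
                        As a rough illustration of that shape, here is a sketch of a yes-like loop in C with no sleep anywhere. The helper names `fill_lines` and `pump` are invented for this example and are not from any real yes implementation. The loop simply writes; when the reader is slow, write() blocks and the scheduler takes over.

                        ```c
                        #include <errno.h>
                        #include <string.h>
                        #include <unistd.h>

                        /* Fill `buf` with as many whole copies of `line` as fit in `cap`
                           bytes. Returns the number of bytes used. */
                        size_t fill_lines(char *buf, size_t cap, const char *line) {
                            size_t len = strlen(line), used = 0;
                            if (len == 0)
                                return 0;
                            while (used + len <= cap) {
                                memcpy(buf + used, line, len);
                                used += len;
                            }
                            return used;
                        }

                        /* Write `buf` over and over until write() fails (for example
                           with EPIPE once the reader goes away). There is no sleep: if
                           the pipe is full, the write() call itself blocks.
                           Returns the total number of bytes written before the failure. */
                        unsigned long long pump(int fd, const char *buf, size_t len) {
                            unsigned long long total = 0;
                            if (len == 0)
                                return 0;
                            for (;;) {
                                size_t off = 0;
                                while (off < len) {
                                    ssize_t n = write(fd, buf + off, len - off);
                                    if (n < 0) {
                                        if (errno == EINTR)
                                            continue;   /* interrupted, try again */
                                        return total;
                                    }
                                    off += (size_t)n;
                                    total += (unsigned long long)n;
                                }
                            }
                        }
                        ```

                        A caller would fill a buffer once with "y\n" and hand it to pump on standard output; the process then uses CPU only while the consumer keeps up, and sits blocked in the kernel otherwise.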

                        There are some counterexamples of course. In animated graphics in particular, you could spend all the resources updating frames that will never be shown. You may block on vsync but it’s not always available, and it’s still unfortunately not very controllable at all.

                      2. 1