1. 62
  1. 19

    Looks like it hit 56 GB/s when run on the AMD 5950X. The author of the code said he spent months working on it. Incredible dedication.

    1. 5

      Yeah but can they invert a binary tree on a whiteboard? If not this person doesn’t stand a chance of getting hired anywhere /s

      1. 1

        It’s a shame, as the inverted-binary-tree whiteboard task is well known, so less-skilled devs can simply practice it ahead of time. I can’t imagine being asked to write FizzBuzz that avoids the heap and is free to use Linux API calls that crash the app if its output isn’t piped into another application. The author of the answer potentially found a kernel bug while working on this. Even after seeing it a few days ago, I’m still seriously impressed with this person’s work.

    2. 6

      Comparison timings of dd if=/dev/zero bs=X | pv -v > /dev/null on my system (Ryzen 5 3600):

      bs       rate
      1        2.4 MiB/s
      16       37 MiB/s
      256      565 MiB/s
      4096     5.6 GiB/s
      65536    7.8 GiB/s
      524288   3.8 GiB/s
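
      Those numbers are mostly syscall overhead: each dd block costs one read() plus one write(), so tiny bs values drown in per-call cost. Here is a rough sketch of mine that isolates just the write() side (it writes to /dev/null, which moves no data, so absolute numbers will differ from the pv pipeline above; the pipeline's extra copies and cache effects are also why its throughput peaks and then falls off at bs=524288):

      #include <fcntl.h>
      #include <stdio.h>
      #include <time.h>
      #include <unistd.h>

      int main(void)
      {
          static char buf[524288];            /* zero-filled, like /dev/zero */
          const size_t sizes[] = { 1, 16, 256, 4096, 65536, 524288 };
          const long calls = 1 << 20;         /* same syscall count per size */
          int fd = open("/dev/null", O_WRONLY);
          for (size_t i = 0; i < sizeof sizes / sizeof *sizes; i++) {
              struct timespec t0, t1;
              clock_gettime(CLOCK_MONOTONIC, &t0);
              for (long k = 0; k < calls; k++)
                  (void)write(fd, buf, sizes[i]);
              clock_gettime(CLOCK_MONOTONIC, &t1);
              double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
              printf("bs=%-7zu %10.1f MiB/s\n",
                     sizes[i], sizes[i] * (double)calls / s / (1 << 20));
          }
          close(fd);
          return 0;
      }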
      
      1. 2

        The FizzBuzz code avoided the heap, kept everything in L2 cache, and both minimised syscalls and picked seldom-used ones, which improved performance at the cost that the output had to be piped into another application.

        It’s worth running perf and strace on your dd command above. I suspect it misses out on all three of those optimisations.
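
        If the seldom-used syscall is vmsplice(2), as I understand the answer's approach to be, here is a minimal sketch of mine (not the author's code) showing the pipe-only behaviour: vmsplice maps user pages into a pipe instead of copying them, and it fails outright when stdout is not a pipe, which is exactly the "must be piped into another application" restriction.

        #define _GNU_SOURCE
        #include <fcntl.h>      /* vmsplice() */
        #include <stdio.h>
        #include <string.h>
        #include <sys/uio.h>    /* struct iovec */
        #include <unistd.h>     /* STDOUT_FILENO */

        int main(void)
        {
            static char buf[65536];            /* static buffer: no heap */
            memset(buf, 'x', sizeof buf - 1);
            buf[sizeof buf - 1] = '\n';
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
            /* a real writer would loop until the full iovec is consumed */
            if (vmsplice(STDOUT_FILENO, &iov, 1, 0) < 0) {
                perror("vmsplice");            /* fails on a tty or file; */
                return 1;                      /* run as: ./a.out | cat   */
            }
            return 0;
        }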

      2. 5

        I believe the top-voted post is from the Benchmarks Game guy.

        Regarding the tens-of-GiB/s one… I wonder if it could do better on ARM? Or maybe even as its very own monotasking OS? ;) But for now, I think I’ll just bask in the glory of reading that amazing answer.

        1. 3

          I’ve printed his entry and there are some neat tricks I learned, but ‘stage 3’, where, according to his comments, a bytecode interpreter is used, is still beyond my understanding. I’m also not sure why this helps; a bytecode interpreter sounds like a lot of overhead to me…

          1. 3

            The overhead of interpreting the bytecode is offset by the bytecode instructions taking up very little memory and by the interpreter using SIMD instructions to do the calculations (see the sketch after the quoted comments below).

            Some comments that highlight the reasoning for this:

            32 bytes of bytecode can be loaded from memory, interpreted, and have its output stored back into memory using just four machine instructions

            the same CPU instruction [is used] to output digits from multiple different line numbers, because the bytecode is being interpreted in a SIMD way

            Using bytecode also allows you to specialize / unroll individual parts of the total executed code, which you can see in the thirty_bytecode_lines subroutine.
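
            Here is a toy sketch of that idea (my own construction with an invented encoding, not the author's actual bytecode; assumes AVX2, compile with -mavx2). Bytes 0–9 mean “emit digit k of the current line number”, anything else is a literal output byte, and a 32-byte block is interpreted with a handful of branch-free vector instructions (the real answer gets it down to four):

            /* toy AVX2 "bytecode" interpreter: bytes 0..9 are digit
             * references, bytes >= 10 are literal output characters */
            #include <immintrin.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <string.h>

            static void interpret32(const uint8_t *prog, const char digits[16],
                                    uint8_t *out)
            {
                __m256i code = _mm256_loadu_si256((const __m256i *)prog);
                /* line-number digits replicated into both 128-bit lanes,
                 * since vpshufb indexes within a lane; in a real hot loop
                 * this and the threshold constant would be hoisted out */
                __m256i dig = _mm256_broadcastsi128_si256(
                    _mm_loadu_si128((const __m128i *)digits));
                __m256i picked = _mm256_shuffle_epi8(dig, code);    /* digits[code] */
                __m256i isref  = _mm256_cmpgt_epi8(_mm256_set1_epi8(10), code);
                __m256i result = _mm256_blendv_epi8(code, picked, isref);
                _mm256_storeu_si256((__m256i *)out, result);
            }

            int main(void)
            {
                char digits[16] = "123456";         /* current "line number" */
                uint8_t prog[32], out[32];
                memset(prog, ' ', sizeof prog);
                for (int i = 0; i < 6; i++) prog[i] = (uint8_t)i; /* digit refs */
                memcpy(prog + 6, "\nFizz\n", 6);                  /* literals   */
                interpret32(prog, digits, out);
                fwrite(out, 1, sizeof out, stdout); /* "123456\nFizz\n" + padding */
                putchar('\n');
                return 0;
            }

            Because the same shuffle/blend sequence runs regardless of what the 32 bytes contain, there is no per-instruction dispatch or branching, which is what keeps the interpretation overhead low.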

          2. 1

            Why did nobody use a GPU? :D

            1. 1

              It’s not a parallelizable task.

              1. 1

                It says otherwise in the original post:

                Parallel implementations are allowed (and encouraged).

                It’s parallelizable. I could tell you the output for any number with no dependencies. Compare that to the Fibonacci series, where every item depends on the one before it.

                Of course, one could argue that it’s only allowed to produce one number at a time, but you could impose the same restriction when adding two arrays. It’s not inherent to the problem.
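
                To make that concrete, here is a minimal sketch of mine (not from the post): the text for line n is a pure function of n, so disjoint ranges can be rendered into private buffers by different threads or GPU blocks and concatenated in range order afterwards.

                #include <stdio.h>

                /* the output for line n depends only on n -- no carried state */
                static int fizzbuzz_line(long n, char *buf, size_t cap)
                {
                    if (n % 15 == 0) return snprintf(buf, cap, "FizzBuzz\n");
                    if (n % 3  == 0) return snprintf(buf, cap, "Fizz\n");
                    if (n % 5  == 0) return snprintf(buf, cap, "Buzz\n");
                    return snprintf(buf, cap, "%ld\n", n);
                }

                /* workers share nothing, so ranges can run concurrently */
                static size_t render_range(long lo, long hi, char *buf, size_t cap)
                {
                    size_t used = 0;
                    for (long n = lo; n < hi; n++)
                        used += (size_t)fizzbuzz_line(n, buf + used, cap - used);
                    return used;
                }

                int main(void)
                {
                    char a[256], b[256];      /* two "workers", run serially here */
                    size_t na = render_range(1,  8, a, sizeof a);
                    size_t nb = render_range(8, 16, b, sizeof b);
                    fwrite(a, 1, na, stdout); /* emit in range order */
                    fwrite(b, 1, nb, stdout);
                    return 0;
                }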

                1. 3

                  Fair point, and definitely worth exploring. My bad. However, I’d wager the main bottleneck still ends up being I/O; I’m not sure speeding up the generation of output would improve the I/O figures.

              2. 1

                Scores are from running on my desktop with an AMD 5950x CPU (16C / 32T). I have 32GB of 3600Mhz RAM.

                Might just be that the scorer isn’t set up to test GPU solutions.

                1. 1

                  I wonder if that might bottleneck on transferring the results off the card? PCIe 5.0 x16 is good for up to about 63 GB/s according to Wikipedia.
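
                  That figure checks out: PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding, so 16 lanes × 32 GT/s × (128/130) ÷ 8 bits/byte ≈ 63 GB/s, before protocol overhead.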

                  1. 1

                    The author stated the L2 cache write speed was the ultimate bottleneck.

                    1. 1

                      Specifically in this thread we’re talking about hypothetical GPU implementations, not the author’s actual very fast CPU implementation.