1. 34

  2. 10

    I wish this explained what tac was at the start, but it seems it’s https://man7.org/linux/man-pages/man1/tac.1.html (cat of reversed lines).

    1. 7

      Sorry! There’s this at the end of the first paragraph but maybe it’s not as helpful as I hoped it would be?

      For those that aren’t familiar with it, tac is a command-line utility to reverse the contents of a file

      Happy to revise it with any better suggestions. Maybe I should add a full paragraph to explain what tac is and why you might use it - I’ve been told my posts are too long in the past :(

      1. 6

        Perhaps an older/wrong version was published, because that line isn’t on the page.

        I think that sentence would do just fine.

        1. 8

          My apologies! It turns out my nginx caching configuration isn’t correctly purging when a page is updated. It should be good now - thanks for cluing me in to what was going on.

    2. 4

      You may want to specify in which processor you measured this, because I tried this (Skylake) on a 1 GB file with and without the bmi2,lzcnt annotations and there is barely any measurable difference.

      The major performance issue here, rather than lzcnt/bsr which perform pretty much the same, and bzhi which can be emulated at similar cost, seems to be branch misprediction from the matches != 0 comparisons. Unrolling the loop twice, reading one cache line per iteration, and creating a 64-bit mask which you can treat the same way as now, would be an easy way to cut those bad branches in half.

      1. 4

        Without a doubt the biggest improvement came from consuming the input 256-bytes at a time via the AVX2 instructions and not from the BZHI/LZCNT routines. Note that the slowdown wasn’t from the BZHI emulation (it wasn’t emulated) but rather the constant function calls and associated register setup/teardown to the function that actually carried out the BHZI call. I can’t remember the exact numbers I was getting without those two decorations, but hyperfine always reported the native CPU build to be faster than the default x64 build without those decorations. That particular work was tested on a Xeon E3-1545M v5 (so also Skylake).

        The core loop currently processes 32 bytes at a time (the AVX2 256-bit word size), or half the likely cache line size. Your suggestion to make it handle 64 bytes per go around via two separate sets of AVX2 operations seems sound. BZHI can operate on a 64-bit input directly, so that could actually be reused directly without needing to manually shuffle the mask bits around. Thanks!

        Edit: I took your suggestion and saw around a 14% improvement for long input files with short lines as compared to the unrolled (32-byte inner loop) version. Seems like a solid win! I’ll play around with more unlooping scenarios but I think the impact will depend on line length.

      2. 4

        Nice! I really want to port this to arm64 (neon) now…

        1. 3

          I was hoping no one would bring up the lack of NEON support, lol. You should go for it, it’s a fun challenge!

        2. 3

          Can this tac handle multiple files? I ask, because I still giggle sometimes at the last cat re-implementation I saw, which had the caveat that it only concatenated one file.

          1. 1

            Yup, they’re handled in the same order they’re named. And good question!

          2. -5

            Assuming I know what tac is. Assuming I have a problem of grepping through log files that are huge(!) in reverse order. Assuming the task is time critical. Assuming that the performance is limited by the tool and not IO.

            Finally we arrived where a simd tool would be good. Too bad you lost me 3 assumptions sooner.

            1. 10

              It’s okay to simply not comment if an article isn’t relevant to you. The author made this tool for themselves, evidently satisfying all 3 of these assumptions, and was kind enough to share it. Moreover, even someone without a need for any version of tac would still find this article useful if they wanted runtime SIMD detection in Rust.

              1. 5

                This is rude and dismissive. Please do not post comments like this in future.