1. 11

    Well, it’s actually more borrowing than stealing, and Rust is pretty much into that. :-)

    1. 2

      One thing that is curiously absent from the discussion is dependency management. This may seem trivial but is full of interesting edge cases, e.g. multiple versions of a dependency, to which current state of the art has not yet appeared to find an optimal solution.

      1. 1

        Those are discussed in this comment:

        https://lobste.rs/s/cd5lk4/low_hanging_fruit_programming_language#c_8y9bpu - 3rd bullet point.

        Or am I misunderstanding you?

      1. 2

        I wrote about this and more on my blog

        1. 1

          One optimization the author doesn’t address that can be done in plain rust is to get rid of the branches in the loop (there are two: One for the bounds check because of indexing, the second for the if). In fact, I introduced a small optimization to iterators that do the trick (in nightly and possibly beta), so you can simply write:

          bwt[(i * self.k) + 1 .. r + 1].iter().filter(|x| *x == a).count()
          

          This gets inlined into roughly:

          let mut count = 0;
          for x in &bwt[(i * self.k) + 1 .. r + 1] { // look ma, no branches!
              count += (*x == a) as usize; // boolean false is 0 and true is 1
          }
          count
          

          This is the naive way, and also the fastest for small slices. If you want to go even faster, the bytecount crate has the currently fastest implementation, and is used by both ripgrep and xi. If you are on nightly, you can even enable SSE / AVX for best-in-class performance.

          1. 2

            That’s not actually removing the branch in the loop condition, and I didn’t write about removing the inner loop branch because I thought the compiler was able to do that already. I retested that and apparently not!

            Saying it’s branchless because “boolean false is 0 and true is 1” is not really telling the full story, in particular there are some architectures where that would still be implemented with a branch. (e.g. MIPS I’m pretty sure)

            The trick is that CMP on x86 is like a SUB that discards its result, and SUB sets a number of flags depending on the result, including the Zero Flag which gets set to 1 when the result is 0 and 0 when the result is 1. So if *x == a, *x - a == 0 and ZF gets set to 1, and then you can you use SETZ to read that out and add it to count.

            1. 1

              OK, I was overgeneralizing – it appeared to remove the branch when I was doing the Rust PR (and yes, I checked the assembly) on x86_64 (I didn’t have other platforms to test this at the time).

              Also, I didn’t mean the branch in the loop condition – that one branch is generally unavoidable. I meant the out-of-bounds check that is hoisted out of the loop by iterating over a slice.

          1. 9

            Kind of a bummer that it goes directly to C++. I was hoping to see how to finish the job in Rust.

            1. 3

              TBH it would look almost exactly the same. The intrinsics map 1-1 with assembly instructions and the slow loops are nothing fancy, so it would be identical code but with Rust keywords instead of C keywords.

              1. 1

                Actually it would look like bytecount.

                1. 2
                  sed s/\./:/
                  
                  1. 2

                    Thanks, fixed the link.