1. 59

  2. 3

    This process can be expressed as a couple of binary operations, by first mapping the input as whitespace (0) and non-whitespace (1). Shifting that binary value by one digit and inverting it produces a mask that can be used to find word boundaries, which computers are great at counting.

    Wait so, since you’re already tokenizing and parsing input why is it more costly to increment a counter?

    1. 2

      In simd.h

      static inline int count_words(simd_vector vec, wcount_state *state)
      	simd_imask ws = simd_imask_from_mask(simd_cmpws_i8_mask(vec)),
      	           first_chars = ~ws & ((ws << 1) + *state);
      	*state = ws >> (sizeof(simd_vector) - 1);
      	return simd_imask_popcnt(first_chars);

      You can find what simd_imask_popcnt actually is in that file. POPCNT, especially the simd versions, are much faster than looping and incrementing a counter.

      1. 1

        In fact GCC an LLVM even try to find code implementations of POPCNT to swap it for for one instruction.

    2. 2

      Further, the behaviour of iswspace() from the C standard library might surprise some, as it does not report non-breaking spaces as spaces, unlike many similar functions in other languages, or what intuition might perhaps suggest.

      IIRC iswspace() is supposed to vary its behavior according to active locale, and in the default locale the set of whitespace is just the ASCII whitespace characters. So even though U+00A0 is whitespace according to Unicode properties, iswspace() by default doesn’t know that.

      1. 1

        It’s funny everyone is optimizing wc; I see two posts on wc today one beating wc with Go and now this

        1. 12

          It started with Beating C with 80 lines of Haskell, which kicked off the fad. I particularly like this one because everybody’s been comparing optimized $LANG to unoptimized C. This one is about what happens when you try to optimize the C.

          1. 1

            Yes. it seems the monothread optimized C version is roughly 20 x faster than the unoptimized C version. And roughly 60 x faster using the multihtreaded C optimized version .

            If I have read correctly. I just skimmed the article rapidly. But you know… With undefined behaviours, there can be some really difficult mistakes from time to time ;-)