1. 9
  1.  

  2. 9

    “Without a single branch” is a stretch… There are no explicit conditionals, but map and fold are certainly branching under the hood for iteration.

    Anyway, here’s a “no conditional” version in Common Lisp. Took about 30 seconds to write:

    (defun wc (file-name)
      (with-open-file (stream file-name)
        (loop 
          for line-count from 1
          for line = (read-line stream nil)
          while line
          for words = (cl-ppcre:split "\\s+" line)
          summing (reduce #'+ words :key #'length) into total-chars
          summing (length words) into word-count
          finally (return (values line-count word-count total-chars)))))
    
    1. 1

      Not that it matters, but does the total characters include the whitespace in your implementation?

      1. 1

        It looks like a bug to me. Would think you’d want summing (length line) into total-chars

        1. 1

          Good catch… A couple other mistakes were that the line count was off by 1, and the total character count was missing all white space.

          Here’s a version that allocates less and runs a little faster:

          (defun wc-new (file-name)
            (with-open-file (stream file-name)
              (loop 
                for line-count from 0
                for line = (read-line stream nil)
                with word-count = 0
                while line
                do (cl-ppcre:do-matches (word-start word-end "\\w+" line)
                  (incf word-count))
                 summing (1+ (length line)) into total-chars
                 finally (return (values line-count word-count total-chars)))))
          
          1. 5

            This version runs quite slowly for me (probably due to the use of regular expressions and explicit splitting). It also doesn’t seem to count words correctly (I get much higher counts than with wc).

            Here’s a Lisp version (entire program included) that runs in about ~1.5x the time of the coreutils version on average (testing with a ~400MB file that is basically a lot of copies of the GPL) and appears to be correct:

            (defpackage wc
              (:use #:cl #:iterate)
              (:export :main))
            (in-package :wc)
            (declaim (optimize (speed 3) (safety 0)))
            
            (defun main ()
              (let ((filename (cadr sb-ext:*posix-argv*))
                    (space (char-code #\Space))
                    (newline (char-code #\Newline)))
                (with-open-file (file-stream filename :element-type '(unsigned-byte 8))
                  (iter
                    (for byte in-stream file-stream using #'read-byte)
                    (for previous-byte previous byte)
                    (for is-newline = (eql newline byte))
            
                    ;; Count each byte
                    (sum 1 into bytes)
            
                    ;; Count every newline
                    (counting is-newline into newlines)
            
                    ;; Count every "word", unless the preceding character already
                    ;; was a space.
                    (when (or (eql space previous-byte)
                              (eql newline previous-byte)
                              (not previous-byte))
                      (next-iteration))
            
                    (counting (or is-newline (eql space byte))
                              into words)
            
                    (declare (fixnum bytes newlines words))
                    (finally (format t "  ~A ~A ~A ~A~%" newlines words bytes filename))))))
            
            

            If you have the Nix package manager installed, you can try this version by running this command (note that this will pull in SBCL):

            nix-build -E '(import (builtins.fetchGit "https://git.tazj.in") {}).fun.wcl'
            
            1. 1

              Well that’s embaressing, I hadn’t thought to compare against wc from coreutils! Using “[^\\s]+” instead of “\\w+” should fix it. The regex splitting was an easy way to avoid explicit branching, but definitely has a performance penalty.

    2. 7

      This is so pointless. Here is wc in awk with 52 characters without a single branch. Took about 10 seconds to write.

      {b+=length($0)+1;w+=NF;++l}END{print l,w,b,FILENAME}
      
      1. 1

        Here’s one in Perl (it doesn’t print the filename, though):

        $l++;$w+=@F;$b+=length}{print "$l $w $b\n"
        

        Explanation:

        • @F in scalar context is the number of fields (NF)
        • length without arguments is the length of current line
        • }{ is the END

        Example:

        echo "hello world" | perl -ane '$l++;$w+=@F;$b+=length}{print "$l $w $b\n"'
        
      2. 3

        I’m fascinated by the branch-less programming style. It is close to array programming (think of APL or Numpy) but the idea of ranges here is stream data and never have everything in memory at once. It is a mindset change to not think about single elements but think about operations on the array/stream as a whole.

        I see a tension with object-oriented programming. In OO you define specific classes/objects with methods specific to your use. In array programming you combine generic operations on generic data structures. Sometimes these combinations get a specific name (creating an abstraction) but often they are left raw.

        Output o = f
        	.byLine(Yes.keepTerminator)
        	.map!(l => toLine(l))
        	.fold!(combine)(Output(0, 0, 0));
        

        An object-oriented approach could be:

        counter = new CountAccumulator();
        for (line in new LineStream(f)) {
             counter.addLine(line);
        }
        

        I’m not sure which one is better.

        1. 2

          Fun! D is one of those languages I’ve had on my todo-list for a while. People seem to have mixed experiences with it, but I’ve heard a lot of good things about the standard library and that’s worth investigating :)

          1. 2

            The same can be done e.g. in Java. But I like D too, it is a pleasant language. Thanks for the post.

            1. 2

              Also see cw (Rust), and a bunch of similar articles about wc re-implemented in: