1. 8

  2. 4

    I wonder how much of the overhead is converting from UTF-8 on disk to UTF-16 in memory. I was thinking that the new Java 9 feature allowing 8-bit string storage could potentially speed this up, but upon further inspection it appears that feature uses ISO-8859-1 (Latin-1) internally rather than UTF-8, so you can’t just take the bytes from disk and use them directly. Needless to say, I was surprised to see new code being written around ISO-8859-1: http://openjdk.java.net/jeps/254
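
    For what it’s worth, a quick way to see where that conversion cost lives is to compare scanning the raw bytes with actually materializing a String. A minimal Java sketch (the file name is just a placeholder):

    ```java
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DecodeCost {
        public static void main(String[] args) throws Exception {
            // Hypothetical input file; any large UTF-8 text will do.
            byte[] raw = Files.readAllBytes(Paths.get("data.json"));

            // Scanning raw bytes: no character decoding at all.
            int newlines = 0;
            for (byte b : raw) {
                if (b == '\n') newlines++;
            }

            // Decoding to a String is where the UTF-8 -> UTF-16 work happens
            // (on Java 9+, Latin-1-only content is stored compactly per JEP 254).
            String text = new String(raw, StandardCharsets.UTF_8);

            System.out.println(newlines + " newlines, " + text.length() + " chars");
        }
    }
    ```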

    1. 2

      I don’t know much about Clojure, but this behavior shouldn’t be surprising. Reading text input requires converting incoming binary data to text.

      In Common Lisp or C++ I’d probably get around this by using binary IO to read large blocks of data into memory without transforming it, and then giving the JSON decoder a memory stream with the appropriate codecs/format settings to read the data.
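
      On the JVM, a rough analogue of that approach might look like the sketch below: read raw bytes in large blocks, then hand the decoder an in-memory stream with an explicit charset (the file name and the decoder call are placeholders):

      ```java
      import java.io.*;
      import java.nio.charset.StandardCharsets;

      public class BinaryThenDecode {
          public static void main(String[] args) throws Exception {
              // Read the file in large binary blocks, with no text transformation.
              ByteArrayOutputStream bytes = new ByteArrayOutputStream();
              try (InputStream in = new FileInputStream("data.json")) { // placeholder path
                  byte[] block = new byte[1 << 20]; // 1 MiB blocks
                  int n;
                  while ((n = in.read(block)) != -1) {
                      bytes.write(block, 0, n);
                  }
              }

              // Hand the JSON decoder an in-memory stream with an explicit charset,
              // so decoding happens once, where the parser actually needs characters.
              Reader forParser = new InputStreamReader(
                      new ByteArrayInputStream(bytes.toByteArray()), StandardCharsets.UTF_8);
              // jsonDecoder.decode(forParser); // hypothetical decoder call
          }
      }
      ```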

      On a side note, not a single image on that page loads for me (Chromium on Linux). If I right click on the placeholder and select “open in new tab” they open in weird sizes.

      1. 1

        Medium does a bunch of tricks with images at various sizes that get overwritten with javascript as the page loads.

      2. 2

        tl;dr - allocation is expensive and clojure magic can get really expensive when applied on a byte-by-byte basis.

        I really got nerd sniped on this one. So the problem is that the author, intentionally or unintentionally, was benchmarking a bunch of wildly different things.

        wc -l counts lines. All it needs to do (assuming it’s not using mmap and doesn’t care about encoding) is allocate a single buffer and re-use it; after every read it walks the buffer and counts occurrences of a specific byte.
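
        In JVM terms, a minimal sketch of that loop (not what wc actually does, just the shape of it):

        ```java
        import java.io.FileInputStream;
        import java.io.InputStream;

        public class CountLines {
            public static void main(String[] args) throws Exception {
                long count = 0;
                byte[] buf = new byte[1 << 16]; // one 64 KiB buffer, reused for every read
                try (InputStream in = new FileInputStream(args[0])) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        // Walk the buffer and count occurrences of the newline byte.
                        for (int i = 0; i < n; i++) {
                            if (buf[i] == '\n') count++;
                        }
                    }
                }
                System.out.println(count);
            }
        }
        ```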

        • Clojure’s line-seq likely allocates a String per line and decodes the UTF-8 bytes into Java’s internal string representation.
        • Clojure byte-by-byte is likely slow due to syscalls and the reflection Clojure falls back to.
        • Java’s .readLine() likewise allocates a String per line and decodes the UTF-8 input.

        I actually found it really difficult to write a fast Clojure version that looked at every byte using generic functions (filter, reduce, etc.) on a byte[]. Clojure seems to end up doing reflection on every single boxed byte of the file. However, Clojure is quite capable of using a BufferedInputStream to do reasonably fast binary I/O, as long as you don’t inspect the contents of the buffer.
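
        As a rough analogue of that per-byte cost in plain Java (leaving Clojure’s reflection and boxing out of it), compare one read() call per byte against bulk reads into a reusable array; this sketch only illustrates the shape of the difference:

        ```java
        import java.io.BufferedInputStream;
        import java.io.FileInputStream;
        import java.io.InputStream;

        public class PerByteVsBulk {
            // One method call per byte. BufferedInputStream amortizes the underlying
            // syscalls, but the per-byte call overhead remains (and in Clojure each
            // byte additionally goes through boxing and reflection).
            static long countPerByte(String path) throws Exception {
                long count = 0;
                try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
                    int b;
                    while ((b = in.read()) != -1) {
                        if (b == '\n') count++;
                    }
                }
                return count;
            }

            // Bulk reads into a reusable byte[]: one call per block, primitive loop inside.
            static long countBulk(String path) throws Exception {
                long count = 0;
                byte[] buf = new byte[1 << 16];
                try (InputStream in = new FileInputStream(path)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        for (int i = 0; i < n; i++) {
                            if (buf[i] == '\n') count++;
                        }
                    }
                }
                return count;
            }
        }
        ```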

        Jumping away from Java/Clojure, it looks like some languages, like Go, are able to use special CPU instructions to further accelerate searching for a byte: https://golang.org/src/runtime/asm_amd64.s#L2031

        The author’s final solution involves preprocessing the data and counting in parallel. I haven’t tested this, but I would expect to get a modest speedup on one input file by maintaining a small pool of buffers and passing each buffer to a counter worker thread.
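
        A hedged sketch of that idea in Java (not the author’s solution): the reading thread hands filled buffers to a fixed worker pool and the per-buffer newline counts are summed at the end. For simplicity this allocates a fresh buffer per block; recycling buffers from a small bounded pool is the part that would avoid allocation.

        ```java
        import java.io.FileInputStream;
        import java.io.InputStream;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        public class ParallelCount {
            public static void main(String[] args) throws Exception {
                ExecutorService pool =
                        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
                List<Future<Long>> results = new ArrayList<>();

                try (InputStream in = new FileInputStream(args[0])) {
                    while (true) {
                        byte[] buf = new byte[1 << 20]; // fresh 1 MiB block (a real pool would recycle)
                        int n = in.read(buf);
                        if (n == -1) break;
                        final int len = n;
                        // Each worker counts newlines in its own block; nothing else is shared.
                        results.add(pool.submit(() -> {
                            long count = 0;
                            for (int i = 0; i < len; i++) {
                                if (buf[i] == '\n') count++;
                            }
                            return count;
                        }));
                    }
                }

                long total = 0;
                for (Future<Long> f : results) total += f.get();
                pool.shutdown();
                System.out.println(total);
            }
        }
        ```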

        I nerd sniped myself while being nerd sniped. On my laptop this multithreaded go program: https://play.golang.org/p/Ij4gO5L5nd beats wc -l. I can process 4.6GB/s using this pretty simple go program vs 3.5GB/s using wc -l.

        1. 2

          String matching is ridiculously optimized and benchmarked. See this list of 82 string matching algorithms or this article on strstr with SSE 4.2.

          1. 2

            “All it needs to do (assuming it’s not using mmap and doesn’t care about encoding) is allocate a single buffer and re-use it; after every read it walks the buffer and counts occurrences of a specific byte.”

            My takeaway from reading this is that “assuming it doesn’t care about encoding” is almost always wrong for “standard library” type tools like this, but there’s a significant price to pay for being on a runtime that bakes assumptions like encoding into nearly everything to help you avoid making mistakes like this.

            Of course, it’s good that you have a way out for when you are specifically willing to give up the guarantees of correct encoding (say, a few false positives from multi-byte codepoints that contain 0xa aren’t going to hurt), but if that decision is masked from you in the name of speed, you’re eventually going to be in a world of hurt trying to find out why.

            1. 1

              I like paying for abstractions when I want them. If you’re dealing with strings they should be unicode strings (ideally encoded using UTF-8), and your string functions shouldn’t operate on a byte/uint8. If you’re dealing with bytes they should be bytes, especially because that’s what’s on the filesystem and that’s what comes from the network.

              Now I’m curious how fast a Unicode-aware implementation could run. I think making it multithreaded is significantly harder, since UTF-8 is variable-width: you can’t just read 1 MB and pass that off to a thread without risking splitting a codepoint or a grapheme.
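
              For the codepoint half of that, UTF-8’s self-synchronization helps: continuation bytes always have the form 10xxxxxx, so a proposed chunk boundary can be nudged back to the start of a codepoint. Graphemes are the genuinely hard part and would need a real segmentation library. A small sketch of the boundary adjustment (names are mine, not from the article):

              ```java
              public class Utf8Chunks {
                  // Move a proposed chunk end back onto a UTF-8 code point boundary.
                  static int alignToCodePoint(byte[] data, int proposedEnd) {
                      if (proposedEnd >= data.length) return data.length;
                      int end = proposedEnd;
                      // Continuation bytes look like 10xxxxxx; back up until the
                      // byte at the boundary starts a code point.
                      while (end > 0 && (data[end] & 0xC0) == 0x80) {
                          end--;
                      }
                      return end;
                  }
              }
              ```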

              1. 2

                “If you’re dealing with bytes they should be bytes, especially because that’s what’s on the filesystem”

                Sure, but it seems like the wrong default for wc, which is specifically about words. Words are a construct of language, not something you can determine by looking at the underlying numbers of the representation. For instance, wc treats “word count” as a single word, because it contains a non-breaking space. (edit: somehow the comment submission replaced my non-breaking space with a normal one, but you get the idea.)
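
                Java draws the same distinction, which makes this easy to demonstrate. A small sketch contrasting an ASCII-ish notion of whitespace with the Unicode one:

                ```java
                import java.util.regex.Pattern;

                public class NbspWords {
                    public static void main(String[] args) {
                        String s = "word\u00A0count"; // no-break space between the two words

                        // Default \s is ASCII-ish: NBSP is not a separator, so this is one "word".
                        System.out.println(s.split("\\s+").length); // 1

                        // Unicode White_Space includes U+00A0, so this finds two words.
                        Pattern ws = Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);
                        System.out.println(ws.split(s).length); // 2

                        // The JDK exposes both notions directly, too:
                        System.out.println(Character.isWhitespace('\u00A0')); // false
                        System.out.println(Character.isSpaceChar('\u00A0'));  // true
                    }
                }
                ```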

                Arguably wc is just a lost cause, partly because it comes from a simpler time, but also because the idea that it’s always possible to determine word boundaries is just hopelessly naive. Determining the boundary of a word in Thai is still an open area of research in computational linguistics: http://www.thai-language.com/resources/slayden-luqman-2010.pdf

          2. 1

            Enjoyed this quick take! The pmap solution seemed like the quickest win to me, but I would be interested to see a solution that automated the splitting as well.