1. 70
  1. 28

    Beating decades of optimized C with decades of optimized Haskell.

    1. 13

      On my system, and for comparison:

      $ time wc w.txt
       3156098 6312196 380648004 w.txt
      
      real 0m1.358s
      user 0m1.277s
      sys 0m0.066s
      

      A solution in q is about twice as fast (783 ms):

      q)\t g:{(sum[1<deltas where x]+sum[not x:max x],0),sum max x 0 1}0x0d0a2009=\:read1 `:w.txt
      783
      q)g
      380648004i
      6312196
      3156098i
      

      Now that’s cool, but what’s really cool is that it’s not some trick: q is an interpreted language, and I didn’t have to do anything weird to get high performance. Wow.

      If you’re already stuck in Haskell, knowing the kinds of tricks in the article for getting acceptable performance is good; but if you’re looking to invest in a new language, those same tricks look a lot less appealing.

      1. 6

        Did you rerun wc to make sure it is not the fs cache in the kernel that makes it fast?

        1. 2

          These times are the best of three runs each.

        2. 7

          I don’t know q, but it seems that your code only looks for byte-level ASCII whitespace (0x20, 0x09) as word separators, which gives incorrect word counts on data that contains non-ASCII whitespace.

          Keep in mind in all comparisons to GNU wc that it does extra work, detecting and decoding multi-byte characters if present, to count words correctly. perf shows a significant amount of time being spent in multi-byte character handling. If you trigger a code path that does no decoding beyond the byte level, it’s much faster:

          $ time wc wiki-large.txt 
          854100  17794000 105322200 wiki-large.txt
          wc wiki-large.txt  0.42s user 0.02s system 99% cpu 0.438 total
          $ time wc -l wiki-large.txt
          854100 wiki-large.txt
          wc -l wiki-large.txt  0.02s user 0.02s system 98% cpu 0.034 total
          

          (wc -l looks at every byte, but does no decoding.)

          From a quick glance, this is also where the article fails. It’s comparing apples to oranges: the ByteString implementation does not do the same work as GNU/macOS wc and returns incorrect results in the presence of non-ASCII whitespace. The article incorrectly states that wc will handle input as ASCII. Unless you use a single-byte locale, macOS wc uses the combination of mbrtowc and iswspace.

      2. 25

        I’m not sure wc has had “decades of optimization” – lots of old Unix utils are plagued with silly issues like small hardcoded buffers – but this is still a good blog post.

        1. 20

          Plus it seems he used macOS’s version for comparison. I’d be interested in how GNU wc(1) holds up, since they tend to overengineer their coreutils a lot more.

        2. 22

          I really liked the part about the monoids- it cemented the power of the abstraction in a way I hadn’t seen before. In order to show that an algorithm was safely parallelizable, he just needed to find a monoid. That’s super cool.

          Not sure how much I believe his conclusion, though. He linked the wc code, and it doesn’t really look optimized at all.

          1. 3

            Yeah the monoid abstraction is one of the single most powerful (and simple!) abstractions that’s helped me even in languages outside of Haskell. You don’t need a fancy type system to make use of it either - if you can prove something satisfies the laws then you as the developer can use that knowledge to make engineering decisions.

          2. 12

            Right, it doesn’t.

            • as already mentioned, the C version is not particularly “optimized”, other than in the “ran through an optimizing compiler” sense
            • it’s still slower per core than the C version
            • it makes no sense for workloads that can’t be parallelized
            1. 11

              I’m not sure hogging all your cores is the best thing in tools that are usually applied in concurrent pipelines. But other than that, neato.

              1. 10

                This was a great read all the way through, thank you to the author! Yes, some of the claims are a bit exaggerated, but it seems pretty clear from the style of writing that it’s not to be taken too seriously.

                The things that stood out to me:

                • The fuzzy way of adding strictness annotations to get things to work well seems scarily familiar. It works, no question about that, but it would feel better to understand precisely when and where you should add them. I’m not sure if what’s missing is just another level of Haskell expertise, or whether compiler optimizations etc. are unpredictable enough that that’s what everyone is up against.
                • Great monoid use-case. I think the word-counting monoid itself would have been better explained without reference to the Flux abstraction; it feels easier to understand as a stand-alone thing. (Also, the Unknown constructor and its monoid action feel a little off: if it’s called “unknown”, I’d prefer Unknown <> x = Unknown; with the current implementation perhaps call it Empty? There’s a sketch of that renaming just after this list.)
                • There were a lot of exclamation marks!
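
                For what it’s worth, here’s a rough sketch of my own (not the article’s exact code, and the helper names are approximate) of that word-boundary monoid with the identity constructor renamed to Empty as suggested:

                data CharType = IsSpace | NotSpace

                data Flux
                  = Flux CharType Int CharType  -- leftmost char class, word count, rightmost char class
                  | Empty                       -- identity element: "no text seen yet"

                instance Semigroup Flux where
                  Empty <> x = x
                  x <> Empty = x
                  -- a word ending one chunk and starting the next is a single word
                  Flux l n NotSpace <> Flux NotSpace m r = Flux l (n + m - 1) r
                  Flux l n _        <> Flux _        m r = Flux l (n + m)     r

                instance Monoid Flux where
                  mempty = Empty

                -- classify a single character as a one-character chunk
                flux :: Char -> Flux
                flux c
                  | c `elem` " \t\r\n" = Flux IsSpace 0 IsSpace
                  | otherwise          = Flux NotSpace 1 NotSpace

                -- word count of a string: combine the per-character chunks
                wordCount :: String -> Int
                wordCount s = case foldMap flux s of
                  Empty      -> 0
                  Flux _ n _ -> n

                With the constructor named Empty, the identity behaviour (Empty <> x = x) matches the name, which was the nit above.
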
                1. 1

                  I like your comment about the random banging. There is talk in the Haskell community that maybe laziness isn’t the correct default, thus -XStrict and -XStrictData: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/glasgow_exts.html#extension-Strict

                  I personally hate random bang patterns and prefer strictness by default.
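
                  As a small illustration of what StrictData buys (a made-up record of mine, not the article’s type):

                  {-# LANGUAGE StrictData #-}

                  -- With StrictData every field below is strict, as if each Int were written !Int.
                  data Counts = Counts Int Int Int

                  -- Without the extension you would write the bangs by hand:
                  -- data Counts = Counts !Int !Int !Int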

                  1. 2

                    Idris went strict by default. It’s a really interesting language if you’re a haskeller.

                2. 7

                  Great post, keep them coming. Especially agree with the penultimate paragraph:

                  Haskell as a language isn’t perfect, but if I can get ball-park comparable performance to a C program while writing much higher-level fully type-checked code then I’ll call that a win any day.

                  1. 9

                    Sure, warming caches is nice for benchmarks. But who really runs wc on the same file even twice? Maybe it would make a lot more sense to optimize for the cold cache?

                    1. 5

                      This… doesn’t exactly seem like a great advertisement for Haskell. The much-vaunted cool parts about Haskell are more on display in the first attempt, which is agonizingly slow. It seems that getting decent performance requires overriding stuff like laziness that we were told makes Haskell great, using even more esoteric syntax.

                      What I’m saying is, I’ve had a vague feeling for a while that I ought to actually sit down and learn Haskell one of these days. This article makes me more inclined to think that I shouldn’t bother.

                      1. 4

                        Hmm, interesting! I wrote a clone of wc a while back using a parsing method that involves only a single branch, so it’s blazing fast (because there’s nothing for the branch predictor to stall on!). A friend measured it and it beats conventional GNU wc by a factor of 20. I wonder how it would perform against this.

                        1. 3

                          I didn’t understand any of the jargon related to that monad stuff, but I think I sort of got the idea - you have a structure which knows if the character before it was a space and if the character itself was a space, and given that info and a chunk of text, you can count words correctly. That’s pretty neat (though a function like struct counts countChunk(FILE *f, size_t chunklen, bool leftIsSpace) could accomplish the same job without all those monoids and semigroups and whatnot).

                          In practice, this Haskell wc utility doesn’t make much sense for a bunch of reasons (slower per core than C; wc is usually used in a pipeline, which means using more cores takes CPU time away from the other tools in the pipeline; GNU wc might still be faster; etc.), but I don’t think the idea behind this post was that we all should switch to the author’s re-implemented wc for our word counting needs.

                          My take-aways from this post are that you can get close to C in Haskell if you just know how the language and runtime work, and that the formal mental models and fearless concurrency mean it’s easier to take advantage of multiple cores for problems which may actually need it (even if it doesn’t make sense for wc).

                          It would be really interesting to test this against GNU wc on my machines, but the author doesn’t seem to have uploaded the complete source code anywhere, and I don’t know haskell nearly well enough to be able to reconstruct it from the snippets in the post.

                          1. 5

                            The word is “monoid”: wikipedia.

                            1. -10

                              Sure, I don’t particularly care about category theory bullshit. A monad is just a monoid in the category of endofunctors anyways.

                            2. 3

                              though a struct could accomplish the same job

                              That struct is a monoid whether or not you know what that means. But as the post relates, if you do know what it means, you’re more likely to realize you need that struct.

                              1. 1

                                It’s just a function which returns a struct counts. You may be right that it’s a monoid, but could you explain why? I just see it as a function which takes a file stream and options and returns some stats. According to the Wikipedia page, “a monoid is an algebraic structure with a single associative binary operation and an identity element”; I don’t see how my function is an algebraic structure, I don’t see how it has a single associative binary operation, and I don’t see how it has an identity element?

                                1. 3

                                  countChunk isn’t enough — you also need combineChunks. That, plus an “empty” chunk, is what makes it a monoid. (Empty chunk = identity element, combineChunks = associative binary operation.)

                                  He defined an “empty” chunk called Unknown, and a function flux that turns a character into a chunk. He also defined a function <> that combines two chunks. That combination of things makes it a monoid and gives you a way to get characters into it.

                                  Once you’ve defined how to combine chunks, and how to turn a character into a chunk, you know how to turn a stream of characters into a chunk, or (as he did) break a file full of characters into pieces, turn each piece into a chunk, and combine those chunks into a final chunk.

                                  The nice thing about thinking of it as a monoid is that any code you have for dividing up work (and perhaps doing it concurrently) will automatically work on your monoid, because all that code will depend on is the ability to make [whatevers] and combine two [whatevers] into another [whatever] (i.e., the definition of Monoid). (In this case, said code is the combination of fold and forConcurrently that breaks up the work across the cores.)

                                  And my point was that knowing such code for dividing up work exists is what will prompt you to implement this counting operation as a monoid in the first place — which can’t happen if you don’t know what a monoid is.

                                  In any other language you might have the clever idea of recursively dividing the text into chunks and combining them back together — but you don’t have to be that clever if you’re accustomed to using monoids regularly.
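
                                  To make that concrete, here’s a sketch of my own (not the post’s actual code) of such dividing-up code, assuming the async package’s forConcurrently:

                                  import Control.Concurrent.Async (forConcurrently)
                                  import Control.Exception (evaluate)

                                  -- Map each chunk to a monoid value on its own thread, then combine the results.
                                  -- Nothing here knows anything about counting words; Monoid m is all it relies on.
                                  parallelFoldMap :: Monoid m => (a -> m) -> [[a]] -> IO m
                                  parallelFoldMap f chunks = do
                                    partials <- forConcurrently chunks (\chunk -> evaluate (foldMap f chunk))
                                    pure (mconcat partials)

                                  Any monoid slots in here, the word-counting one included, which is the sense in which such dividing-up code “automatically works” on it.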

                              2. 3

                                Your proposed “struct” based solution probably is a monoid.

                              3. 2

                                One may not expect parallelizing to multiple cores to do a whole lot since presumably this whole operation is IO bounded

                                I’m not so sure this is surprising at all. At some point the limit becomes how fast you can shovel bits across data lines to the CPU, and with more cores you should get broader bands in aggregate!

                                1. 2

                                  I hope this doesn’t lead to a mini-flood of “here’s how I optimize wc in my favorite language” posts…

                                  1. 29

                                    sound of many wc.rs files closing

                                    1. 2

                                      But seriously, @BurntSushi’s DFA library would probably make a really nice, simple implementation :)

                                      1. 3

                                        For reference, I think you’re referring to https://docs.rs/regex-automata

                                        You can see the regex I wrote for word segmentation: https://github.com/BurntSushi/bstr/blob/master/scripts/regex/word.sh

                                        A shell script generates the DFA via ucd-generate: https://github.com/BurntSushi/bstr/blob/083d609e16fe9c850b2e08708c0d2ecb191d44d1/scripts/generate-unicode-data#L47-L56

                                        It then gets embedded into the binary and is usable as-is with no decoding step: https://github.com/BurntSushi/bstr/blob/master/src/unicode/fsm/word_break_fwd.rs

                                        There’s almost certainly some optimizations one can perform on typical text that will speed things up dramatically compared to just naively running the DFA as-is. But that isn’t really a problem I’ve deeply investigated to be honest.

                                  2. 2

                                    It was a bit disappointing that they had to reach for non-standard extensions to get even somewhat acceptable performance.

                                    1. 4

                                      It doesn’t require that, though. The only extension that’s used (BangPatterns) isn’t used in the final version. And that’s as “standard” an extension as you’ll find, and easily replicated by using the seq function instead to force evaluation.
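
                                      For anyone unfamiliar, a minimal illustration of that equivalence (my own example), with a bang pattern in the first version and seq in the second:

                                      {-# LANGUAGE BangPatterns #-}

                                      sumBang :: [Int] -> Int
                                      sumBang = go 0
                                        where
                                          go !acc []     = acc            -- the bang forces acc at each step
                                          go !acc (x:xs) = go (acc + x) xs

                                      sumSeq :: [Int] -> Int              -- same strictness, standard Haskell
                                      sumSeq = go 0
                                        where
                                          go acc []     = acc
                                          go acc (x:xs) = acc `seq` go (acc + x) xs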

                                      1. 8

                                        It is used, along with other extensions. Unless dependencies don’t count.

                                        Edit: the reason this type of article tends to disappoint me is that high-performance Haskell often seems to require (directly or indirectly) GHC-specific knowledge and extensions and idioms that result in a language and code that are quite different from the elegant, easy-to-understand language/code that is used in Haskell marketing. At that point, I’d rather implement the critical bits in Rust or Zig or whatever, where I can write idiomatic code that performs well, and then expose that to Haskell.
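
                                        For the last point, a sketch of what “expose that to Haskell” can look like; count_words here is a hypothetical routine written in Rust/Zig/C and exported with the C ABI:

                                        {-# LANGUAGE ForeignFunctionInterface #-}

                                        import Data.Word (Word8)
                                        import Foreign.Ptr (Ptr, castPtr)
                                        import Foreign.C.Types (CSize(..))
                                        import qualified Data.ByteString as BS
                                        import qualified Data.ByteString.Unsafe as BSU

                                        -- hypothetical foreign routine: size_t count_words(const uint8_t *buf, size_t len)
                                        foreign import ccall unsafe "count_words"
                                          c_count_words :: Ptr Word8 -> CSize -> IO CSize

                                        -- thin wrapper so the rest of the program stays ordinary Haskell
                                        countWords :: BS.ByteString -> IO Int
                                        countWords bs =
                                          BSU.unsafeUseAsCStringLen bs $ \(ptr, len) ->
                                            fromIntegral <$> c_count_words (castPtr ptr) (fromIntegral len)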

                                        1. 3

                                          Shrug, BangPatterns being an extension instead of part of the language is just because the language has a (rather old) standard and a mechanism for extensions.

                                          Please elaborate on those other extensions. What is being used here?

                                          1. 2

                                            high-performance Haskell often seems to require (directly or indirectly) GHC-specific knowledge and extensions and idioms that result in a language and code that are quite different from the elegant, easy-to-understand language/code that is used in Haskell marketing.

                                            It was the same with Lisp when I read its optimization articles. I think it’s a hint that the idioms are fundamentally trading performance for other benefits. Which is fine. It does hint, though, that one could make a Lisp or Haskell with good performance by starting more bottom-up, looking at and building in whatever that requires.

                                            The closest thing in Haskell is the Habit programming language that HASP worked on. There was a Galois talk in 2017 whose abstract mentioned “edging toward a broader release.” Idk its current status, though. If it gets released, it might be interesting to re-run the exploration of this article using Habit to see what effect that has. I mean, it would be too unoptimized vs. GHC for a fair comparison of performance. However, we could see if its default coding style was still performant vs. C code. That’s meaningful to me.

                                            1. 5

                                              Funny, as a Lisp developer who’s previously tried Haskell, I don’t see it that way.

                                              90% of the time I don’t even have to optimize Common Lisp, but when I do, I can often get “good enough” performance simply by adding a few type declarations and marking some functions as inline. It doesn’t require any intimate knowledge of the Lisp compiler or runtime, and it only codifies what I already know about the code. Other than a few “(declare (type …))” sprinkled around, the code doesn’t change much at all.

                                              Eliminating memory allocation is the other big area for optimization, and that admittedly does require more understanding of how Lisp works, and more thought about which calls might allocate. Most of the time, though, reducing allocation doesn’t require any implementation-specific workarounds, and with a little experience the functions that allocate are easy to spot.

                                              At the end of the day, optimized Lisp looks nearly identical to un-optimized Lisp, but has type declarations, and might be rearranged a little to avoid memory allocation.

                                              On the other hand, optimized Haskell (IME) seems to be completely different code that tries to avoid using a lot of Haskell features (laziness, no side effects, etc.).

                                              EDIT: Thinking about it more, I remembered that Ironclad uses SBCL’s VOP mechanism to generate optimized machine code in some special cases SBCL doesn’t optimize by default. So implementation-specific optimizations exist in Lisp too, but they’re not as common as in Haskell.

                                              1. 2

                                                I should probably have said that I read that in some old articles. Things might have improved a lot. Allocation was one of the main things they mentioned. Appreciate the modern take on the situation.