1. 96
  1.  

  2. 21

    The main takeaways for me are:

    • Go and Rust are both great languages if you want a minimum of safety and pretty good performance.
    • Old and tested UNIX tools written in C are hard to beat when it comes to optimized performance.
    • Python results in okay performance, but the performance per minute spent is good.
    • Go is easier to profile and squeeze more performance out of than C++.
    1. 2

      Go and Rust are both great languages if you want a minimum of safety and pretty good performance.

      Since they have similar performance results for naive and optimized versions, I wonder which one is easier to profile and optimize?

      1. 13

        I was trying to optimize almost exactly the code in the article a few weeks ago at $work in go. The differences were

        • We didn’t really want to compromise on the quality of toLower in non-ascii languages.
        • We had a really long input string, and our definition of “word” in the typical case excludes the vast majority of it. We didn’t want to make a copy of the entire thing.
        • In some pathological cases, we had so many words, that making a bunch of tiny allocations for each was OOMing our process (or a major contributing factor anyways).

        In rust I would have solved with by creating a struct LowerCaseString<'a>(&'a str) with a custom hash function and equality operator that converted to lowercase on the fly while hashing and running eq. Then I could have done this correctly with only the single allocation for the hashmap.

        But it’s go code, and you can’t write custom hash or equality functions for use in the default maps. So the choice was either create/import an entire new custom map type, allocate, or do something clever. As it turns out we could do something clever in this case… we didn’t need to keep a reference to the input strings around, and the very very occasional false duplicate wouldn’t matter, so we just stored a long and secure hash of the string in the map. This is pretty clearly inferior to the potential rust solution IMO.

        tl;dr: Sometimes go is opinionated enough that it makes optimizing code hard. In those cases Rust is usually superior.

        1. 1

          Writing a custom map type in Go is a possibility, though. Go supports any type by using interface{}.

          If that’s not good enough, it’s relatively easy to combine Go with C or Assembly.

          IMO the tl;dr should add that optimizing code in Go is mainly hard only if type-safety is essential for solving the problem.

          1. 4

            No, I wouldn’t agree with that TL;DR. An interface{}, in current Go anyway, implies boxing. And, of course, it doesn’t have low-overhead FFI, so dropping down into C can be tricky. (And can have nasty effects on the broader program.) Assembly works though, if you write Go’s particular flavor.

            But that’s kind of missing the forest for the trees IMO. As someone who has spent lots of time optimizing Rust and Go programs, I would absolutely agree that Go is opinionated enough to impose performance ceilings that are difficult or infeasible to overcome.

            As a reference point, compare my Rust snappy library with the Go snappy library. In Rust, I have to drop down into unsafe in some parts. But I’m still writing Rust. In Go, you have to drop down into Assembly and port it to multiple different platforms. At least on the metric of optimizing code, there’s a pretty clear winner here.

            1. 2

              Those are good points! Now I regret what I wrote and agree with everything you wrote instead.

              The idea of combining Go and Assembly may not be that bad, though, as long ie. only amd64 and arm64 are going to be supported. Of course, the best is to keep code portable and readable and not having to optimize anything “by hand”.

              Just brainstorming here:

              I dream of being able to inline Zig code in Go in a way that results in low-overhead FFI and side-by-side assembly output that can be examined while developing.

              It could also be possible to envision solutions involving JIT compilation, for selecting the best JIT solution at runtime, while the algorithm is running.

              1. 2

                Being able to inline Zig into a Go program with no ffi overhead would indeed be quite amazing.

                1. 1

                  Related project, for Elixir: https://github.com/ityonemo/zigler/

                  talk video

                  The level of integration that dnautics accomplished here is amazing.

        2. 11

          I don’t have a ton of experience with Rust so I can’t speak to it, but profiling and optimizing Go is an absolute dream.

          1. 15

            I have a lot of experience with both and profiling in both. Overall, Go has a better story, primarily because, IMO, its profiling tools have better UX. Getting an easy to view source listing with counters next to each line is really really nice. You can do the same with Rust in principle with perf, but the view is just so much worse. But maybe it’s harder with Rust because of the extreme inlining.

            And being able to easily open up a nice call graph is just excellent too. Much better than perf’s default display.

            1. 16

              Have you tried using https://github.com/google/pprof on a perf recording? It has a nice web UI with graphs and tables and lots of features, looks similar to the Go one. The flow of installing it and using it is definitely not as good as Go though.

              1. 9

                I have not! I will give this a try. I’m working on something with heavy profiling usage right now, so I have a good test case. Thanks for poking me about it.

                1. 4

                  I would be interested to hear whether you think this tilts the odds in rust’s favor (once you’ve had some time to play around of course).

          2. 1

            The same tools will largely work on both, though they may not know how to unwrap fancy data structures in Rust into pleasant human readable form.

        3. 10

          1+ for including Forth.

          I wonder if it’s feasible to transpile Forth to CPython’s VM (or vice versa) since they are both stack based…

          1. 7

            1+

            I see what you did there.

            1. 3

              I’ve actually wondered the opposite before: could Forth be an extensible, bare metal VM for some stack machine based languages?

              1. 1

                Does CPython’s VM contain registers? Forth isn’t just “stack based”, just operations are performed on it. The canonical Forth virtual machine has two stacks.

              2. 10

                Personally, I’d be quite interested in also seeing Nim, Zig, and maybe D versions. Though I assume every reader will have their own set of favourites they’d like to see the most.

                1. 18

                  Here’s a quick “simple” version in Zig, it beats the simple version in C on my machine at 0.37 vs C’s 0.65.

                  https://zigbin.io/73ee75

                  Edited to fix the usual minor mistakes that happen when I write something quickly :)

                  1. 5

                    Beautiful. Zig definitely scratches an itch for me; I just wish I had the time to devote proper attention to it. Maybe someday.

                  2. 4

                    I’m sure the author would appreciate some submissions for those languages.

                    I feel like this type of benchmarks usually suffer from the bias of the author’s proficiency with the tested languages which is not uniform. I’m sure optimized versions from experts of each language would be highly useful to them, and to us.

                      1. 1

                        Indeed. And they did solicit a contribution for Rust from the developer of rgrep- which is rather fast!

                      2. 4

                        I submitted naive and slightly profiled/optimized D-Language versions. There’s probably a bit more to squeeze out (the char to string idup allocation).

                        edit: found some over a cup of coffee.

                        edit #2: found more, now on par with optimized cpp. It’s not very readable anymore:

                        https://github.com/rlonstein/countwords/blob/dlang/simple.d

                        vs.

                        https://github.com/rlonstein/countwords/blob/dlang/optimized.d

                      3. 9

                        To everyone requesting other languages:

                        I’d love readers to send pull requests to the benhoyt/countwords repository to add other popular languages like Java, JavaScript, Ruby, Kotlin, or Swift, and I’ll link them here. I’m not familiar enough with most of those to do them justice anyway.

                        So go forth and code!

                        1. 4

                          A few years back I went and re-coded a form of twc in an attempt to understand Sean Barrett’s lexical analysis method, it turned out to be only 250-ish lines of C code, but several orders of magnitude faster compared to GNU twc.

                          To get this, it was really only enough to minimize the amount of branches in the core loop of the program. I didn’t need to convert to lower-case, or use a hashmap. Basically what it consists of is:

                          • A few integers around to track state (bytes, words, lines, etc.)
                          • Some tiny state transition arrays along with a function to run through them
                          • A tiny function to read the entire file into memory (This isn’t really necessary, and it might be possible to make it faster with mmap() and co)
                          • Some miscellaneous bits and bobs to cope with short and long options

                          The only consession I made was that \r is a whitespace character, and it doesn’t cope with combining options like -lw or whatever. It would be reasonably trivial to add another function to pull the words out and produce format like the above (As long as the function did not contain any branches – which if it’s done correctly, it won’t), but it’s been quite a long while since I touched that code, so…

                          1. 4

                            For the optimized Go implementation:

                            To reduce the allocations, we’ll use a map[string]*int instead of map[string]int so we only have to allocate once per unique word, instead of for every increment

                            Huh? Can someone explain this to me? Frankly, this seems bananas.

                            1. 7

                              Well, let’s test it. Here’s a patch that switches to map[string]int:

                              diff --git a/optimized.go b/optimized.go
                              index 74b42c3..638b694 100644
                              --- a/optimized.go
                              +++ b/optimized.go
                              @@ -11,7 +11,7 @@ import (
                               func main() {
                                      offset := 0
                                      buf := make([]byte, 64*1024)
                              -       counts := make(map[string]*int)
                              +       counts := make(map[string]int)
                                      for {
                                              // Read input in 64KB blocks till EOF.
                                              n, err := os.Stdin.Read(buf[offset:])
                              @@ -71,7 +71,7 @@ func main() {
                              
                                      var ordered []Count
                                      for word, count := range counts {
                              -               ordered = append(ordered, Count{word, *count})
                              +               ordered = append(ordered, Count{word, count})
                                      }
                                      sort.Slice(ordered, func(i, j int) bool {
                                              return ordered[i].Count > ordered[j].Count
                              @@ -82,15 +82,8 @@ func main() {
                                      }
                               }
                              
                              -func increment(counts map[string]*int, word []byte) {
                              -       if p, ok := counts[string(word)]; ok {
                              -               // Word already in map, increment existing int via pointer.
                              -               *p++
                              -               return
                              -       }
                              -       // Word not in map, insert new int.
                              -       n := 1
                              -       counts[string(word)] = &n
                              +func increment(counts map[string]int, word []byte) {
                              +       counts[string(word)]++
                               }
                              
                               type Count struct {
                              

                              Before:

                              $ hyperfine './optimized-go < kjvbible_x10.txt'
                              Benchmark #1: ./optimized-go < kjvbible_x10.txt
                                Time (mean ± σ):     270.4 ms ±   7.6 ms    [User: 263.5 ms, System: 11.6 ms]
                                Range (min … max):   265.0 ms … 291.8 ms    11 runs
                              

                              After:

                              $ hyperfine './optimized-go-mapstringint < kjvbible_x10.txt'
                              Benchmark #1: ./optimized-go-mapstringint < kjvbible_x10.txt
                                Time (mean ± σ):     453.6 ms ±  20.5 ms    [User: 486.5 ms, System: 24.2 ms]
                                Range (min … max):   426.7 ms … 485.8 ms    10 runs
                              

                              Running both programs with perf shows that the program with map[string]int has a hotspot in runtime.mallocgc.

                              Presumably, the simpler map[someString] += 1 variant allocates a new string? Using a *int allows you just do a lookup and increment the value. I guess if you think of map[x] += 1 as map[x] = map[x] + 1 then maybe the extra alloc makes a bit more sense. But I wonder whether it’s possible for the compiler to optimize this kind of case. Would be neat.

                              1. 2

                                The take away for me is that map as a generic type is lying to me. It is not like a generic Dictionary<K,V> in .NET where a value type is stored as part of the dictionary, at least I think it is stored that way. It is possible I don’t understand that data structure either! Instead, it seems the map always uses pointers as values. It simply fetches those values and dereferences them for me.

                                1. 3

                                  How the Go runtime implements maps efficiently (without generics) gives some details on the implementation.

                                2. 2

                                  Looking at the profile in the original article, it’s plain that runtime.mapaccess2_faststr is about twice as fast as runtime.mapassign_faststr. So using an *int value seems to primarily avoid calling mapassign over mapaccess2 more than anything else.

                                  Running both programs with perf shows that the program with map[string]int has a hotspot in runtime.mallocgc.

                                  That’s what the original article indicates as well. I just don’t understand why that would be. That indicates mapassign allocates memory for every call, even when the key is found. But taking a cursory look at the source for mapassign_faststr, I don’t see any allocations in the existing key path. Do you still have stack traces for the runtime.mallocgc in the slower map assign version?

                                  1. 3

                                    Yes, I agree it is perplexing. Here’s a screenshot of my profile: https://imgur.com/a/EzzUULV

                                    Unfortunately, it looks like there aren’t any traces. Usually perf report shows them in my Rust programs, so maybe Go isn’t emitting the right dwarf info for it to work.

                              2. 3

                                I wonder if the awk result would improve if you use <input tr 'A-Z' 'a-z' | awk instead of using tolower. And whether parallelism comes into picture in this scenario.

                                Edit: Just checked, I get 0m0.921s for tr version compared to 0m0.954s for tolower. Not sure if parallelism had an effect here though.

                                1. 3

                                  I would have loved to see a Julia section.

                                  1. 1

                                    The unoptimised version would look a lot like the python version, but probably perform better if you omitted compilation time.

                                    You’d use a Dict, eachline, and split. The main performance issues would probably be unicode-aware lowercasing and hashing each word twice in the inner loop (once to check if it’s in the dict, once to set a new value. We might eventually get an in-place update interface for dicts to avoid this duplication, in the meantime you can use some internal funcs or use a third party package).

                                    Unlike the python version, if you need Moar Performance you can efficiently operate at the byte level and use any of the tricks from the C or Rust implementations.

                                    1. 1

                                      I suppose it will be dominated by startup time and IO. Although I am curious to see if the same issues would arise in Julia as well, splitting of words and slowdowns due to maps.

                                    2. 3

                                      This was inspiring and educational to see on a Monday morning. Thanks Ben for sharing!

                                      1. 2

                                        I definitely like what I’ve seen and heard about Rust, and I’d learn it over modern C++ any day, though I understand it’s got a fairly steep learning curve … and quite a few !?&|<> sigils.

                                        Here is Wikipedia’s definition of sigil, as used in computer programming:

                                        In computer programming, a sigil (/ˈsɪdʒəl/) is a symbol affixed to a variable name, showing the variable’s datatype or scope, usually a prefix, as in $foo, where $ is the sigil. -Wikipedia

                                        How many of the above !?&|<> are sigils, according to this definition, meaning variable names have symbols attached that change their datatype or scope?

                                        There isn’t one obvious place that documents this. Here are some relevant documentation and discussion that I’ve found so far:

                                        Based on the above, I only see two cases where Rust symbols act as sigils:

                                        • &, when used to denote a borrowed pointer type
                                        • ', to denote a lifetime variable

                                        I’m not completely confident that there are only two. There may be more than two; please correct me if I’m wrong.

                                        P.S. If you search for “Rust sigils” online, you may find this, but it is from a previous version of Rust. Rust now uses significantly different syntax.

                                        1. 16

                                          While technically correct (the best kind of correct) I think the author is probably using a less formal definition of sigil. The intent is clear, in any case: learning Rust means learning the semantics of a lot of ASCII characters.

                                          1. 2

                                            You mean technically incorrect, right?

                                            1. 2

                                              I think Peter’s reply can be unfortunately (and I suppose unintendedly) read in different ways; how I read it, and what I believe to be the intended meaning, was: “While [what you (@rele) wrote above is] technically correct […]”

                                              1. 1

                                                Yes, sorry, what akavel said: While you are technically correct.

                                              2. 1

                                                Still less than for J/K (or Perl!).

                                            2. 2

                                              Rosetta Code already has a task for this without the simplification with many more solutions.

                                              1. 1

                                                A barely relevant nitpick:

                                                Unfortunately, C doesn’t have a hash table data structure in its standard library. However, there is libc, which has the hcreate and hsearch hash table functions, so we’ll make a small exception and use those libc-but-not-stdlib functions.

                                                libc is the standard library of C. hcreate and hsearch are non-standard extensions in GNU libc a.k.a. glibc.

                                                Curiously though, OS X libc has hcreate and hsearch too. So I figure, this simplification (libc = glibc) is plenty justified.

                                                1. 3

                                                  Not to nitpick your nitpick, but here’s a comment that states that hcreate is part of POSIX:

                                                  https://news.ycombinator.com/item?id=26466477

                                                  1. 1

                                                    Why not just link to POSIX directly, after all ddg has !posix

                                                    https://pubs.opengroup.org/onlinepubs/9699919799/functions/hcreate.html

                                                2. 1

                                                  Can someone explain like I’m 5 how grep is so much faster than optimized C code? If the authors are aware of grep and its tricks, why couldn’t they just do similar things to achieve similar speeds?

                                                  1. 11

                                                    grep and wc are just signposts. They aren’t actually completing the challenge: https://github.com/benhoyt/countwords/blob/9d81d13711e56c25068893880a648ac96d0fa92e/benchmark.py#L23-L24

                                                    1. 8

                                                      Yeah, burntsushi is correct – they’re just being used as a baseline to show “how fast can a fast tool read this file” or “what’s the minimum it could be”. Sorry if that wasn’t clear. I’ll update the wording a bit to clarify.

                                                    2. 1

                                                      I very much enjoyed this read! In general I do wish posts like this had larger input sets to see the gaps potentially widen and reduce noise. Although in some cases this doesn’t help much.

                                                      Tangentially, I’m also curious how much affect knowledge of the dataset can come into play. I.e. when the author picked an initial hashet size of roughly 2x unique words. Since some hashmaps have different performance characteristics depending on how “full” the map is not to mention reducing number of allocations and the like… although I’m sure that’s miniscule here.

                                                      1. 3

                                                        Yeah data set size is tricky. I think in this benchmark, it is maybe just big enough. It’s a problem that appears when you include slower data points in your benchmark. e.g., Python is about an order of magnitude slower than C here. So if you jack that data set size up, all of a sudden measuring your benchmark with Python becomes annoying. Of course, you could solve this by using two inputs instead of one and partitioning the tools. But it’s more complex.

                                                      2. 0

                                                        I think this a wrong comparison. If I am not mistaken the unix version is actually parallel ? Atleast in golang the actual idiomatic way to do this would be to use goroutines, as the data is large. The beauty of goroutines and channels is that it enables pipe style thinking with functions.

                                                        1. 8

                                                          Coming up with a benchmark is hard. Benchmarks like these are models. All models are wrong. But. Some are useful.

                                                          Follow you thinking to its logical conclusion. If you open up the parallel programming flood gates, now you need to do that for all of the programs. Now your effort to contribute a sample for each language goes up. Maybe enough where actually doing the benchmark isn’t practical.

                                                          So let’s play the tape. If someone told me, “sure, have fun, throw threads at this!” I’d probably say, “can I use memory maps?” And let’s say, “yes, sure, go for it!” I say okay, here’s my first crack at it:

                                                          1. Hope stdin is a file and open it as a memory map.
                                                          2. Split the memory map slice into N chunks where N is the number of logical CPU cores on the system.
                                                          3. Run approximately the same algorithm devised in existing code submissions in N different threads, with one thread per chunk.
                                                          4. At the end, join the results and print.

                                                          If that ends up being the best strategy overall, then your model has become over-complicated because performance now likely depends on step (3). Which is exactly what the existing benchmark is. Now, maybe some languages have a harder time passing data over thread boundaries than others and maybe that impacts things. But that doesn’t apply to, say, C, Rust, C++ or Go.

                                                          If that’s not the best strategy, then maybe there is a cleverer approach that involves synchronizing on one shared data structure. I doubt it, but maybe. If so, your benchmark turns into one that is very different. It’s no longer just about data transformation, but now you have to worry about synchronization trickiness. And now you’re way outside of a dumb goroutine pool that you might have used in Go.

                                                          Building a good benchmark is an art. Sometimes the constraints look dumb and arbitrary. But you have to sit down and really look at it from a bunch of different angles:

                                                          • Do the constraints approximate some real world conditions?
                                                          • Do the constraints inhibit writing code that looks like real world code?
                                                          • Do the constraints inhibit optimizing the crap out of the code?
                                                          • Do the constraints make the benchmark difficult to create in the first place?
                                                          • Do the constraints permit easy and clear measurement?

                                                          There are probably more things that I didn’t think about.

                                                          1. 1

                                                            Nothing dumb about them. They are a core feature of the language. Writing goroutines is idiomatic in Go and possibly in Clojure unlike Rust, C++, Java, Julia, C, Common Lisp, Scheme, Haskell … which are still stuck in a bygone era of sequential first.

                                                            A more appropriate comparison would be with

                                                            1. Sequential Algorithms
                                                            2. Parallel Algorithms with threads, processes, green threads (the Unix approach is using processes)

                                                            Then we can also measure lines of code / code complexity / time taken to write the code and so on ….

                                                            The whole point of a controlled experiment in science is to avoid bias in results and data, not to compare grapes and oranges in sweetness.

                                                            1. 2

                                                              Nothing dumb about them.

                                                              OK… I wasn’t saying they were dumb as in “they’re stupid and useless.” I was describing the pattern of a goroutine pool as “dumb” to mean, “that thing that everyone has written to parallelize a trivially parallelizable problem.” It pretty much looks the same in Go, C, C++ and Rust and probably others. Go almost certainly has the least friction, but it’s not a huge deal in any of those languages.

                                                              The whole point of a controlled experiment in science is to avoid bias in results and data, not to compare grapes and oranges in sweetness.

                                                              If you can’t see that that is roughly what the OP is doing, then I don’t know what to tell you. If your beef is with the fact that the Unix solution is technically utilizing concurrency, then fine, just advocate removing it from the benchmark then. The Unix shell solution could be thrown out of the OP without losing much. The rest are all sequential AIUI.

                                                              1. 1

                                                                I was wondering how shell was beating golang. The author is using wc -w / grep as a baseline but confusingly placed it at the top, but the actual shell solution is actually at the bottom. The shell solution makes sense to be included from an loc perspective. I still think the idiomatic go solution should use goroutines and mirror the pipe solution seen in the shell solution.

                                                                1. 1

                                                                  That’s probably going to be slower, but just my 2 cents, you are free to implement that and prove me wrong.

                                                                  1. 1

                                                                    I probably read the post hastily but I was assuming the author tested with 1GB data and from a preliminary look it seemed that the shell was finishing first of all the things! But actually the shell finished last and the test data is just 4 MB, hence my confusion. I wrote the code just parallelizing lowerCase and uniqeCount, just like the shell logic and it seems to finish thrice as slow wrt the single threaded.

                                                                    file, 1MB ~ 100GB: single node with multiple threads, where you can mmap and parallel the reads/count, but has to take care of thread synchronization

                                                                    I was thinking more along the lines of reading by data by blocks or streaming data but not along the lines of mmap. Now I want to test the actual case! Probably some gene data should suffice for the actual real world use case.

                                                                    1. 2

                                                                      The test data is actually the 4MB file x 10, or about 43MB.

                                                            2. 1

                                                              Hope stdin is a file and open it as a memory map.

                                                              This properly won’t work on most common systems, the stdin is more like a pipe, and most os doesn’t expose the mmap interface on it. A practical version will be read stdio into slabs and distribute those to different threads. However as the author says there is no significant improvement for buffers over 1K bytes. Dispatching and synchronizing between threads is likely to introduce even higher overhead for those 1K bytes blocks. It’s very tricky to really optimize with parallel approaches, due to the problem (stdio) and small input size (102K).

                                                              Here are my predications based on the input type and input size:

                                                              • stdin, <1MB: single thread program optimized for instructions per bytes
                                                              • file, 1MB ~ 100GB: single node with multiple threads, where you can mmap and parallel the reads/count, but has to take care of thread synchronization
                                                              • file, >100GB: multi node clusters, something like hadoop might help since you have to store and share the input in some efficient way.

                                                              I think the OP is not really to show off any particular optimization technic, it’s more on language comparison, and showing a small but interesting coding puzzle to the audience.

                                                              1. 3

                                                                I mostly agree. I didn’t mean to actually get into the rabbit hole of parallelization. My main point was to illustrate why, even if you allowed multi-threaded programs in this benchmark, the single threaded case is still almost certainly worth optimizing and comparing directly because it will not only be a part of the multi-threaded solution but will likely dominate its runtime. I think that’s true in either of our scenarios here.

                                                                The higher-level point being, that, ceteris paribus, a simpler model benchmark is preferred to a more complex model benchmark.

                                                                This properly won’t work on most common systems, the stdin is more like a pipe, and most os doesn’t expose the mmap interface on it.

                                                                If you do prog < some-regular-file, then you can get the file descriptor and mmap some-regular-file on at least macOS and Linux.

                                                                But yes, obviously, I know it doesn’t work in all cases. That’s why I said, “hope stdin is a file.” :-) I suppose you want me to revise that to, “hope stdin is a file and that the OS supports memory maps and that you can access stdin’s as a file and that the OS supports memory mapping such files.” But that’s a lot more to type.

                                                            3. 2

                                                              If I am not mistaken the unix version is actually parallel ?

                                                              Only a little bit (much less than it might superficially appear). The easy automatic parallelism of shell pipelines is a very nice feature, but you have to keep its limitations in mind. sort, as a common example, is effectively a parallelization barrier, because it can’t produce any output until it’s consumed its entire input, so nothing downstream of it in the pipeline can do any meaningful work concurrently with anything upstream of it.