All of these articles are frustrating because they use different environments and test sets, and none of the ones I’ve read have posted their test sets. Some people use random characters, some use existing files. Some use files of 1 MiB, some 100 MiB, some several GiB in size. Not only that, but the people writing the replacements don’t even normalize for differences in machine/processor capability by compiling the competitors and GNU wc from scratch. The system wc is likely to be compiled differently depending on your machine. The multithreaded implementations are going to perform differently depending on whether you’re running Chrome when you test the app, etc.
This would easily be solved by using the same distribution as a live USB, sharing test sets, and compiling things from scratch with predefined options, but nobody seems to want to go to that much effort to get coherent comparisons.
I tested this on a fresh install of Fedora 31, so I didn’t really see any benefit of running it on a LiveUSB. As I mentioned in the article, the wc implementation I used for comparison has been compiled locally with gcc 9.2.1 and -O3 optimizations. I’ve also listed my exact system specifications there. I’ve used the unprocessed enwik9 dataset (Wikipedia dump), truncated to 100 MB and 1 GB.
I understand your frustrations with the previous posts, but I’ve tried to make my article as unambiguous as possible. Do give it a read; if you have any further suggestions or comments, I’d be happy to hear them!
These posts are all pointless because wc doesn’t represent high-performance C code. If anyone cared about optimizing wc, it would use SIMD extensions like AVX to count multiple chars per cycle and trash all these blog posts (edit: apparently someone did, see this post on lobste.rs).
The real takeaway: all these languages are fast enough for general-purpose use, because they beat a C program that everyone considers fast enough for general-purpose use.
Someone did write a C version with SIMD. They got a ~100x speedup over wc.
So they’re not pointless. The value in these posts (this one included) is that they describe how to solve and optimize a problem in various languages.
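For readers unfamiliar with the idea, here is a minimal sketch of what "counting multiple chars per cycle" with AVX2 intrinsics can look like. It is a hypothetical illustration only (it counts newlines, not words, and is not the implementation linked above); compile with gcc/clang and -mavx2.

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch: count '\n' bytes 32 at a time with AVX2.
 * Illustrates the SIMD idea only; word counting needs more state. */
static size_t count_newlines_avx2(const unsigned char *buf, size_t len)
{
    const __m256i nl = _mm256_set1_epi8('\n');
    size_t count = 0, i = 0;

    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        __m256i eq    = _mm256_cmpeq_epi8(chunk, nl);
        count += (size_t)__builtin_popcount((unsigned)_mm256_movemask_epi8(eq));
    }
    for (; i < len; i++)   /* scalar tail for the last < 32 bytes */
        count += (buf[i] == '\n');
    return count;
}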
Most fail to mention the test setup’s locale as well! GNU’s version of wc, at least, uses the definition of “whitespace” from your current locale when counting words, while the linked Go implementation hard-codes whitespace as “ \t\n\r\v\f”. Whether this impacts speed and/or correctness depends on your locale.
As mentioned in the article, the test files are US-ASCII encoded. I’m comparing with the OS X implementation, not GNU, and I have used the same definition of whitespace as it does. I didn’t mention this in the post for the sake of brevity.
Well, unless you remove the multi-character whitespace function call (iswspace) from the C version and replace it with an if/case statement as in the Go version. Then the C version is faster than the Go version, though by a small margin (they probably compile to similar machine code): https://lobste.rs/s/urrnz6/beating_c_with_70_lines_go#c_flic9z
Still, the parallelization done in Go is nice!
I will paste my compiled comments from the orange site (they were not so coherent there, since I was typing them on my phone). The problem is that the article compares apples to oranges: the Darwin version of wc calls a multi-byte character function that takes up a significant amount of time, whereas the Go version does not do anything related to multi-byte characters.
Take the Darwin version linked from the article. Run perf record wc thefile.txt, then run perf report and you will see iswspace in the call graph. This function tests for whitespace in wide characters. Replacing that line with a plain single-byte whitespace check, as in the Go version, gives a ~1.7x speedup:
$ time ./wc ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc ../wiki-large.txt 0.47s user 0.02s system 99% cpu 0.490 total
$ time ./wc2 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc2 ../wiki-large.txt 0.28s user 0.01s system 99% cpu 0.293 total
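To be concrete about what “replace it with an if/case statement” means, the change is roughly of this shape (a hypothetical sketch, not the actual patched Darwin source), using the same hard-coded whitespace set the Go version uses:

/* Hypothetical sketch: a hard-coded ASCII whitespace test that stands in
 * for the locale-aware iswspace() call, matching the set " \t\n\r\v\f". */
static int is_ascii_space(unsigned char c)
{
    return c == ' '  || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}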
Remove unnecessary branching introduced by multi-character handling [1]. This actually resembles the Go code pretty closely. We get a speedup of 1.8x:
$ time ./wc3 ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wc3 ../wiki-large.txt 0.25s user 0.01s system 99% cpu 0.267 total
If we take the second table from the article and divide the C result (5.56) by 1.8, the C performance would be ~3.09, which is faster than the Go version (3.72). And why would the C version be ~2x slower than the non-parallelized Go version? They would basically compile to the same small state machine in machine code.
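To make that concrete, the “small state machine” in question is roughly this per-byte loop (a hypothetical sketch assuming ASCII input and the hard-coded whitespace set, not the exact code from either version):

#include <stddef.h>

/* Hypothetical sketch of the per-byte line/word/byte counting loop that
 * both the simplified C and the non-parallelized Go versions reduce to. */
static void count(const unsigned char *buf, size_t len,
                  long *lines, long *words, long *bytes)
{
    int in_word = 0;
    *lines = *words = 0;
    *bytes = (long)len;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c == '\n')
            (*lines)++;
        if (c == ' ' || c == '\t' || c == '\n' ||
            c == '\r' || c == '\v' || c == '\f') {
            in_word = 0;          /* whitespace ends the current word */
        } else if (!in_word) {
            in_word = 1;          /* first non-whitespace byte starts a word */
            (*words)++;
        }
    }
}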
Edit: I took the Go version from the webpage and ran that on the same data as well:
$ time ./wcgo ../wiki-large.txt
854100 17794000 105322200 ../wiki-large.txt
./wcgo ../wiki-large.txt 0.32s user 0.02s system 100% cpu 0.333 total
So, the C version is indeed faster when removing this piece of multi-byte character handling.
[1] https://gist.github.com/danieldk/f8cdaed4ba255fb2954ded50dd2931ed
This is a lot clearer, thank you! I ran your benchmarks and this is what I got on my machine:
100 MB: 0.33 s, 2032 KB
1 GB: 3.23 s, 2032 KB
This is indeed slightly faster than the non-parallelized Go version, although it still uses more memory! It does seem strange that it has been written this way.
Also, I have updated the post to remove the dependency on fmt and to stop manually setting GOMAXPROCS, which has improved performance and memory consumption significantly. You should check it out, I think you’ll find it interesting. Additionally, I made the source code available here: https://github.com/ajeetdsouza/blog-wc-go
Setting aside all this other stuff, can I just say how impressed I am that the “naive” implementation, clear and clean as it is, is within an order of magnitude of the C version? Sure, it may not have the rigorous error checking of the Ada version, and it might be longer than the first cut of the Rust version that has absurd memory usage, but like wow.
While still acknowledging that it’s a slightly rolled-out state machine, the code is something I think even a high-schooler could be talked through and expected to maintain. I know that Go’s whole schtick is exactly this sort of work, but I’m nonetheless impressed.