The author might be a bit confused about what “local variables” are in awk. Technically there are no local variables. Only function parameters are local to the function. All other variables are global. And function parameters are passed by value if scalar and by reference if array name. Thus there are no locally declared arrays.
Yeah, I agree with @apg – you seem to just be restating what the author is saying (I don’t think he’s confused at all). Also, I don’t think your last sentence is correct:
Thus there are no locally declared arrays.
You can definitely have a locally-declared array. From the POSIX spec:
If fewer arguments are supplied in a function call than are in the function definition, the extra parameters that are used in the function body as scalars shall evaluate to the uninitialized value until they are otherwise initialized, and the extra parameters that are used in the function body as arrays shall be treated as uninitialized arrays where each element evaluates to the uninitialized value until otherwise initialized.
So if you use a parameter inside a function as an array, but the caller doesn’t pass that array, AWK creates a fresh new local array each call (which works recursively too). It’s still stack-based, so still doesn’t need a GC. Consider this recursive function:
$ cat t.awk
BEGIN { f(2) }
function f(n, a) {
a[n] = 1
for (k in a) print k, a[k]
print "---"
if (n) f(n-1)
}
$ awk -f t.awk
2 1
---
1 1
---
0 1
---
Note how each time it’s called, the a array only has one element. A fresh one is created each time == local arrays.
Compare that to this one, where we pass in a single global array from above, and note how the array gets larger as we add to it:
$ cat t2.awk
BEGIN { f(2, a) }
function f(n, a) {
a[n] = 1
for (k in a) print k, a[k]
print "---"
if (n) f(n-1, a)
}
$ awk -f t2.awk
2 1
---
1 1
2 1
---
0 1
1 1
2 1
---
Only function parameters are local to the function
I’m now confused. How is what you are saying different from what the author says?
I think there’s a mistake in the first sentence: 0xFFFFFFFFFFFFFFFF (2^64-1) is actually 18,446,744,073,709,551,615, not 9,223,372,036,854,775,807 (2^63-1). The former is the largest unsigned 64-bit integer (which is what he’s talking about here), and the latter the largest signed 64-bit integer.
>>> format(0xFFFFFFFFFFFFFFFF, ',')
'18,446,744,073,709,551,615'
My mnemonic to remember 2**64 is “18 times ten to the 18” (18e18).
Very cool! I peeked at the code, and it looks like a big part of it was adding location info to AST nodes?
So I would guess that goawk is an AST interpreter? (like the original awk, but unlike mawk, I think)
Is there a runtime cost to coverage mode?
I wouldn’t say a “big” part of it was adding location info to AST nodes, but yeah, those are needed so that the “coverage injector” can record coverage and map it to line/column numbers in the report. Basically GoAWK has to execute the __COVER assignments in the coverage-injected code, for example below:
$ goawk -f prog.awk -covermode set -d
BEGIN {
__COVER["3"] = 1
print "will always run"
if ((1 + 1) == 2) {
__COVER["1"] = 1
print "should run"
} else {
__COVER["2"] = 1
print "won't run"
}
}
GoAWK started life as a tree-walking interpreter, but I switched to a byte-code compiler + interpreter a while back. There’s still a runtime cost to coverage as it has to record lines covered in a map, but it’s not on by default, so hopefully not too much of an issue for users.
Kind of a tangent, but I’m curious: when it was an AST interpreter, how did you implement break / continue / return in Go?
I do it with exceptions, but Go doesn’t have those. You can just use the normal style of return result, err, or I have seen people polling a global flag. Maybe you can use panic/recover, though I think that would be frowned upon.
When the interpreter walked the AST I used error returns for this: I just had the BreakStmt return a user-defined error errBreak, and then all the loops would check for this error and break, for example WhileStmt. Now that it’s bytecode-compiled I still use error returns, though I can be a bit smarter about turning breaks and continues into jump opcodes most of the time.
Originally in the AST-walking interpreter I had used panic/recover for this, mainly because it simplified evaluation, but it was significantly faster to use error returns (and you’re right, not idiomatic Go), so I switched to that early on.
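To make that concrete, here’s a minimal, self-contained sketch of the sentinel-error approach (the names are illustrative, not GoAWK’s actual types): a break statement returns a package-level error value, and loop evaluation checks for it, exits the loop, and swallows it rather than propagating it further.

package main

import (
	"errors"
	"fmt"
)

// errBreak is the sentinel error a "break" statement returns.
var errBreak = errors.New("break")

// Stmt is a tiny stand-in for an evaluated AST statement node.
type Stmt func() error

// execWhile runs body repeatedly while cond() is true, treating errBreak as a
// normal loop exit and propagating any other error up to the caller.
func execWhile(cond func() bool, body []Stmt) error {
	for cond() {
		for _, s := range body {
			if err := s(); err != nil {
				if errors.Is(err, errBreak) {
					return nil // break: stop this loop only
				}
				return err // real runtime error: propagate
			}
		}
	}
	return nil
}

func main() {
	i := 0
	err := execWhile(
		func() bool { return true }, // "while (1)"
		[]Stmt{
			func() error { fmt.Println("iteration", i); return nil },
			func() error {
				i++
				if i >= 3 {
					return errBreak // the "break" statement
				}
				return nil
			},
		},
	)
	fmt.Println("loop done, err =", err, "i =", i)
}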
The optimized Go implementation will count words twice when they cross the buffer boundary. Which is a bit of a shame, as the Go standard library provides not one but two(!) interfaces for exactly this task: bufio.ReadString and bufio.Scanner.
My other observation would be that 413MB isn’t ‘large’ in a meaningful sense. It’s small enough to read completely into memory on an early-2000s laptop. It’s small enough to comfortably fit in the DRAM cache of a consumer-grade SSD. Which doesn’t mean that this article’s intuition is completely wrong, but that the ratio might change considerably if datasets that are actually large are used.
Can you show me how (or give a test case for how) it counts words twice when they cross the buffer boundary? I’ve double-checked the code and run various tests with a reduced buffer size, and it doesn’t seem to do any double-counting. I used bufio.Scanner in the original simple Go version, but it’s hard to separate out read time from processing time, because the scanner does both, so for the simple version in this article I’m just reading the whole file. For the optimized version, reading large chunks and then scanning the bytes by hand is faster, because it avoids allocations.
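For what it’s worth, here’s a rough sketch (not the article’s actual code, and simplified to splitting on spaces and newlines only) of how chunked reading avoids double-counting a word that straddles two reads: the partial word at the end of each chunk is carried over to the front of the buffer for the next read.

package main

import (
	"fmt"
	"os"
)

func main() {
	buf := make([]byte, 64*1024) // assumes no single word is longer than this
	offset := 0                  // bytes of carried-over partial word at start of buf
	counts := make(map[string]int)
	for {
		n, err := os.Stdin.Read(buf[offset:])
		if n == 0 {
			break // EOF; any trailing partial word is handled below
		}
		chunk := buf[:offset+n]
		// Everything after the last space/newline may be a partial word.
		last := len(chunk)
		for last > 0 && chunk[last-1] != ' ' && chunk[last-1] != '\n' {
			last--
		}
		countWords(chunk[:last], counts)
		// Move the partial word to the front of the buffer for the next read.
		offset = copy(buf, chunk[last:])
		if err != nil {
			break
		}
	}
	if offset > 0 { // final word with no trailing newline
		counts[string(buf[:offset])]++
	}
	fmt.Println(len(counts), "unique words")
}

// countWords splits b on spaces/newlines and increments counts for each word.
func countWords(b []byte, counts map[string]int) {
	start := -1
	for i, c := range b {
		if c == ' ' || c == '\n' {
			if start >= 0 {
				counts[string(b[start:i])]++
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		counts[string(b[start:])]++
	}
}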
You’re probably right that I should have used a larger dataset. Still, I think it proves the main point of the article, that disk I/O isn’t the bottleneck for processing like this. I don’t think going up 10x or 100x would have changed that.
This is an interesting observation and I think the message is valid, but I don’t think the benchmark results paint an accurate picture. The author is completely utilizing the computer’s disk I/O bandwidth, but the CPU cores are only utilised at (I’ll take a guess) 12.5%, assuming 8 hyper threads available. I think this distinction is particularly important in the big data case mentioned at the end.
It may be far worse. As Python has no “@associative” built-in for its Counter, it is probably running on a single core, with said core handling the overhead of data movement as well.
That said, it is a guess. My humble view is that we are no longer capable of optimizing without careful analysis of actual profiling data. Our guesses and logic are no match for chip pipeline and thread oddness.
Yes, there are definitely straight-forward ways to parallelize this particular problem. That’s part of the reason I put “big data” in quotes. Often data just isn’t “big” enough to worry about it – if I can get the answer I want out of a 400MB or even 4GB input file in a few seconds, I’ll probably stop there. However, if I have to write a more general tool that I’ll use over and over, I’d parallelize it if it was slow (for some value of slow).
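As a concrete (and hypothetical – not the article’s code) illustration of one straightforward way to do it: fan the input lines out to one goroutine per CPU, count into per-worker maps, and merge the maps at the end.

package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"sync"
)

func main() {
	lines := make(chan string, 1024)
	results := make(chan map[string]int)
	workers := runtime.NumCPU()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counts := make(map[string]int) // per-worker map, no locking needed
			for line := range lines {
				for _, w := range strings.Fields(line) {
					counts[strings.ToLower(w)]++
				}
			}
			results <- counts
		}()
	}
	go func() { wg.Wait(); close(results) }()

	go func() {
		scanner := bufio.NewScanner(os.Stdin)
		for scanner.Scan() {
			lines <- scanner.Text()
		}
		close(lines)
	}()

	total := make(map[string]int) // merge per-worker counts (the associative step)
	for counts := range results {
		for w, n := range counts {
			total[w] += n
		}
	}
	fmt.Println(len(total), "unique words")
}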
Maybe it relates to the international nature of Internet communication/English as a second language effects, but a lot of readers seem to not grasp the use of scare quotes for “non-standard” or “difficult to interpret” or “context dependent/no standard even exists, really”. FWIW, I also love them as a pithy “author understands interpretational risk” signal. I don’t have any better answer than more elaborate text – like “some value of”/“in context”/etc. qualifier constructions, but the resulting more complex grammar can be a problem for the very same readers.
This is all especially an issue in performance of anything where almost everything “all depends” upon, well, quite a lot. E.g., these days (post Nehalem Intel/L3) DIMM reads themselves depend quite a lot on whether done in parallel or serial. So, I have an NVMe that goes 7 GiB/s while your DIMM estimate was 10 G, but I’d bet your parallel DIMM reads are 25..30. Big “scalable” Intel multi-core could see a range of from 8 to 50 GiB/s (or more, even a whole order of magnitude). Personally, it annoys me that I am sometimes forced to go parallel (not necessarily multi-threaded) with all its attendant complexity on an otherwise very serial problem just to saturate a serial memory bus (yes, I know NUMA can be a thing). As this paragraph maybe shows, qualifying everything gets indigestible fast. ;-)
Anyway, @enobayram, to add a little more detail on your point - this is one way to parallelize the problem in Nim and note that scaling up by just doing 10X something like the KJ bible keeps hash tables artificially small relative to natural language vocabulary use which matters since fitting each vocabulary-scaled histogram in the private (i.e. maybe non-hyper threaded) L2 can make a big perf difference at some data scales. Whether hyper-threaded CPU resource sharing helps also “all depends”. At least on Linux, something like the taskset command might aid in reasoning about that.
Personal: reviewing a new code coverage feature in GoAWK, submitted by a (power) user. https://github.com/benhoyt/goawk/pull/154
A clear write-up of a straightforward method for creating and hosting a personal website. Thanks.
I was surprised to see the doctype (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">), but I guess that falls into the category of “if it’s not broken, don’t fix it”.
For anyone who wants to learn the basics of HTML and CSS, I recommend Interneting is Hard. (In fact, I’ll probably submit that now separately too.)
I guess that falls into the category of “if it’s not broken, don’t fix it”.
This is true! It’s been that way forever and just works. But your comment made me think, “I should probably update it now.” I’ll give it a try and see if anything breaks. :-)
Confirmed: Rob Pike reads Eli’s blog. https://github.com/golang/go/issues/53261
Quite possible, though it’s equally likely he saw the related Go change that Eli submitted: https://go-review.googlesource.com/c/go/+/410296
Matter in what sense? It’s just amusing that Eli posted a blog entry about improving performance and then the Go team member responsible posted an issue proposing to do the same thing. The issue doesn’t mention the blog post, but the timing is too close for it to be coincidence. Alas, Ben Hoyt thinks the issue was in response to a change request, and not the blog per se.
I find “should” in test names to be a noise word that can be easily eliminated.
Taking some of the examples from the article:
should apply discount when total cost exceeds 100 dollars
applies discount when total cost exceeds 100 dollars
should create record for valid input
creates record for valid input
should return error code 1 for unknown error
returns error code 1 for unknown error
I was basically about to say the same thing. I want to add, though, that I’ve seen this “should” approach greatly help people manage the mental switch from the test_functionname pattern, but yeah, once you feel okay, take off the training wheels.
This all just feels like a ton of unnecessary noise to me. I actually do test “should return error code 1 for unknown error” right here: https://github.com/carlmjohnson/exitcode/blob/master/code_test.go#L21 The test is called TestGet_default. I think if it were somehow confusing “why is it checking for 1? what does 1 mean here?” it might be worth having a comment or a longer test name, but in most cases what the test is testing for should be obvious and if it’s not (like the discount example), then that means the API in the code probably isn’t as clear as it should be.
Agreed. I’m not a fan of verbose test names like this, especially when it’s done using one of those cutesy DSLs, like it().should().be().equalToOrGreater().than(42). Just call it TestRange or TestMin or whatever.
Nit: the test above is actually called TestGet/default. See go test -v ./... output:
--- PASS: TestGet (0.00s)
--- PASS: TestGet/wrapped (0.00s)
--- PASS: TestGet/nil (0.00s)
--- PASS: TestGet/default (0.00s)
--- PASS: TestGet/help (0.00s)
--- PASS: TestGet/set (0.00s)
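For anyone unfamiliar with where those TestGet/default names come from, here’s a minimal sketch (in a _test.go file, with a hypothetical stand-in get function rather than the actual exitcode package) of the t.Run subtest pattern that produces them:

package main

import (
	"errors"
	"testing"
)

// get is a hypothetical stand-in for the function under test:
// nil means success (0); any other error maps to exit code 1 here.
func get(err error) int {
	if err == nil {
		return 0
	}
	return 1
}

func TestGet(t *testing.T) {
	tests := []struct {
		name string
		err  error
		want int
	}{
		{"nil", nil, 0},
		{"default", errors.New("unknown"), 1},
	}
	for _, tt := range tests {
		// t.Run creates a subtest; `go test -v` reports it as TestGet/<name>.
		t.Run(tt.name, func(t *testing.T) {
			if got := get(tt.err); got != tt.want {
				t.Errorf("get(%v) = %d, want %d", tt.err, got, tt.want)
			}
		})
	}
}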
Good point! I should’ve mentioned that I’ve mostly used test frameworks which require a certain naming convention to detect tests, such as Pytest’s default test_ prefix, which can be overridden but (AFAIK) not avoided completely.
It definitely does depend on the testing framework you’re using!
As another reply pointed out, this works best when you’re using a DSL where you setup tests similar to this:
it('does the right thing', () => {});
This looks good, and has a nice small API surface. A couple of thoughts:
- Just the source files and a .gitignore :-). No fluff, no subdirectories.
- The ^ and $ anchors in the regex matching should be implicit – I think people will forget them. Also, when would you not want to match the whole thing for routing? (And in the rare cases you do, just add a .*.)

Previous comments: https://lobste.rs/s/lvd3wz/adding_csv_support_go_awk
I haven’t yet spent much time on performance. I intend to profile it properly at some point, but for now, it’s “good enough”: on a par with using Go’s encoding/csv package directly, and much faster than the csv module in Python. It can process a gigabyte of complex CSV input in under five seconds on my laptop.
When did Go get faster than Python at CSV parsing? For a long time, Python was faster.
I think I remember you saying that on another thread, and I was surprised (which is part of the reason I benchmarked against Python). I realize most of the good stuff in Python’s csv module is implemented in C, so I guess I shouldn’t be surprised either way. My guess is Python does more memory allocations, or the dynamic typing / object overhead is significant.
In any case, the numbers in my benchmarks speak for themselves: Python takes 3x as long for both input and output. Feel free to try my benchmarks yourself – you’ll need a huge.csv file, as I haven’t checked it into the repo (or I can send you a link to the one I’m using if you want).
I’d love to see numbers where Python was faster. It’s possible that was “back in the day” when Go produced executables 5-10x as slow as they are today: https://benhoyt.com/writings/go-version-performance/
Thanks! For what it’s worth, below is the best-of-3 time in seconds of Go’s encoding/csv.Reader using this script on a 200MB, 100,000-row CSV file with 286 columns:
Go version,Time (s)
1.5,5.60
1.6,5.36
1.7,5.27
1.8,4.61
1.9,3.74
1.10,1.00
1.11,1.02
1.12,1.00
1.13,0.93
1.14,0.95
1.15,0.94
1.16,0.96
1.17,0.96
1.18,0.91
The Go version that the issue reporter was using was 1.7. It gets significantly faster in Go 1.8 and 1.9, and then way faster in Go 1.10. I think it’s mainly due to CL 72150, but I’m not sure. In any case, between encoding/csv and Go compiler improvements, Go’s CSV reading is 5-6 times as fast now as it was then.
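This isn’t the exact linked script, but the timing harness is roughly this shape (huge.csv is just the benchmark file mentioned earlier): read every record with encoding/csv.Reader and report the elapsed time.

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"time"
)

func main() {
	f, err := os.Open("huge.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	start := time.Now()
	r := csv.NewReader(f)
	rows, fields := 0, 0
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		rows++
		fields += len(record)
	}
	fmt.Printf("read %d rows (%d fields) in %v\n", rows, fields, time.Since(start))
}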
Wow! That’s pretty remarkable improvement. It might be worth a blog post by itself. :-)
The other traditionally “slow” package in Go was regex. I remember my friend’s company was using a C-wrapper library to avoid the slow Go regex package. I wonder what would happen if they retested it now.
Feel free to write that – I’m blog-posted out for a bit. :-)
Unfortunately regexp is still slow, though I did see a couple of CLs recently that might lead to some improvements, for example https://go-review.googlesource.com/c/go/+/355789
I think there has been a bad trend of calling “RFC 4180 quoted CSV” just “CSV” as if that is the “one true meaning”. Unquoted CSV is not “improper” if the data does not require more. RFC 4180 is pretty clear it’s not some kind of “official standard” (like some other RFCs), but just trying to wrangle one description from the chaos. When the ‘S’ actually means what it stands for, regular awk -F can work fine, as can the term “CSV”. Given the trend (& chaos), the way to go now may be to just always qualify… quoted CSV, RFC 4180 CSV, unquoted CSV, Excel CSV, etc.
More substantively, in terms of benchmarking, the pipeline approach with csvquote as the article mentions (or c2tsv) may be less convenient, but it may also allow you to get better wall clock with multi-core (depending on..lots). If you have space to save a processed copy then you can avoid re-doing the extra work all the time. Also, once it is in a well-delimited random access file, you can do non-indexed parallel processing on a big file just by using random access to skip to the next newline like this.
There may be an awk optimization to be had to check if stdin is mmapable to automate this. You’re basically on the road to a map-reduce mechanism wired into awk… somehow, probably with some built-in aggregator/reducer functions. And, of course, disk head seek self-competition can be a problem, but flash storage (or a big enough /dev/shm even) is really common these days. As are lots of cores on laptops and better memory bandwidth to multiple cores than any single core.
I’m not quite sure, but your first paragraph seems to be saying that RFC 4180 requires fields to be quoted, but the RFC definitely allows both “escaped” and “unescaped” fields (to use the wording of the RFC grammar). I agree there are CSV files that aren’t parseable by RFC 4180 rules, but these days they’re few and far between. In addition, GoAWK’s code uses the (effect of the) encoding/csv.Reader LazyQuotes=true and FieldsPerRecord=-1 options, making it a bit more forgiving.
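For reference, those two encoding/csv settings look like this in isolation (a small illustrative snippet, not GoAWK’s actual reader setup):

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	// A stray quote in an unquoted field, and rows with different field counts:
	// both would normally be errors, but the two settings below accept them.
	input := "a,b\"b,c\n1,2\n"
	r := csv.NewReader(strings.NewReader(input))
	r.LazyQuotes = true    // tolerate bare quotes inside fields
	r.FieldsPerRecord = -1 // don't require a consistent field count per record
	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(records)
}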
Interesting points about csvquote and multicore – I hadn’t considered that. (Though eking out the last bit of performance is not my biggest concern at the moment. For GoAWK’s CSV support, I wanted to get the semantics “right”, and I’ll focus on performance later.)
Apologies for any unclarity. I am saying since most software lets you change the 2 delims, “CSV/TSV” terms are muddy. I get the need to refer to the specific. I even called my program “c2tsv”. So, I’m aware of all that, but I do not think there is “one true CSV”, except for the abstract umbrella term suggesting “mostly” split-parseable data with scare quotes to be clarified. Pretending there is a true “CSV” confuses more than helps conversations/innovation.
Being “forgiving” is part of the problem, which is where my later text came from. It is actually the optionality of quoting combined with its non-locality which causes trouble with unindexed segmentation. With well delimited rows and random access files, you can just jump to 1/nCPU into the stat-knowable bytes and scan for a row delimiter (as done in my link). With optional quotes you may have to scan back to the beginning, and this may even be the common case, e.g. if there is no quote in the whole file.
You can build a [rowNumber, byteNumber] index, but then that touches all data. There may be “most of the time for me & my data” workarounds, but with “delimiter-escaped soundly split parseable CSV” (or maybe DECSV for short) there is no problem (other than bigness coming from many rows not a few giant ones and statistical regularity).
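Here’s a rough sketch of that segmentation trick in Go (assuming rows are newline-delimited with no quoted embedded newlines; data.csv and the function names are made up): seek to i/n of the stat-reported size, then skip forward to the next row boundary.

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// segmentOffsets returns n+1 byte offsets so that segment i is [offsets[i], offsets[i+1])
// and every segment starts on a row boundary.
func segmentOffsets(path string, n int) ([]int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	size := info.Size()
	offsets := []int64{0}
	for i := 1; i < n; i++ {
		pos := size * int64(i) / int64(n)
		if _, err := f.Seek(pos, io.SeekStart); err != nil {
			return nil, err
		}
		// Skip the partial row we landed in; the next byte starts a full row.
		line, err := bufio.NewReader(f).ReadBytes('\n')
		if err != nil {
			break // hit EOF: fewer segments than requested
		}
		offsets = append(offsets, pos+int64(len(line)))
	}
	return append(offsets, size), nil
}

func main() {
	offs, err := segmentOffsets("data.csv", 4) // "data.csv" is a made-up name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(offs) // each worker would then process one byte range in parallel
}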
csvquote just moves delims to “unlikely” chars. So, you’d be in trouble with JPG fields (but many Unix utils would choke on NUL bytes anyway… so, maybe out of scope).
I suspect “C vs. DOS/Win FS” backslash wars or “Excel job security” may have inhibited DECSV adoption. You can pick a less colliding escape, but many end-user subpopulations are more likely to interpret "\n" in fields correctly than, say, "^An". I’ve not read through your format war back-links, but I’d expect these ideas arose there. Sorry, I did not intend to rehash all that, but it’s all related.
Whenever there is a complex format like rfc4180 CSV with a simpler to work with alternative like DECSV, it seems better to have a lib/program convert to the simple. Besides parallelism from a pipeline|segmentation, one could hope that it may someday become what is used more commonly without conversion and easier parsing by programmer-users without bugs/test suites/etc.
Kind of related, I’ve been meaning to do some basic experiments/benchmarks to understand when handwritten lexing is faster than regexps (for example, comparing some typical [not necessarily complete] email regexp matcher with a handwritten matcher that matches the same strings).
At the highest level I assume that kind of quickly a lexer will be faster than most regexps since my lexer would be a precise function and regexp is a general purpose engine.
But I’m sure there will be surprises.
Yes, that’s almost certainly the case, especially given that Go’s regexp package is quite slow (guaranteed linear-time, but slow): https://github.com/golang/go/issues/26623
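As a starting point for that kind of experiment, here’s a tiny benchmark sketch (in a _test.go file; a simplified pattern, nothing like a full email regex) comparing Go’s regexp against a handwritten matcher for the same language: one or more lowercase letters followed by one or more digits.

package main

import (
	"regexp"
	"testing"
)

var re = regexp.MustCompile(`^[a-z]+[0-9]+$`)

// handwritten matches the same strings as the regexp above, by hand.
func handwritten(s string) bool {
	i := 0
	for i < len(s) && s[i] >= 'a' && s[i] <= 'z' {
		i++
	}
	if i == 0 || i == len(s) {
		return false // need at least one letter and at least one more char
	}
	j := i
	for j < len(s) && s[j] >= '0' && s[j] <= '9' {
		j++
	}
	return j > i && j == len(s) // at least one digit, and nothing left over
}

func BenchmarkRegexp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		re.MatchString("foobar12345")
	}
}

func BenchmarkHandwritten(b *testing.B) {
	for i := 0; i < b.N; i++ {
		handwritten("foobar12345")
	}
}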
FWIW you can also generate code from regexes – i.e. it can be a regex compiler rather than an interpreter.
For example Oil’s lexer uses re2c to generate C code.
intermediate source: https://www.oilshell.org/release/0.10.0/source-code.wwz/_devbuild/tmp/osh-lex.re2c.h
generated code: https://www.oilshell.org/release/0.10.0/source-code.wwz/_devbuild/gen/osh-lex.h
The generated code is a big state machine with switch and goto, and it doesn’t allocate memory or anything like that. It’s quite fast in my experience.
I meant to write a blog post on re2c, but haven’t gotten around to it. Some pictures here:
https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks
Another benefit of regular languages is that you get (simulated) nondeterminism FOR FREE. It’s still a linear pass, but it can be in multiple states at once, and then make the decision much later.
So it can be easier to write certain kinds of lexers, and even move work from the parser to the lexer. I wrote about related topics here but didn’t quite emphasize the nondeterminism:
This was a really good read from the horses’ mouths. It definitely fits my experience: regardless of the language itself, the short build times, good standard library, great tooling, gofmt, approach to dependencies, ease of cross-compilation, “just copy the binary” deployments … this tooling is really a big part of Go’s appeal.
That said, I think they’re underselling the language itself a bit. I love Go’s unique approach to interfaces (“static duck typing”) and simple type system (I think of it as the anti-Scala). The simple language with few keywords and a short, readable spec. Reduced boilerplate with no explicit public keyword, the := declare-and-assign syntax, and syntax for things like string slicing and map lookup/assignment. I do think error handling boilerplate could be improved, but at least it’s simple and explicit.
I’m somewhat surprised the article says that “we left any form of generics out of the original language”, because Go has always had a “form of generics” with its builtin map and slice and channel types: map[K]V, []T, chan T. Go always had generic collections, you just couldn’t define your own.
I love Go’s unique approach to interfaces
I, too, like structural typing, but it isn’t unique to Go.
I really like how Go’s Duration type handles this, letting us do time.Sleep(5 * time.Second)
Unfortunately Duration is not a type, but an alias for an integer, so this mistake compiles:
time.Sleep(5);
Your point stands about the mistake, but just to clarify the terminology: Duration is a defined type, not an alias (alias is specific Go terminology which means it behaves exactly the same as that type). The reason this mistake compiles is because literals in Go are “untyped constants” and are automatically converted to the defined type. However, these will fail, because s and t take on the concrete type int when they’re defined:
var s int
s = 5
time.Sleep(s)
t := 5
time.Sleep(t)
The thing I dislike the most about Go’s Duration type is that you can’t multiply an int by a Duration:
To convert an integer number of units to a Duration, multiply:
seconds := 10
fmt.Print(time.Duration(seconds)*time.Second) // prints 10s
In the example above, the intent is somewhat clear due to the seconds variable name, but if you just want to have something like this:
some_multiplier := ...
delay := some_multiplier * (1 * time.Second) // won't work
You have to convert some_multiplier to time.Duration, which doesn’t make sense!
Go doesn’t allow for operator overloading, which I’m kind of okay with. It tends to add complexity for (what I personally consider to be) little benefit.
On the other hand, this is the kind of case that really makes the argument for operator overloading. Having to use a bunch of alternate specific-to-the-type function implementations to do common operations gets tiresome pretty quickly.
So Go has different operators for adding floats and for adding integers? I have seen that in some languages, but it’s nevertheless quite unusual. OTOH, I can see that it reduces complexity.
Go has built-in overloads for operators, but user code can’t make new ones.
It’s similar to maps (especially pre-1.18) that are generic, but user code is unable to make another type like map.
I agree it is annoying. Would a ‘fix’ be to alter/improve the type inference (assuming that some_multiplier is only used for this purpose in the function) so that it prefers time.Duration to int for the type inferred in the assignment?
I’m not sure it would be an incompatible change - I think it would just make some incorrect programs correct. Even if it was incompatible, maybe go2?
While I do think Go could do more to work with underlying types instead of declared types (time.Duration is really just an alias for int64, as a time.Duration is just a count of nanoseconds), it does make sense to me to get types to align if I want to do arithmetic with them.
My perfect world would be if you have an arbitrary variable with the same underlying type as another type, you could have them interact without typecasting. So
var multiplier int64 = 10
delay := multiplier * time.Second
would be valid Go. I get why this will probably never happen, but it would be nice.
That’s how C-based languages have worked, and it’s a disaster. No one can keep track of the conversion rules. See https://go.dev/blog/constants for a better solution.
If you define some_multiplier as a constant, it will be an untyped number, so it will work as a multiplier. Constants are deliberately exempted from the Go type system.
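A quick, playground-style illustration of that difference (constant and variable names are made up):

package main

import (
	"fmt"
	"time"
)

const multiplierConst = 3 // untyped constant

func main() {
	delay := multiplierConst * time.Second // OK: the untyped constant becomes a Duration
	fmt.Println(delay)                     // 3s

	multiplierVar := 3 // typed: int
	_ = multiplierVar
	// delay = multiplierVar * time.Second // compile error: mismatched int and time.Duration
}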
I’ve never actively used Go, but is the compiler able to infer types here?
uniqValues := lo.UniqBy[int, int]([]int{0, 1, 2, 3, 4, 5}, func (i int) int {
return i%3 // ^^^^^^^^^^ here
})
I might expect such expressions to be more generally written like:
uniqValues := lo.UniqBy(slice, fn)
Which feels much better to my eyes
There is some inference that can be done by the Go compiler to elide this. Perhaps the author of the library just likes the verbosity?
Interestingly enough (after playing around with Go 1.18), there are already linters that will warn about unnecessary type annotations that can be inferred.
there are already linters that will warn about unnecessary type annotations that can be inferred
What linters? I’m curious to know which linters, etc. have been updated for 1.18. The last time I checked, e.g., goimports still didn’t like generics at all.
So, I just recompiled all the tools that VSCode (+ the Go extension) comes with using go1.18beta2, and when you do that, VSCode comes with some linter called infertypeargs that will do this. I just tried it out again and it is indeed working.
The road is still a tad bumpy for IDE/tooling integration with Go generics, but I imagine it’ll smooth out in no time.
Just a note: this comes from gopls – it’s not a standalone linter.
Yeah, Go 1.18 infers this just fine, see https://go.dev/play/p/GSfVy3Xm4fV?v=gotip – I guess the author wanted to be explicit in the examples. I’m not sure why, as type inference is one of the really nice things about generics.
I prefer just putting the data directly into the code:
This has been a source of frustration working with C# recently, since all the examples assume you want to put your data in an external XML file.
That works, but when you have dozens or hundreds of tests, each with a couple of pages of source code or text (or even binary data) in them, it becomes unwieldy to include them all directly in the code. As Eli says, “the input to a test can be more data than we’re willing to paste into a _test.go file, or the input can be non-text”.
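For context, this is roughly the external-files pattern being discussed (the paths and the process function are hypothetical placeholders): each file under testdata/ becomes one subtest, with a matching .golden file holding the expected output.

package main

import (
	"os"
	"path/filepath"
	"testing"
)

// process is a placeholder for whatever the real code under test would do.
func process(b []byte) []byte { return b }

func TestProcess(t *testing.T) {
	paths, err := filepath.Glob("testdata/*.input")
	if err != nil {
		t.Fatal(err)
	}
	for _, path := range paths {
		path := path
		t.Run(filepath.Base(path), func(t *testing.T) {
			input, err := os.ReadFile(path)
			if err != nil {
				t.Fatal(err)
			}
			golden := path[:len(path)-len(".input")] + ".golden"
			want, err := os.ReadFile(golden)
			if err != nil {
				t.Fatal(err)
			}
			if got := process(input); string(got) != string(want) {
				t.Errorf("%s: got %q, want %q", path, got, want)
			}
		})
	}
}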
If it’s binary files, then I can see wanting to use external files. For anything else, we already have lots of tools for handling large amounts of source code. Why build new ones for a new format?
But I also really often have functions as data to my tests, and that you can’t do sensibly with external files.
This is an amazing feat. But shouldn’t the post title be “SectorC: A C Compiler in 512 bytes”?