1. 13
  1. 1

    So Topfew distributes all that filtering and stream-editing and runs it in parallel, since it’s all independent, then pumps the results over (mutexed, heavily buffered) to the (necessarily) single thread that does the top-few counting.

    Is there any reason this can’t be a monoid so you’d also get parallel counting? That is, could you use the Reduce part of MapReduce to count in parallel?
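    To make the monoid idea concrete: per-key counts under addition form a commutative monoid, so each worker can build its own count map and the maps can be merged in any order in a reduce step. A minimal sketch (illustrative only, not Topfew’s actual code):

    ```go
    package main

    import "fmt"

    // merge folds b's counts into a. Because addition is associative and
    // commutative, shards can be merged pairwise, in any order, in parallel.
    func merge(a, b map[string]uint64) map[string]uint64 {
    	for k, v := range b {
    		a[k] += v
    	}
    	return a
    }

    func main() {
    	// Two hypothetical per-worker count maps over log keys.
    	shard1 := map[string]uint64{"/a": 3, "/b": 1}
    	shard2 := map[string]uint64{"/a": 2, "/c": 5}
    	total := merge(shard1, shard2)
    	fmt.Println(total["/a"], total["/b"], total["/c"]) // 5 1 5
    }
    ```

    The top-few selection itself then only needs to run once, over the merged map.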

    1. 1

      Cool to know that tf exists. I feel like data everywhere ought to be easier to query in more complicated ways, and this seems like a contribution to that.

      An approach to cutting garbage when streaming in Go is to reuse your buffers and minimize copies. In this case, instead of calling ReadBytes, which allocates a fresh byte slice per line, you might call bufio.Scanner’s Bytes method, which underneath reads blocks into the same []byte buffer over and over and returns slices pointing at one line within it. Then you have to make sure you don’t retain those slices across reads, and copies/allocations later in the processing chain can still bite you. I don’t know how it would have worked out in this context, and it might be especially tricky to make work with the parallel stuff.
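      A small sketch of the Scanner.Bytes pattern (synthetic input, not tf’s code): the returned slice aliases the scanner’s internal buffer, so it is only valid until the next Scan call.

      ```go
      package main

      import (
      	"bufio"
      	"bytes"
      	"fmt"
      	"strings"
      )

      func main() {
      	r := strings.NewReader("alpha\nbeta\ngamma\n")
      	sc := bufio.NewScanner(r)
      	count := 0
      	for sc.Scan() {
      		line := sc.Bytes() // slice into the scanner's buffer; invalidated by the next Scan
      		if bytes.HasPrefix(line, []byte("b")) {
      			count++
      		}
      		// To keep a line past this iteration you must copy it:
      		// kept := append([]byte(nil), line...)
      	}
      	fmt.Println(count)
      }
      ```

      The copy-on-retain rule is exactly where this pattern gets dangerous once lines are handed off to other goroutines.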

      Also, if tf doesn’t have much live heap data, Go will GC frequently to keep the heap small, when you might rather save some CPU by using more RAM. That’s a long-discussed issue, usually worked around with tricks like heap ballast or a higher GOGC setting.
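      For reference, the GOGC knob can also be set from code via runtime/debug.SetGCPercent; this sketch just shows the call, with no claim about what setting would suit tf:

      ```go
      package main

      import (
      	"fmt"
      	"runtime/debug"
      )

      func main() {
      	// Equivalent to running with GOGC=400: let the heap grow to
      	// roughly 5x live data between collections, trading RAM for
      	// fewer GC cycles.
      	old := debug.SetGCPercent(400)
      	// SetGCPercent returns the previous setting; restoring it
      	// returns the value we just set.
      	fmt.Println(debug.SetGCPercent(old) == 400)
      }
      ```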

      I’m not saying tbray should have done this stuff, nor am I trying to draw larger conclusions about the speed or goodness of Go or anything like that. Just noting this for any practical interest if you do this kind of thing in Go.

      1. 1

        I wonder what future performance optimization will look like. If we’re going to have 64- or 256-core CPUs in every toaster, the focus will have to shift toward parallelizing work and fighting Amdahl’s law.

        Today’s memory-speed bottleneck makes memory locality more important than instruction count, so clever data structures often lose to linear brute force. I guess the next stage is efficient sequential algorithms losing to simply throwing data at as many cores as possible.