I am working on a project to generate 1 billion rows in SQLite in under a minute; so far it inserts 100M rows in 33 seconds. First, I generate the rows and insert them into an in-memory database, then flush them to disk at the end. Flushing to disk takes only 2 seconds, so 99% of the time is spent generating and adding rows to the in-memory B-tree.
For Python optimisation, have you tried PyPy? I ran the same code (zero changes) under PyPy and got a 3.5x speedup.
I published my findings here.
An easy way to do this might be to just create the DB in /dev/shm (on Linux anyway) and later copy it to a persistent device (if you have enough DIMM space).
Sending PRAGMA synchronous=off to SQLite or using libeatmydata might be easier; either way your goal is to skip fsync / fdatasync calls (since this data could just be re-generated anyway).
This is what I am doing!
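Putting this sub-thread together, here is a rough Go sketch of the general recipe, not the poster’s actual code (the thread doesn’t say what language the generator is written in; the github.com/mattn/go-sqlite3 driver, the table, and the row count here are illustrative, and VACUUM INTO needs SQLite 3.27+): build everything in an in-memory database with synchronous off, batch the inserts in a single transaction, and only touch disk once at the end.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3" // assumed driver; any SQLite driver should work
    )

    func main() {
        db, err := sql.Open("sqlite3", ":memory:")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        // database/sql pools connections, and every new connection to ":memory:"
        // gets its own database, so pin everything to a single connection.
        db.SetMaxOpenConns(1)

        // Durability is pointless here: the data can simply be regenerated.
        if _, err := db.Exec("PRAGMA synchronous = OFF"); err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec("CREATE TABLE gen (id INTEGER PRIMARY KEY, val TEXT)"); err != nil {
            log.Fatal(err)
        }

        // One big transaction with a prepared statement for the bulk insert.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        stmt, err := tx.Prepare("INSERT INTO gen (val) VALUES (?)")
        if err != nil {
            log.Fatal(err)
        }
        for i := 0; i < 100_000_000; i++ {
            if _, err := stmt.Exec(fmt.Sprintf("row-%d", i)); err != nil {
                log.Fatal(err)
            }
        }
        stmt.Close()
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }

        // Flush the finished in-memory database to disk in one pass.
        if _, err := db.Exec("VACUUM INTO 'generated.db'"); err != nil {
            log.Fatal(err)
        }
    }

Batching everything into one transaction with a prepared statement matters too; per-row autocommit adds measurable overhead even for an in-memory database.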
This is an interesting observation and I think the message is valid, but I don’t think the benchmark results paint an accurate picture. The author is completely utilizing the computer’s disk I/O bandwidth, but the CPU cores are only utilised at (I’ll take a guess) 12.5%, assuming 8 hyper threads available. I think this distinction is particularly important in the big data case mentioned at the end.
It may be far worse. As Python has no “@associative” built-in for its counter, it is probably running on a single core, with that same core handling the overhead of data movement as well.
That said, it is a guess. My humble view is that we are no longer capable of optimizing without careful analysis of actual profiling data. Our guesses and logic are no match for chip pipeline and thread oddness.
Yes, there are definitely straightforward ways to parallelize this particular problem. That’s part of the reason I put “big data” in quotes. Often data just isn’t “big” enough to worry about it – if I can get the answer I want out of a 400MB or even 4GB input file in a few seconds, I’ll probably stop there. However, if I have to write a more general tool that I’ll use over and over, I’d parallelize it if it was slow (for some value of slow).
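For the curious, here is a rough Go sketch of one such straightforward parallelization (the file name, chunking strategy, and tokenization are illustrative, not taken from the article): split the input at whitespace so no word straddles a chunk, count each chunk in its own goroutine, and merge the small per-chunk histograms at the end.

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "runtime"
        "sync"
    )

    // countWords builds a histogram for one chunk. Tokenization is deliberately
    // simplistic: split on whitespace and lowercase, nothing more.
    func countWords(chunk []byte, out chan<- map[string]int) {
        counts := make(map[string]int)
        for _, w := range bytes.Fields(chunk) {
            counts[string(bytes.ToLower(w))]++
        }
        out <- counts
    }

    func main() {
        data, err := os.ReadFile("input.txt") // hypothetical input file
        if err != nil {
            panic(err)
        }

        nWorkers := runtime.NumCPU()
        results := make(chan map[string]int, nWorkers)
        var wg sync.WaitGroup

        // Cut the input into roughly equal chunks, extending each one to the
        // next space or newline so no word straddles a chunk boundary.
        chunkSize := len(data) / nWorkers
        start := 0
        for w := 0; w < nWorkers; w++ {
            end := start + chunkSize
            if w == nWorkers-1 || end >= len(data) {
                end = len(data)
            } else {
                for end < len(data) && data[end] != ' ' && data[end] != '\n' {
                    end++
                }
            }
            wg.Add(1)
            go func(chunk []byte) {
                defer wg.Done()
                countWords(chunk, results)
            }(data[start:end])
            start = end
        }
        go func() { wg.Wait(); close(results) }()

        // Merging is cheap relative to counting: the per-chunk histograms are small.
        total := make(map[string]int)
        for m := range results {
            for word, n := range m {
                total[word] += n
            }
        }
        fmt.Println("distinct words:", len(total))
    }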
Maybe it relates to the international nature of Internet communication/English as a second language effects, but a lot of readers seem to not grasp the use of scare quotes for “non-standard” or “difficult to interpret” or “context dependent/no standard even exists, really”. FWIW, I also love them as a pithy “author understands interpretational risk” signal. I don’t have any better answer than more elaborate text – like “some value of”/“in context”/etc. qualifier constructions, but the resulting more complex grammar can be a problem for the very same readers.
This is all especially an issue in performance of anything where almost everything “all depends” upon, well, quite a lot. E.g., these days (post Nehalem Intel/L3) DIMM reads themselves depend quite a lot on whether done in parallel or serial. So, I have an NVMe that goes 7 GiB/s while your DIMM estimate was 10 G, but I’d bet your parallel DIMM reads are 25..30. Big “scalable” Intel multi-core could see a range from 8 to 50 GiB/s (or more, even a whole order of magnitude). Personally, it annoys me that I am sometimes forced to go parallel (not necessarily multi-threaded) with all its attendant complexity on an otherwise very serial problem just to saturate a serial memory bus (yes, I know NUMA can be a thing). As this paragraph maybe shows, qualifying everything gets indigestible fast. ;-)
Anyway, @enobayram, to add a little more detail on your point – this is one way to parallelize the problem in Nim. Note that scaling up by just doing 10X something like the KJ bible keeps hash tables artificially small relative to natural-language vocabulary use, which matters since fitting each vocabulary-scaled histogram in the private (i.e. maybe non-hyper-threaded) L2 can make a big perf difference at some data scales. Whether hyper-threaded CPU resource sharing helps also “all depends”. At least on Linux, something like the taskset command might aid in reasoning about that.
At the risk of showing my ignorance… How do you measure if it’s I/O or processing? If it is the processing step, how do you figure out which part? I’ve used callgrind before, but it seems to “bottom out”: it shows that one function accounts for most of the time, but the next level down simply has many calls to a function that’s seemingly fast, so I’m left not knowing what to optimize. Just curious what folks use for profiling their code, with a focus on optimizing it.
There’s no silver bullet. Performance optimization is a skill set that gets built up over time, and it is mostly measurement plus intuition. The measurement comes from profilers and benchmarks (which you should get in the habit of running in some cases), and the intuition is knowing what is likely to be slow and, most importantly, whether it matters if it’s slow. (A small profiling sketch follows the quote below.)
There’s a famous (but often subquoted without the context) Knuth quote here:
“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
— “Structured Programming with goto Statements”
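To make the “measurement comes from profilers” point above concrete in Go (the language of the article under discussion), here is a minimal sketch; processFile and the file name are placeholders:

    package main

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    // processFile is a placeholder for whatever work you suspect is slow.
    func processFile(path string) error {
        // ... read and process the file ...
        return nil
    }

    func main() {
        f, err := os.Create("cpu.prof")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Sample where CPU time goes while the work runs.
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        if err := processFile("input.txt"); err != nil {
            log.Fatal(err)
        }
    }

Inspecting the result with go tool pprof cpu.prof shows, level by level, which functions the CPU time actually goes to. Time spent blocked on I/O barely shows up in a CPU profile, which is itself a useful signal: a slow program with an almost-empty CPU profile is waiting on something rather than computing.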
It’s worth noting that intuition is very dangerous here because it changes over time. I first encountered this with some code optimised by someone whose intuition was rooted in the ’80s. He had put a lot of state that needed to be passed through multiple levels of function call before being used into globals, to avoid stack pushes and pops. I rewrote it in the naive form with function arguments and it got 20% faster: the compiler was able to keep the values in registers and then had useful alias information, so it generated better code.
These days, caches and rename-register pressure have a huge impact on performance. I don’t think I saw rename-register pressure affect anything 10 years ago. I’ve had several conversations with systems programmers over the last few years where they’d been worried about performance effects based on a mental model of how computers work that’s 20 years out of date.
That’s completely fair! I tried to account for that with the habitual measurement part, but that only addresses part of the problem! You won’t necessarily understand why the measurements are changing over time. On the other hand, I bet creativity, trial and error, and measurement can substitute for full understanding of the architecture with good results, a lot of the time.
Also of note: I realized that my comment also forgot the very important “understanding of how to fix the issues”, which is some combination of creativity and understanding of “systems”, data structures, algorithms, etc.
I have heard the quote about premature optimization countless times. I didn’t know the author was Knuth.
I wish people always gave the full version of that quote with attribution and not the “premature optimization is the root of all evil” part taken out of context indeed!
Brendan Gregg has a lot of good guides to performance optimization: https://brendangregg.com/
In particular, his USE method is a great starting point: it gives you a framework to use when investigating any kind of computer performance issue. For Linux specifically, he has a whole page of Linux Performance tools and techniques.
I also enjoy Bruce Dawson’s blog Random Ascii. He has more of a Windows focus rather than Linux (he’s the author of UIforETW) but a lot of the general methodology is cross-platform.
The optimized Go implementation will count words twice when they cross the buffer boundary. Which is a bit of a shame, as the Go standard library provides not one but two(!) interfaces for exactly this task: bufio.ReadString and bufio.Scanner.
My other observation would be that 413MB isn’t ‘large’ in a meaningful sense. It’s small enough to read completely into memory on an early-2000s laptop. It’s small enough to comfortably fit in the DRAM cache of a consumer-grade SSD. Which doesn’t mean that this article’s intuition is completely wrong, but that the ratio might change considerably if datasets that are actually large are used.
Can you show me how (or give a test case where) it counts words twice when they cross the buffer boundary? I’ve double-checked the code and run various tests with a reduced buffer size, and it doesn’t seem to do any double-counting. I used bufio.Scanner in the original simple Go version, but it’s hard to separate out read time from processing time, because the scanner does both, so for the simple version in this article I’m just reading the whole file. For the optimized version, reading large chunks and then scanning the bytes by hand is faster, because it avoids allocations.
You’re probably right that I should have used a larger dataset. Still, I think it proves the main point of the article, that disk I/O isn’t the bottleneck for processing like this. I don’t think going up 10x or 100x would have changed that.
You are right; my apologies, I misread the code and assumed there was a bug.
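For anyone wondering how chunked reading can avoid splitting or double-counting a word at a buffer boundary, here is a rough sketch of the general technique — not the article’s code, and it assumes plain ASCII whitespace: hold back any trailing partial word and prepend it to the next chunk.

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "os"
    )

    func main() {
        f, err := os.Open("input.txt") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()

        counts := make(map[string]int)
        buf := make([]byte, 64*1024)
        var carry []byte // partial word left over from the previous chunk

        for {
            n, err := f.Read(buf)
            if n > 0 {
                chunk := append(carry, buf[:n]...)
                // Everything after the last whitespace may be a partial word,
                // so hold it back and prepend it to the next chunk instead.
                cut := bytes.LastIndexAny(chunk, " \t\r\n")
                var rest []byte
                if cut >= 0 {
                    rest = chunk[cut+1:]
                    chunk = chunk[:cut+1]
                } else {
                    rest, chunk = chunk, nil
                }
                for _, w := range bytes.Fields(chunk) {
                    counts[string(bytes.ToLower(w))]++
                }
                carry = append([]byte(nil), rest...) // copy: buf gets reused
            }
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
        }
        // Count whatever (if anything) is left over after the final chunk.
        for _, w := range bytes.Fields(carry) {
            counts[string(bytes.ToLower(w))]++
        }
        fmt.Println("distinct words:", len(counts))
    }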
On a modern NUMA system, reading from and writing to main memory is the biggest kind of I/O bottleneck. CPU cache is the new RAM, and main memory is the new disk.
This is definitely an interesting point to keep in mind. I remember often thinking things that “can’t matter relative to I/O” (things like Django models getting built up rather than using .values_list) end up being significant contributors to slowness.
As usual, of course, the answer to all of this is measure, measure, measure.
This also impacts operating system design. FreeBSD’s GEOM stack was originally built with some fairly expensive abstractions that were assumed to be fine given the 1ms+ latency of even the fastest storage devices. SSDs and, especially, NVMe changed that and a lot of the work over the last decade has been carefully punching through some of those layers for places where the CPU had suddenly become the bottleneck.
Disk encryption is a particularly interesting case where (not specific to FreeBSD) the calculation has changed a few times. With spinning rust, decompression added negligible latency. With SSDs, it suddenly became noticeable. Then newer compression algorithms (lz4 and zstd, in particular) came along and a single core was able to do stream decompression at the disk’s line rate. This was further improved by the fact that modern hardware can DMA directly to the last-level cache, so you read compressed data from disk to cache, decompress in cache, and then write decompressed data out to DRAM. The DRAM bandwidth is far wider than the disk bandwidth and so the disk remains the bottleneck here and the compression improves throughput with a negligible latency hit again, just as it did 20 years ago.
The topic of this paragraph suddenly changed halfway through from encryption to compression. Was that deliberate?
At about the same time, AES acceleration became pretty much ubiquitous. Offhand, a 2016 Intel Mac with AES-NI gets about 6GB/s doing AES-GCM on one CPU core.
No, my brain is still a bit fuzzy from the cold I’ve been suffering from. “Encryption” was meant to read “compression”. Encryption is also interesting, though, with AES hardware on modern CPUs. Before that, PCI crypto accelerators had interesting performance characteristics, with both high latency and high throughput, making them useful in some situations but a source of overhead in others.
I have also been surprised by how slow Django JSON serialization can be.
Call me a fanboy (I am), but I’m wondering how fast a Haskell implementation with a prefix tree and naive IO would be.
This might be a silly question, but aren’t writes and reads to most modern filesystems buffered?
The author might be measuring memory access times. What happens if he runs sync afterwards?
He has a link in his article to “this method”, where he uses the Linux vm drop_caches mechanism (/proc/sys/vm/drop_caches) to get cold-cache times. This seems to have been overlooked by a great many article readers, both here and on the orange site. Probably should have been bold-faced or something. :-)
Also, he is not writing to disk. So, a post-sync is not relevant.
And it is theoretically possible that his NVMe device has its own DRAM buffer, but you only mentioned FS-in-OS-like concepts. In light of this, though, it would be a better benchmark design to read a bunch of unrelated data to flush out any device DRAM in between cold-cache runs (multiple runs, not a single one, to get some idea of variation).
Lately, I have liked suggesting something like repeat 10 emit FP time... | sort -g | head -n3 | mean-stderrMean at a minimum. This only gives you an upper bound to the t0 in measured time = t0 + noise, but it at least filters some heavy-tailed system noise and gives some assessment of run-to-run variation to see if your setup moderately controls that. Granted, this means you need like 20 runs (2x the repeat 10 to assess run-to-run), but that is not so high a price to pay for credible reproducibility on at least the same test machine. It is at least often easier than rebooting into a single-user, fixed CPU frequency mode (which can also help reproduction, and even get you down to single-digit microsecond measurement errors), and there is less to understand than fancy statistical estimator models for the true minimum t0.
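For anyone who wants that recipe as a small tool rather than a shell pipeline, here is a rough Go sketch of the same idea (the repeat/keep counts are just the values suggested above): run the command 10 times, keep the 3 fastest runs as an upper bound on t0, and print their mean and standard error.

    package main

    import (
        "fmt"
        "math"
        "os"
        "os/exec"
        "sort"
        "time"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: bench <command> [args...]")
            os.Exit(1)
        }
        const runs, keep = 10, 3

        times := make([]float64, 0, runs)
        for i := 0; i < runs; i++ {
            start := time.Now()
            cmd := exec.Command(os.Args[1], os.Args[2:]...)
            if err := cmd.Run(); err != nil {
                fmt.Fprintln(os.Stderr, "run failed:", err)
                os.Exit(1)
            }
            times = append(times, time.Since(start).Seconds())
        }

        // Keep only the fastest runs: the slow tail is mostly system noise.
        sort.Float64s(times)
        best := times[:keep]

        var mean float64
        for _, t := range best {
            mean += t
        }
        mean /= float64(len(best))

        var variance float64
        for _, t := range best {
            variance += (t - mean) * (t - mean)
        }
        variance /= float64(len(best) - 1)
        stderrOfMean := math.Sqrt(variance / float64(len(best)))

        fmt.Printf("mean of %d fastest runs: %.6fs +/- %.6fs\n", len(best), mean, stderrOfMean)
    }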