So, this is a very interesting exercise, but I think it’s also a case of optimizing away something the .NET runtime is already very good at handling (lots of short-lived allocations) and explicitly optimizes for (via the GC nursery, i.e. Gen 0).
What’s interesting is that the two metrics that aren’t being directly optimized (peak working set and total time in ms), which feel like the real metrics worth keeping an eye on, don’t improve by nearly as much as total allocations do. The time goes from 8.7 seconds to 6.7 seconds and the peak working set from 16.7 MB to 12.6 MB. These improvements may or may not be worth the increase in code complexity (certainly, rolling one’s own decimal and integer parsing starts to head into “are we re-inventing too much?” territory, though not egregiously so).
These certainly are improvements, and they would be great to have if they were spread across hundreds to thousands of copies of this process. After all, it’s a ~20% savings in time/memory, and if that’s spread across $1000 or more of cloud compute resources, you just saved $200 per billing period, which is commendable.
I’m actually impressed by how well the .NET runtime performs, but at the same time, the code started from a place of already avoiding some of the truly bad mis-coding I’ve seen in C# ETL (things like reading a file one line at a time and concatenating each line onto a string, leading to O(N^2) allocations and compute time). This isn’t an overnight-to-20-minutes performance improvement; it’s a much more incremental improvement in an area where the baseline is already pretty decent.
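To make that anti-pattern concrete, here is a minimal sketch (not code from the article) of the O(N^2) concatenation mistake described above, alongside the linear-time fix:

```csharp
// Minimal sketch (not from the article) of the O(N^2) anti-pattern:
// each `+=` copies everything accumulated so far, so N lines cost on
// the order of N^2 character copies and N throwaway string allocations.
using System.IO;
using System.Text;

static class ConcatAntiPattern
{
    public static string ReadAllSlow(string path)
    {
        string all = "";
        foreach (string line in File.ReadLines(path))
            all += line + "\n"; // allocates and copies a growing string on every line
        return all;
    }

    // Linear-time alternative: accumulate in a StringBuilder, or better,
    // process each line as it streams by and never build the big string.
    public static string ReadAllLinear(string path)
    {
        var sb = new StringBuilder();
        foreach (string line in File.ReadLines(path))
            sb.Append(line).Append('\n');
        return sb.ToString();
    }
}
```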
If the strings are immutable, why is string.Split(",") allocating so much? In Go, for example, it would return slices pointing into the original string, and a slice header is only a few bytes.
System.String.Split predates Span et al., so it is likely allocating new strings instead.

Thanks, I don’t know anything about the .NET environment.
It seems like OP could have used a version of Split that returns an array of Spans and he would have gotten 90% of the way there while keeping the code straightforward.
.NET never shares substrings like Java or Go (the object layout is “Pascal-style”, length followed by content, so you can’t do this without changing representations)
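For what it’s worth, here is a rough sketch of the span-based approach being suggested above (illustrative only, not the article’s code, and the sample line is made up): walk the line as a ReadOnlySpan<char> and parse fields in place, so no per-field substring is ever allocated.

```csharp
// Illustrative sketch only: slicing a CSV-style line with ReadOnlySpan<char>
// instead of string.Split, and parsing fields without creating substrings.
using System;

class SpanSplitSketch
{
    static void Main()
    {
        ReadOnlySpan<char> line = "2019-01-21,42,3.14".AsSpan(); // made-up sample data

        while (!line.IsEmpty)
        {
            int comma = line.IndexOf(',');
            ReadOnlySpan<char> field = comma >= 0 ? line.Slice(0, comma) : line;
            line = comma >= 0 ? line.Slice(comma + 1) : ReadOnlySpan<char>.Empty;

            // int/decimal expose span-based parse overloads on .NET Core 2.1+,
            // so the field never has to become a string at all.
            if (int.TryParse(field, out int n))
                Console.WriteLine($"int: {n}");
            else if (decimal.TryParse(field, out decimal d))
                Console.WriteLine($"decimal: {d}");
            else
                Console.WriteLine($"text: {field.ToString()}"); // allocates only here
        }
    }
}
```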
The pull request from David Fowl is also worth a read. It uses Spans and System.IO.Pipelines.
I hope the shop this guy works for isn’t using Microsoft SQL Server. That’s the only justification I can think of right now for writing custom ETL code instead of using SQL Server Integration Services, or any of the other tools SQL Server provides for handling bulk import/export.
I don’t really understand this. Sure, it’s cool to optimize something so well, but I don’t see the point of going to so much effort to reduce memory allocations. The time taken to run this, which seems like what you would actually care about, is all over the place and doesn’t get reduced that much. Why do we care about the number of allocations and GC cycles? If you care that much about not “stressing the GC”, whatever that means, then it’s better to switch to a non-GC language than to jump through hoops to get a GC language to not do its thing.
On the contrary, I found this article a refreshing change from the usual Medium fare. Specifically, this article is actually technical, has few (any?) memes, and shows each step of optimization alongside data. More content like this, please!
More to your point, I imagine there was some sort of constraint necessitating it. The fact that the allocation size dropped so drastically fell out of using a pooled allocator.
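For anyone unfamiliar with what a pooled allocator looks like in .NET, here is a minimal sketch using ArrayPool<T> from System.Buffers; whether the article uses exactly this type is an assumption on my part, but the shape is the same: rent, use, return.

```csharp
// Sketch of the pooled-allocator idea: reuse buffers instead of
// allocating a fresh array (and a fresh GC object) per batch.
using System;
using System.Buffers;

class PooledBufferSketch
{
    static void Main()
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024); // may hand back a larger array
        try
        {
            // ... fill and process the buffer here instead of new byte[...] per batch ...
            Console.WriteLine($"Rented a buffer of {buffer.Length} bytes");
        }
        finally
        {
            // Returning lets the next caller reuse it, so steady-state
            // allocation drops to (almost) zero.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```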
Right at the beginning of the article, it says:

“This data is then used to power our real-time calculations. Currently this import process has to take place outside of business hours because of the impact it has on memory usage.”
So: They’re doing bulk imports of data, and the extra allocation produces so much overhead that they need to schedule around it (“outside of business hours”). Using 7.5GB may be fine for processing a single input batch on their server, but it’s likely they want to process several data sets in parallel, or do other work.
Sure, they could blast the data through a DFA in C and probably do it with no runtime allocation at all (their final code is already approaching a hand-written lexer), but completely changing languages/platforms over issues like this has a lot of other implications. It’s worth knowing if it’s manageable on their current platform.
“They’re doing bulk imports of data, and the extra allocation produces so much overhead that they need to schedule around it”

That’s what they claim, but it sounds really weird to me. I’ve worked with plenty of large data imports in GCed languages, and have never had to worry about overhead, allocation, GC details, etc. I’m not saying they don’t have these problems, but it would be even more interesting to hear why these things are a problem for them.
Also of note - their program never actually used 7.5GB of memory. That’s the total allocated over the course of the program, virtually all of which was surely GC’ed almost immediately. Check out the table at the end of the article - peak working set, the highest amount of memory actually used, never budged from ~16 MB until the last iteration, where it dropped to ~12 MB. Extra allocations and GC collections are what dropped. Going by the execution time listing, the volume of allocations and collections doesn’t seem to have had much noticeable effect on anything. I’d very much like to know exactly what business goals they accomplished by all of that effort to reduce allocations and collections.
You’re right – it’s total allocations along the way rather than the allocation high water mark. It seems unlikely they’d go out of their way to do processing in off hours without running into some sort of problem first (so I’m inclined to take that assertion at face value), though I’m not seeing a clear reason in the post.
Still, I’ve seen several cases where bulk data processing like this has become vastly more efficient (from hours to minutes) by using a trie and interning common repeated substrings, re-using the same stack/statically allocated buffers, or otherwise eliminating a ton of redundant work. If anything, their timings seem suspicious to me (I’d expect the cumulative time to drop significantly), but I’m not familiar enough with the C# ecosystem to try to reproduce their results.
From what I understood, the 7.5GB is total allocations, not the amount of memory held resident, which was around 15 MB. I’m not sure why the memory usage requires running outside business hours.
EDIT: Whoops, I see you responded to a similar comment that showed up below when I was reading this.
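To illustrate the distinction being discussed (cumulative allocations vs. resident memory), here is a small illustrative program; it is not the article’s measurement code, and GC.GetTotalAllocatedBytes requires .NET Core 3.0 or later.

```csharp
// Illustrative only: "total allocated" and "peak working set" are very
// different numbers when allocations are short-lived.
using System;
using System.Diagnostics;

class AllocVsResident
{
    static void Main()
    {
        long chars = 0;
        for (int i = 0; i < 1_000_000; i++)
        {
            // One short-lived allocation per iteration: every one of these
            // counts toward total allocations, but almost none stay resident.
            var s = new string('x', 100);
            chars += s.Length;
        }

        long allocated = GC.GetTotalAllocatedBytes();               // cumulative bytes ever allocated
        long peak = Process.GetCurrentProcess().PeakWorkingSet64;   // resident-memory high-water mark

        Console.WriteLine($"Characters seen:  {chars}");
        Console.WriteLine($"Total allocated:  {allocated / (1024 * 1024)} MB");
        Console.WriteLine($"Peak working set: {peak / (1024 * 1024)} MB");
    }
}
```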
The article doesn’t explain why they care, but many garbage collectors make it hard to hit a latency target consistently (i.e. while the GC is running its longest critical section). Also, garbage collection is somewhat expensive (though usually better optimized for short-lived allocations than malloc), and re-using memory makes caches happier.
Of course, there’s a limit to how much optimization one needs for a CSV-like file in the hundreds of MBs…
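To make the latency point above concrete, here is a small sketch (an illustration, not something the article does) of how .NET code can ask the GC for more predictable pauses around a latency-sensitive section; it assumes the default background GC is enabled.

```csharp
// Illustration only: trading a little throughput for more predictable
// pause times around a latency-sensitive region of code.
using System;
using System.Runtime;

class GcLatencySketch
{
    static void Main()
    {
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");

        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            // Discourage blocking, full (gen 2) collections while this runs.
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
            // ... latency-sensitive work, e.g. the "real-time calculations" ...
        }
        finally
        {
            GCSettings.LatencyMode = previous;
        }
    }
}
```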
Maybe their machines don’t have 8gb of free memory lying around.
As shown in the table, they don’t use anywhere close to 8GB of memory at a time. This seems like a case that .NET is already very good at handling at a baseline level.