I think what is needed is a process group (instead of prctl). Then the queue manager can clean up the child processes by sending a signal to the process group when its leader exits.
This involves… PTYs if a vague horrible memory I have tried to suppress is correct?
Googling leads to setsid(), and its man page says:
If a process that is a session leader terminates, then a SIGHUP signal is sent to each process in the foreground process group of the controlling terminal.
I am going to try to once again suppress all knowledge in this area so I’m never tempted to try to do anything involving it.
If the queue manager is the normal sort of daemon, it’ll call setsid() to create a session without a controlling terminal, so there’s no need for ptys to be involved. Call setpgid() after forking a worker; when the worker exits, send a signal to its process group to get rid of any stragglers.
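In case it helps to see the shape of that pattern, here is a minimal Python sketch of setpgid()-after-fork plus killpg()-on-exit. It is my own illustration, not anyone’s actual queue manager; the "worker-command" name is a placeholder, and a real daemonized manager would already have called os.setsid() when it started.

import os
import signal

def spawn_worker():
    # Fork a worker and give it its own process group.
    pid = os.fork()
    if pid == 0:
        # Child: become leader of a new process group so the whole
        # worker tree can later be signalled as one unit.
        os.setpgid(0, 0)
        os.execvp("worker-command", ["worker-command"])  # hypothetical worker binary
    # Parent: set the group as well, to avoid racing the child.
    try:
        os.setpgid(pid, pid)
    except PermissionError:
        pass  # child already called setpgid and exec'd
    return pid

def reap_worker(pid):
    # Wait for the worker, then signal its whole group to kill stragglers.
    os.waitpid(pid, 0)
    try:
        os.killpg(pid, signal.SIGTERM)  # the group id equals the leader's pid
    except ProcessLookupError:
        pass  # no stragglers left
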
I’m really eager to try out the Mono variant. I’ve been maining Iosevka Comfy (which has been renamed now, I believe), but Atkinson Hyperlegible is my favorite font to read outside of my editor!
I tried it out today. Found it too wide. Wider than probably even Source Code Pro!
Yeah, I’m trying it as a replacement for Source Code Pro and not sure how I feel. My code feels less graceful and more… industrial?
Yes that’s it! And the kerning…characters feel too spaced apart. That might be the intention though, for readability.
It actually does feel more readable in some ways when I switch. But also… it’s showing rectangles instead of backticks? Are you seeing that too?
UPDATE: switched to the variable TTF from Google Fonts (instead of the OTFs I was using) and that fixed it.
It shows a backtick ` correctly in VS Code.
A new version was released in 2025 with an updated design, as well as a monospaced version.
Managed Kubernetes clusters have a baseline cost that’s quite high, so they saved money by running a single-machine Kubernetes distribution on a single VM, at 1/4 the cost. Which is important for a free community service as in this case, but also relevant for small businesses.
One key point I got out of this is that sampling profilers are really bad at catching intermittent latency spikes (or worst case latency, the author’s words).
Another argument for the idea that different performance use cases need different kinds of profiler measurement and different kinds of visualizations.
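To make that concrete, here is a toy Python sketch (my own, not from the article or thread) of the tracing-style alternative: wrap each operation in a timer and keep every measurement, so rare spikes show up in the p99/max numbers even though a sampling profiler would mostly see the common fast path.

import random
import statistics
import time

def operation():
    # Toy workload: usually fast, occasionally spikes.
    time.sleep(0.001 if random.random() > 0.01 else 0.05)

latencies = []
for _ in range(2000):
    start = time.perf_counter()
    operation()
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"mean {statistics.mean(latencies) * 1000:.2f} ms")
print(f"p99  {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms")
print(f"max  {latencies[-1] * 1000:.2f} ms")
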
One useful and relevant tool: https://magic-wormhole.readthedocs.io/en/latest/welcome.html#motivation
tracemalloc counts the amount of memory alloc’ed. It doesn’t tell you the actual amount of memory used, as tracemalloc doesn’t track freed memory. Looking at the process RSS over the iterations tells a very different story. As it turns out, the Python GC is able to free the small buffer between iterations, making read memory usage a non-issue.

import psutil
import io

read_size = 1000
buffer_size = 100 * 10**6
buffer = io.BytesIO(b"0" * buffer_size)
process = psutil.Process()
for i in range(buffer_size // read_size):
    x = buffer.read(read_size)
    if i % (buffer_size // read_size // 10) == 0:
        print(process.memory_info().rss)  # in bytes

On my machine, the code above peaks at 116 621 312 bytes (vs 116 129 792 bytes without the loop).
Just saw that the first snippet in the article does a single read. I guess in that case the article is right, but it could be argued that .read() is a bit of an anti-pattern to begin with.
Yeah, I was specifically writing about the case where you want to or need to do a single read(); it is the desired pattern in some cases, or sadly unavoidable without major refactoring, as in the Polars case I encountered. But if you’re doing a series of small read()s, which depending on the use case is what you want, you don’t need getvalue(). I should mention that in the article; I’ll update it.
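If you want to see the two views side by side, here is a small sketch of my own along the same lines: tracemalloc reports what Python allocated (and its peak), while RSS is what the OS is actually holding for the process, so the two numbers can diverge substantially.

import io
import tracemalloc

import psutil

tracemalloc.start()
buffer = io.BytesIO(b"0" * (100 * 10**6))
for _ in range(100):
    chunk = buffer.read(10**6)  # read the whole buffer in 1 MB chunks

current, peak = tracemalloc.get_traced_memory()
rss = psutil.Process().memory_info().rss
print(f"tracemalloc current: {current:,} bytes, peak: {peak:,} bytes")
print(f"process RSS:         {rss:,} bytes")
tracemalloc.stop()
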
Pretty impressive on its own, but compounded more so by the fact that he is in his final year of high school. Project’s GitHub here: https://github.com/Hello9999901/laptop
I don’t see high school anywhere in there. “time-independent and dependent Schrödinger equations” require 5 semesters of calculus, which is usually only available in university.
He is a high school senior right now, at Phillips Exeter Academy. Granted, it is quite possible it’s not your average high school; here’s the author talking about the topics they’ve learnt (copied from the orange site):
We discussed wave functions, probability, fermions/bosons, did calculations for particle in a box, the Schrödinger model, and went just up to deriving the hydrogen atom. Nothing super fancy, but it was one heck of an experience!
It’s a very fancy private boarding school, yes.
Googled it. TIL that there is another Exeter in the USA.
If you’re using Rust from Python, https://github.com/mityax/rustimport has the benefit that you can then rely on framework-level reimports, e.g. I suspect the mechanism in Flask dev mode would also mean the Rust code gets recompiled on each request iff it’s changed.
Ah could I have been the inspiration for the “Your software’s electricity usage may not matter” section? https://lobste.rs/s/ghzqso/how_solar_minecraft_server_is_changing#c_avxn11
If so, I love inspiring disclaimer sections :)
I had not seen that, no, but clearly you made an excellent point :D
I support efforts to mitigate climate change, and I like performance optimization, but I’m deeply skeptical of this line of thinking.
The thing is that organizations already have a strong economic incentive to reduce program runtime because there are economic costs (cost of servers, cost of slower processes). I believe (though I must acknowledge I’m not citing solid evidence) that those costs dominate the cost of electricity (with or without externalities–and I would favor making those externalities explicit with carbon taxes or cap and trade).
If that’s true, then there’s no significant reason to talk about the carbon emissions–doing what is economically rational will reduce them anyway.
Of course, companies will often not do what’s economically rational, because they’re poorly managed. But most of those poorly managed companies also won’t say “forget what’s profitable, let’s reduce our emissions”.
The article gives one example outside the realm of corporate money saving: widely used open source projects. Empirically some of them could be much much faster given some resources.
I am skeptical of arguments that rely on corporate rationality, because:
The incentives of the individual aren’t necessarily those of the organization.
There’s research showing the ROI required for getting rid of personnel is lower than the ROI required for efficiency of other kinds. I don’t have time to track it down (should really bookmark when I do), so take it with a large grain of salt, but I’ve certainly empirically experienced companies making my customer experience dramatically worse by replacing humans with “AI”…
I once interviewed at a company that did electric demand response for companies (“we’ll pay you $X to shut down your high-electric-usage devices when power usage is surging” aka negative power plants). They also ran an energy efficiency audit service. I asked, and was told that (a) the energy efficiency audit tended to save the company more money than they made from demand response (b) companies (which devolves to managers) were much more excited about payment for demand response than they were in efficiency. So a guarantee of decreasing costs was valued less than a gamble on getting revenue.
All that being said, yes, reducing costs does sometimes push companies in the right direction on efficiency. Certainly hyperscalers have invested a lot in datacenter efficiency and will pay for developing things like a 5% faster hashmap if that means a 0.1% reduction in server costs.
By watching Not Just Bikes YouTube videos I learned about induced demand in transportation: increasing road capacity does not result in reduced congestion. In response to the added capacity, more people decide to drive a car, and sit in the same or even worse congestion.
Wouldn’t something similar apply here? Making programs run faster may not result in less CPU time spent. An optimized game engine will load cut scenes faster, and make the game more addictive. A faster JS engine will allow adtech companies to send even more junk with every page load. A leaner linux kernel will allow it to run on smaller chips, and we will get IoT crap on more appliances. And so on.
I’m not saying we shouldn’t try to make things more efficient. I’m saying, this could potentially result in more computation (and energy spent), not less.
Yes, the same thing will happen. In the context of resource efficiency, the phenomenon is called the Jevons Paradox. Making a resource’s usage more efficient at a societal level causes more of it to be used, not less. Faster software won’t reduce emissions, it will just make us run more software.
I discuss this in the article (or rather, link to a more extensive article about it: https://pythonspeed.com/articles/software-jevons-paradox/). My initial take: in some cases, yes, but not all.
Induced demand only applies when people actually want something. People want to drive, so when driving becomes cheaper, more people will drive. It’s only because of the sheer number of people who want to drive somewhere, combined with the geometry of the world, that congestion never gets fixed.
For an example of induced demand being good, look at trains. If you make trains run more often, they may become more popular, as they are now more useful for getting between places. They may become crowded, but they’ll get to their destinations just as fast. And if the trains are getting quite crowded, well, that’s a good argument for running more trains, increasing frequency, which makes the service better yet again, and may induce even more demand. So, the service becomes better with induced demand in one axis (frequency) but not necessarily another (crowding). But there’s another crucial difference: the space savings of trains compared to cars is so great, that it actually is possible to fully meet all demand, so inducing demand until trains are too crowded isn’t inevitable, the way congestion is inevitable with cars.
Finally, we get to efficient programs and computers. More efficient programs probably would induce demand. But how much? It can only happen to the extent that there is demand for running programs currently not met because it’d be too expensive. How large is that extent? It is surely not infinite. How much more efficient can we make our programs? The question of whether we’d use more energy or less depends on the answers to those questions.
People want to drive, so when driving becomes cheaper, more people will drive.
I mean, they want to get to their destination, they believe driving is the only valid way to do that, and they reason that more lanes means more ez ride.
It’s frustrating because (here in Chicago) it is absolutely true that trains are late, busses bunch up on routes, and scheduled routes are sometimes too infrequent, but folks who use them as reason not to use transit ignore the indignities they already experience while driving. Oh well.
Am I off topic? At the very least I’m looking too far into the analogy. Pardon me.
It’s almost as if you didn’t read the article at all?
I did. Your post is focused on reducing power consumption and power-efficient code writing, and to my understanding, it’s technically correct. Good stuff.
But I got somewhat triggered by the “Reducing CO₂ emissions” title and consider it click-bait. My point is: writing code with low energy consumption in mind is great. But slapping a CO₂-emissions-related title on it - not so much. Even if your code were run - like curl - on most electronic devices on Earth, the impact on CO₂ emissions would be - to put it politely - “marginal”.
But in the aggregate what would happen if everyone managed to make their software twice as fast?
A far-fetched assumption, not backed up by benchmarks, but I’d say that given how Python is used nowadays around power-hungry tasks (making GPUs go brrrrr), there’s near-zero impact from making Python code twice as fast.
Unfunny joke: you can use threading and such to draw even more power from more GPUs. But seriously - while Python is not C - it’s almost never the problem when it comes to power consumption (we’re not talking about IoT or code running on low-power, battery-powered devices). And I’ll raise the “Python is nowadays used to do ML stuff” point over and over in this discussion.
It’s not quite the same as it was back then (but until the GIL is actually gone-gone it still is an issue), but I had a really funny result doing some experiments in grad school around Python threads. This would have been around 2009.
The gist of it… I instrumented the GIL with RDTSC to keep track of lock contention while varying the number of cores that my threaded Python program could use. I made sure that I wasn’t using C libraries that release the GIL (e.g. numpy) and I made sure that my program was CPU-bound but not cross-thread mutex bound. At a coarse level I was measuring overall execution time and at a fine-grained level I was keeping track of all of the GIL request/acquire/release times so that I could look for specific behaviour patterns while varying the number of CPU cores.
My expected result was that adding extra cores would result in a slight slowdown due to cache misses and lock acquisition but otherwise would perform similarly to single-thread/single-core for processing all of the data. What I ended up seeing, though, was good performance on 1 core, similar performance on 2 cores, and dramatically worse performance on 3 cores and 4 cores. Like… a performance drop that I couldn’t explain using any kind of theoretical model I could think of. And then… I saw it.
The system I was using for benchmarking (bare metal, no VMs or containers) was configured for CPU clock scaling. When I distributed my process across 4 cores the GIL limited each of the threads to only 25% CPU usage on each core. Frequency scaling saw that the CPU was underutilized and scaled down the clock frequency. But… because of the GIL it continued to only use 25% of each core, which meant that the clock got dropped all the way down to its minimum. So, lol, adding more threads can make it potentially more energy efficient if you design your experiment sufficiently poorly!
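For anyone who wants to poke at this themselves, here is a minimal sketch (mine, nothing like the instrumented setup described above) of a pure-Python, CPU-bound threaded workload; combine it with taskset and your CPU governor settings to see how core count and frequency scaling interact with the GIL.

import threading
import time

def burn(n):
    # Pure-Python, CPU-bound work that holds the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(num_threads, work=5_000_000):
    threads = [threading.Thread(target=burn, args=(work,)) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

for n in (1, 2, 3, 4):
    print(f"{n} thread(s): {run(n):.2f} s for {n}x the total work")
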
I wrote this question-comment yesterday, left it for a next-day review, then deleted, and now I’m rewriting. Fuck it if I come off as an idiot, but I’m nerdy-curious and just have to take the risk and ask…
I made sure that my program was CPU-bound but not cross-thread mutex bound.
When I distributed my process across 4 cores the GIL limited each of the threads to only 25% CPU usage on each core.
Was anything around this done within Python, and not at the OS level? I’m interested in the “Pythonic” vs. operating-system (sysctl etc.) manner of setting stuff up, and super curious about how you both set/limited and tested CPU usage. In detail.
EDIT: also: have you pinned these threads to particular P-cores or E-cores, or tested all this on P-cores, or a CPU with no such distinction?
It was 13-15 years ago, but I’m 99.9% sure it was just done with taskset: https://man7.org/linux/man-pages/man1/taskset.1.html
And as far as I can recall, the CPUs that I would have had access to at the time didn’t make a P-core/E-core distinction. It likely would have had old-style Hyperthreading, though, and I’m pretty sure I would have explored that a little bit, but if I did the results weren’t surprising enough to write themselves to long-term memory :)
As far as how I was tracking CPU usage, it likely would have just started with some flavour of top, followed by a Python script that would do the same thing top was doing (reading stuff out of /proc?) and logging it on a relatively fine-grained interval.
Finally: those are not dumb questions at all! Those, to me, indicate that you really get the kind of issues that can trip up benchmarks like this. I’d love to go back and repeat these kinds of experiments 15 years later, especially with the new “we’ve got a road to eliminating the GIL” stuff that landed recently, but… no one’s paying me to do systems research anymore (or, honestly, to write Python code), so someone else will have to pick up that torch.
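A quick sketch of that kind of logger, using psutil rather than parsing /proc by hand — my own approximation of the script described, not the original; note that cpu_affinity() is only available on some platforms (e.g. Linux and Windows).

import sys
import time

import psutil

# Log a target process's CPU usage at a fine-grained interval.
pid = int(sys.argv[1])
proc = psutil.Process(pid)
interval = 0.1  # seconds

while proc.is_running():
    try:
        cpu = proc.cpu_percent(interval=interval)  # % of one core over the interval
        affinity = proc.cpu_affinity()             # which cores it may run on
    except psutil.NoSuchProcess:
        break  # target exited mid-sample
    print(f"{time.time():.3f} cpu={cpu:6.1f}% cores={affinity}")
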
Instead of wagging your finger at countries that have been victimized for centuries by colonial powers, a much more productive strategy would be to loudly advocate for nuclear energy in Europe instead of whatever harebrained ‘renewable energy’ (powered by African-mined batteries) scheme they are concocting. Look to the French, they are actually decarbonizing. You can tell because the Germans are trying to penalize them for not meeting ‘renewable energy’ targets (note, not decarbonizing but renewable energy).
I expected this to be implemented using a ring buffer, so I was surprised to see:
if n_bytes > unfilled {
    // we need to cycle the buffer
    // there's probably a more efficient way to do this
    for i in self.pos..self.filled {
        self.buffer[i - self.pos] = self.buffer[i];
    }
    self.buffer.truncate(self.desired_capacity);
    self.filled -= self.pos;
    self.pos = 0;
}

which is a greater-than-zero-copy (and memmove is probably the “more efficient way” you’re looking for).
Yeah, slice::copy_within is the easy way to call memmove here.
Maybe file an issue? https://github.com/mwlon/pcodec
Compilers are pretty good at vectorizing loops these days. Here’s the amd64 assembly for the loop (godbolt: https://godbolt.org/z/qxdjzcz88):
Most Windows extension modules link against pythonXY.dll (e.g. python39.dll) or python3.dll and will fail to load on the static distributions. Extension modules will need to be explicitly recompiled against the static distribution.
There is no supported platform tag for Windows static distributions and therefore there is no supported way to distribute binary wheels targeting the Python static distributions.
Aspects of OpenSSL (and therefore Python’s ssl module) don’t work when OpenSSL is compiled/linked statically. You will get opaque run-time errors.
Buggy binary extensions, buggy wheels, buggy OpenSSL. Those are very big problems if you’re developing on Windows and trying to run the code natively, without WSL or Docker. I’d go as far as saying it’s a deal breaker.
Too bad, I was kinda looking forward to trying uv out, but unless the standalone builds get a better Windows story, or uv allows for other sources, I guess I’m stuck with pyenv.
You can still use uv for managing dependencies, you just wouldn’t use it to manage your python installation.
I guess you’d be able to solve most of those problems if you could compile the installed Python? I have been meaning to open an issue about that, just to know if it was on their roadmap.
Kind of an out-of-the-frying-pan-into-the-fire situation, because then I have to compile stuff on Windows. Supporting other builds, like the official ones, for instance, would be nice.
Someone was talking about considering submitting a PR to use official builds, on Mastodon (one of the Python core devs, maybe?), so it might happen.
What’s the difference with regular builds?
See my first comment, but tl;dr the standalone builds are buggy with compiled extensions on Windows
Now I just need to wait for GitHub Actions to release it and I can release a couple of libraries’ 3.13 support. GHA having the pre-releases as version 3.13-dev is annoying enough in some cases that I can’t be bothered to release in advance.
I’d love to have seen at least one std::arch implementation (say, for ARM or x86). Yes, I get it, the portable stuff is cool and, well, portable, but it’s still only available on nightly. Yes, the stuff on stable needs one implementation per architecture, but that also means you can optimize for each independently, which at times might generate better code (e.g. x86 makes the comparison masks 0 or 255, so one could simply subtract them from the counters; I haven’t benchmarked the difference, but I suspect it might improve perf).
Someone shared some numbers from using intrinsics, albeit in C, here: https://www.reddit.com/r/programming/comments/1fsyh5x/comment/lpssj9l/ No guarantee they did it optimally, of course.
There’s a bunch of implementations of this exact Mandelbrot logic here: https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/mandelbrot.html The fastest version is in Rust, but doesn’t actually use intrinsics. Given how it’s written, I suspect it gets compiled into SIMD via LLVM’s SLP autovectorization. I am not sure how the version I used (ported from packed_simd, and very slightly slower for some reason than their version) compares in speed.
I once made a faster version (at least on my machine) than that benchmarksgame entry, also in Rust, which wasn’t accepted because Mr. Gouy didn’t want a SIMD arms race back then. Also, in my bytecount crate we had a packed_simd-based set of methods, and they were almost as fast on x86 but measurably slower on ARM (benchmarked on my mobile phone and an M2 MacBook Pro).
So while I expect the difference to be small, I suspect there will be one.
Really interesting; the main thing I would like to see that isn’t there is CPU parallelism. Comparing a single CPU core to a GPU isn’t really a fair comparison.
As long as the /usr/sysroot/ subdirectories remain immutable and synchronized across all environments in which binaries run, we are guaranteed to always get a deterministic version of glibc for a given binary.
and then
Containers are, usually, not the best solution to a systems problem. While they might be coerced to deliver the desired results, they come with a heavy cost. Unix has been around for a long time and there exist alternate solutions to problems that don’t require full replicas of a functioning system.
So the author is saying we need a way to replicate some folders across systems and make sure they’re immutable. We could build a server that allows us to download immutable “images”, and then have a command-line tool that makes those images part of the filesystem… just need to pick a name for this concept…
What if rather than “images” we had “store paths” and instead of arbitrary version numbers we used some kind of hash?
To answer the question seriously (and as someone who does have store paths rather than images):
I suspect store paths can and even do waste less space than images.
I suspect that store paths can make it easier to see what all files one has installed on a system (e.g. to scan for vulnerabilities), relative to listing all files in images.
I think store paths, at least in some applications, may be a more direct, less abstracted, simpler, maybe even more sensible way than images to have multiple copies of software components in a system without their clashing with each other.
I think store paths, at least as used in NixOS, still suffer from difficulties around rapid security updates; how they compare to images in this regard I don’t know.
Container images are stored by hash, and you can reference them that way if you desire.
The way I see it, the difference lies in chroot. That’s what brings cognitive complexity to containers. Using the article, people can make an informed decision considering their own specific context. Cost/benefit, yadda yadda…
Correct. Containers are good for some things but they aren’t the only solution and they aren’t always the best solution. Furthermore, containers can be lightweight or heavyweight, and from what I observe, they tend to be much larger than they could be.
The problem I face (which I didn’t fully describe in the article) is much easier to solve in the way I described. Chroot-style solutions, like containers, could of course work… but they would make the solution much harder to manage logistically and much more inefficient.
Note that this is using undefined behavior, so it’s unsound and could break at any time.
https://github.com/ogxd/gxhash/issues/82 (there’s a concurring comment from Ralf Jung, who would know what he’s talking about.)
Yeah, and the author seems to confuse “works on my machine” with “doesn’t have UB”.
FWIW, I think this is a fine use-case for inline assembly - if you want “what the machine does” semantics as opposed to Rust semantics, that’s how you can get that.
It’s also a use case for compiler builtins that define or change language semantics to whatever it is we need them to be. Rust semantics are only useful if they get the compiler to generate the code we want. If that’s not happening, it’s time to leave them behind.