As a quick reminder for why Python introduced a GIL in the first place, it’s important to remember that Python predates the rise of threading as a popular programming model (Python is older than Java!), and came from the Unix-y scripting-language world where the standard solution had always been to use multiple processes.
And by the time threading became popular enough that people were demanding support for it, Python was in a tough spot because, at that point, one of its big selling points was the ecosystem. Not the ecosystem of pure-Python packages, but the ecosystem of C libraries people had written Python wrappers for. Very few of those were thread-safe, and demanding that they all be rewritten was not really a good idea.
Now, at that point in Python history the main thing you’d want threading for was network daemons, which — crucially — are generally I/O bound. And at that point in computing history, most people didn’t have multi-core/multi-processor hardware, so you weren’t really going to be executing multiple threads simultaneously anyway. So the GIL was added, and is a clear trade-off: it tries to minimize the overhead on single-threaded code, and on multi-threaded I/O-bound loads where threads can just release while blocking, at the cost of making multi-threaded CPU-bound loads extremely painful.
Unfortunately, a couple decades later we all have multi-core/multi-processor laptops, tablets, phones… and Python got real popular in CPU-hungry number-crunching applications like data science. What seemed a reasonable compromise at the time feels less so with the benefit of hindsight.
The more I do threading… the more I like processes.
This is one of the many reasons I love Rust, threading is safe, so you get the benefits of shared address space without the downsides.
Tiny adjustment to your statement that doesn’t invalidate what you’re saying…
Likewise, especially when I realised that there’s no way to kill a system thread properly, and only hacks to kill a Python thread, none of which works if your thread is blocking on a system call.
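The usual workaround (a minimal stdlib sketch, not a real kill) is cooperative cancellation: the thread polls a shared flag between bounded units of work. A thread parked inside a blocking system call never gets back to the check, which is exactly the failure mode described above.

```python
import threading
import time

stop = threading.Event()

def worker():
    while not stop.is_set():
        # Do one bounded unit of work, then re-check the flag.
        # If this were a blocking system call instead, the thread
        # would never get back here to notice the stop request.
        time.sleep(0.01)

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)   # let it run briefly
stop.set()         # request shutdown
t.join(timeout=1)
print(t.is_alive())  # False: the thread saw the flag and exited
```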
I really liked the evolving mental model approach of this article. Thanks for writing this.
Dumb question, but why can’t we do something about the GIL if it hurts parallelism? Maybe an option to remove/disable it? I think it must’ve been done somewhere.
One reason it is hard technologically is because at the moment: any operation that involves only a single Python bytecode op, OR any call into a C extension which doesn’t release the GIL or re-enter the Python interpreter, is atomic. (Re-entering the Python interpreter may release the GIL.)
This means all kinds of things are atomic operations in Python. Like dict reads/writes and list.append(), either of which may call malloc or realloc in the middle.
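A quick way to see this in practice (a small sketch using only the standard library): four threads appending to one completely unlocked list never lose an entry, because each `list.append` is a single atomic operation under the GIL.

```python
import threading

# Shared list mutated from several threads with no lock at all.
shared = []

def worker(count):
    for i in range(count):
        # Each list.append is one atomic operation under the GIL,
        # so concurrent appends never corrupt the list or drop entries.
        shared.append(i)

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # 40000 -- no appends were lost
```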
You can write many data race-y programs in Python that have well-defined (messy, but still well defined) semantics. I think nobody in the world has an idea of how much code there might be in the wild that (possibly accidentally) abuses this. So making data races be undefined behaviour would be quite a large backwards compatibility break, in my opinion.
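To see where the atomicity boundary sits, one can disassemble an augmented assignment with the stdlib `dis` module (a sketch; the exact opcode names vary by CPython version). The single source line compiles to several bytecodes, and only each individual bytecode is atomic, so another thread can be scheduled in between them and updates can be lost — messy, but still well-defined behaviour.

```python
import dis

def incr(counter):
    # `counter["n"] += 1` is NOT atomic: it compiles to separate
    # load / add / store bytecodes, and another thread can run
    # between any two of them.
    counter["n"] += 1

# Print the bytecode: several instructions for one source line.
dis.dis(incr)
```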
You don’t want to “just” slap a mutex on every object because then the lock/unlock calls would kill performance.
I believe the PyPy developers are/were looking at shipping an STM implementation and the GILectomy fork involves a lot of cleverness of which I can remember no details.
There have been (more than) a few experiments to remove the GIL in the past 20 years. To my knowledge they end up performing worse or being less safe.
There’s a new PEP to get a more granular GIL.
There is an exciting new approach by Sam Gross (https://github.com/colesbury), who has made an extremely good nogil version of Python 3.9 (https://github.com/colesbury/nogil). It runs with almost no overhead on my 24-core Mac Pro test machine.
It is a sensational piece of work, especially as you mention there have been so many other experiments. I know Sam has been approached by the PSF. I am crossing my fingers and hoping they will merge his code.
I’ve been struggling with a Python performance issue today that I suspected might relate to the GIL.
Your comment here inspired me to try running my code against that nogil fork… and it worked! It fixed my problem! I’m stunned at how far along it is.
Details here: https://simonwillison.net/2022/Apr/29/nogil/
They tend to perform worse on single-threaded workloads. Probably not all of them, but I’m quite sure that several attempts, even rather naive ones, produced multi-threaded speed-ups at the cost of being slower when running on a single thread.
Even ideas that succeeded in improving multi-threaded performance got shot down, because the core team believes this trade-off (slower single-core for faster multi-core) is not acceptable.
IIRC the position was taken fairly early on by Guido that proposals to remove the GIL would not be accepted if they imposed slowdowns on single-threaded Python on the order of… I think a cutoff of about 5% or 10% might have been suggested?
That’s kind of what I remember too.
There are experiments underway, e.g. https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6ad0/, and there have been previous attempts that failed.
Because, allegedly, the gain in safety is greater than the loss in concurrency efficiency.
It is a reliable, albeit heavy-handed, way of ensuring simple threaded code generally works without headaches. But yes, it does so by eroding the gains of multithreading to the point of questioning whether it should exist at all. Arguably.
Some async libraries mimic the threading API while resorting to lower-level async primitives. Eventlet and gevent come to mind.
No, it’s about performance and a little bit about compatibility.
Most Python programs are single-threaded, and removing the GIL would not cause most of them to become multi-threaded, since the average Python program’s workload is not something that benefits from multiple threads. And basically every GIL-removal attempt has caused a performance regression for single-threaded Python programs. This has been declared unacceptable.
Secondarily, there would be a compatibility issue for things which relied on the GIL and can’t handle having the acquire/release turned into no-ops, but the performance issue is the big one.
Why does this happen?
Most of the time when a GIL removal slows down single-threaded code, it’s because of the GC. Right now Python has a reference-counting GC that relies on the GIL to make incref/decref effectively atomic. Without a GIL they would have to be replaced by more cumbersome actually-atomic operations, and those operations would have to be used all the time, even in single-threaded programs.
Swapping for another form of GC is also difficult because of the amount of existing extension code in C that already is built for the current reference-counting Python GC.
Because significant portions of the Python ecosystem are built with a GIL in mind, and would probably break the moment that GIL is removed. You’d essentially end up with another case of Python 2 vs Python 3, except now it’s a lot more difficult to change/debug everything.
A heavy-handed approach is to use multiprocessing instead of multithreading. Then each subprocess gets its own independent GIL, although that creates a new problem of communicating across process boundaries.