Threads for itamarst

  1. 17

    https://twitter.com/m_ou_se/status/1557742789693427712 has an additional writeup. Using VecDeque as a Read/Write circular buffer is cool.

    1. 2

      I somehow went through most of my programming career so far (more than a decade) without even knowing that sampling profilers existed. I knew about instrumenting profilers, like Python’s profile and cProfile modules (and the old hotshot), which trace every function call, but I also knew that they added a lot of overhead and produced distorted results. I don’t remember if I finally learned about sampling profilers when I went to Microsoft in 2017, or shortly before. So I wonder if widespread awareness of sampling profilers is a fairly recent thing.

      1. 3

        Might be a Python thing? I think the most popular sampling profiler is py-spy, and it was only started in 2018 or so, and before that pyflame introduced the idea but it was started in 2016. I forget what was in use before that… possibly nothing?

        But back in 2005 or 2006, when writing C++, I was taught about whatever the full-system Linux profiler was back in the day, before perf came around.

      1. 3

        Hrm… I’m not seeing a significant difference in the demos (to be fair, I run a fairly non-standard Firefox + Linux setup which may not have been tested), and I soon realized the main hero “example” is actually a pre-made video 🤨

        1. 2

          I had one demo go faster with normal SVG, one made no difference, and a third was faster with SSVG. Firefox on Linux.

        1. 2

          !!! I did not know that existed, that’s really cool.

          1. 1

            I think all the points raised in this article are reasonable. There are two big problems with it, however:

            1. The only bottleneck you can observe in production is the one you’re already hitting. In my line of work that is often much too late.
            2. Having identified a putative bottleneck, you lack a good way to confirm your hypothesis. You can fix the problem, which will be wasted effort if you were wrong, or try to make the problem worse, which will make your customers angry if you were right.

            I would suggest that the best way to find performance bottlenecks is to build a scale model of production using exactly the same specifications for everything, and then test it.

            1. 2

              I imagine specific problem domains do end up with different best approaches, yeah, so I’d love to hear what your specific line of work is; for context my main focus on the site is batch jobs / data processing pipelines.

              If things you care about are e.g. “we hit 2x the visitors unexpectedly and now there’s a cascading backlog in all the website’s backend services” then yeah, you want to prevent that, not diagnose it.

              1. 1

                I would suggest that the best way to find performance bottlenecks is to build a scale model of production using exactly the same specifications for everything, and then test it.

                I believe we have established, pretty authoritatively, that it is not possible to effectively model “production” of any nontrivial application in a staging or test context. The real-world interactions of the uncountably many components that make up your prod system are simply too numerous for the cost of modeling them to be less than the value those models deliver. This is the root of the “test in prod” line of thinking that underlies the current (highly effective!) model of observability.

              1. 4

                I have Caps Lock set as the compose key, and then à is Caps Lock, a, backtick (`). I need to memorize a bunch of compose shortcuts, but it’s also available everywhere on my Linux desktop, not just Emacs.

                1. 2

                  Yes, compose is much more powerful. I love how it makes it very easy to remember how to type characters for most Scandinavian and Latin languages. Even typing ™ is just Compose+T, M.

                  A long time ago, I contributed some patches to an Emacs fork for macOS similar to what the OP posted, but I still prefer compose.

                1. 2

                  For what the OP calls “shuffling hashes”, there are at least two use cases with different goals. If you want multiple processes to get the same hash over time and space (a distributed system and/or hashes stored persistently) you want a hash with consistent output, like highway hash. In contrast, if your hash is an implementation detail of a single-process in-memory hash map, say, you don’t care if its representation changes over time. https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md has a good discussion of this.

                  https://en.wikipedia.org/wiki/Metaphone is a family of soundex alternatives. https://en.wikipedia.org/wiki/Category:Phonetic_algorithms lists a few more.
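
                  A small Python illustration of the same split, using hashlib as a stand-in for a consistent keyed hash like highway hash, and the built-in hash() as the "implementation detail" case (just a sketch, not anything from the article):

                      import hashlib

                      # Stable across runs, machines, and Python versions: suitable whenever
                      # the hash is stored or shared (the "consistent output" use case).
                      print(hashlib.sha256(b"hello").hexdigest()[:16])  # always 2cf24dba5fb0a30e

                      # The built-in hash() for strings is randomized per process
                      # (PYTHONHASHSEED), so it's fine as an in-memory dict implementation
                      # detail but useless as a persistent identifier.
                      print(hash("hello"))  # differs from run to run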

                  1. 2

                    If you want multiple processes to get the same hash over time and space (a distributed system and/or hashes stored persistently) you want a hash with consistent output, like highway hash. In contrast, if your hash is an implementation detail of a single-process in-memory hash map, say, you don’t care if its representation changes over time.

                    Indeed this is correct, and I’ll have to think about how to express this in the context of the article, without getting bogged down in implementation details… (Because all algorithms linked there, including the keyed ones, are deterministic. It is just a matter of which you use, and how.)

                    Thanks for the feedback.

                  1. 1

                    The “mean” example falls flat – if numpy didn’t support “mean” itself, you could use the fact that it supports “sum” and then do the division operation. It would be slightly slower, of course, but you would still get the benefits from vectorization.

                    This is one reason why “it’s rare”.

                    It might make sense to add this to “ways to work around”: see if your operation can be broken into multiple supported vectorized operations. For example, if you needed to calculate stddev, you could break it down by (a rough sketch in NumPy follows the list):

                    • Get sum
                    • Get mean
                    • Get squares
                    • Get sum of squares
                    • Subtract
                    • Get square root
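
                    Roughly, in NumPy terms (np.std exists, of course; this sketch just shows the decomposition into supported vectorized pieces):

                        import numpy as np

                        def stddev(arr):
                            # Population standard deviation from "supported" pieces only:
                            # sqrt(mean(x**2) - mean(x)**2)
                            n = arr.size
                            mean = arr.sum() / n                          # sum, then divide
                            mean_of_squares = (arr * arr).sum() / n       # squares, sum of squares
                            return np.sqrt(mean_of_squares - mean ** 2)   # subtract, square root

                        arr = np.array([1.0, 2.0, 3.0, 4.0])
                        print(stddev(arr), np.std(arr))  # both ~1.118
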
                    1. 2

                      Edit: Thanks for the feedback! I flipped the order of the first and second problems and updated the Numba example, and I think now the narrative makes a lot more sense.

                      1. 1

                        Thanks – this does make more sense (especially the “not even one copy needed” advantage of numba).

                        Follow-up question: at the top, you link to an article covering SIMD vectorization in recent NumPy versions. Does Numba JIT take advantage of SIMD?

                        1. 1

                          My understanding is that it relies on LLVM to do the heavy lifting, so it will get SIMD in cases where LLVM is smart enough to auto-vectorize. So, sometimes, yes. See https://numba.readthedocs.io/en/stable/user/faq.html?highlight=simd#does-numba-vectorize-array-computations-simd and the following question.
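
                          A minimal sketch of the kind of loop that can get auto-vectorized (assuming Numba is installed; fastmath relaxes floating-point semantics, which often helps LLVM vectorize reductions):

                              import numpy as np
                              from numba import njit

                              @njit(fastmath=True)
                              def total(arr):
                                  acc = 0.0
                                  for x in arr:  # simple reduction loop, a good auto-vectorization candidate
                                      acc += x
                                  return acc

                              print(total(np.arange(1_000_000, dtype=np.float64)))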

                    1. 3

                      Does anyone know how it compares to https://stork-search.net/ ?

                      Seems quite similar at first glance, except for MIT vs GPL; both seem to use Rust and wasm and focus on static sites.

                      1. 7

                        They’re both pretty good search products and you’d be fine choosing either in most cases. Stork has obviously been around for longer and so has some extra polish in that regard. The main advantage (and raison d’être) of Pagefind is that it uses considerably less bandwidth, as it only loads the portions of the index that it needs to complete a search, whereas Stork loads the entire index up front.

                        Stork advantages:

                        • Can be used on content other than html
                        • Stemming can be set for languages other than English and on a per file basis
                        • Result ranking boosts exact matches and down-weights prefix matches and then stop words
                        • Apache licensed

                        Pagefind advantages:

                        • Easier to set up for the common case of a static site generator (just point it at your output dir and go)
                        • Tweaking is done without a separate config file
                        • Uses considerably less bandwidth
                        • MIT licensed

                        Some areas in which I think both could improve:

                        • Neither of them uses BM25 or TFIDF for ranking. BM25 is the industry standard for first-stage ranking and TFIDF is the okayish ranking that most hobbyists will come across. Either would also make stop words obsolete (a rough sketch of BM25 is at the end of this comment)
                        • Neither does language detection for deciding on the stemmer (fairly easy to do with trigram statistics)
                        • Neither of them does query expansion
                        • They’re both fast largely on account of being in Rust, but there is room for better performance by reducing allocation during indexing, using a different index structure for search (easier in Stork’s case than Pagefind’s due to how the chunking constrains choices), and by improving the algorithm for evaluating and merging results lists during the search
                        • There’s still further room available for shrinking the index size in both of them

                        But for the target use case of blogs and small to medium static websites either would likely be fine
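
                        For reference, the BM25 point above boils down to a per-term score along these lines (a rough sketch of the standard Okapi BM25 formula, not something either project implements today):

                            import math

                            def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                                                k1=1.2, b=0.75):
                                # IDF: rare terms count for more, which is what makes stop word
                                # lists mostly unnecessary.
                                idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
                                # Term-frequency saturation plus document-length normalization.
                                tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
                                return idf * tf_part

                            # A document's score for a query is the sum of this over the query terms.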

                        1. 2

                          Wow, thanks for this exhaustive comparison!

                          1. 2

                            Wow, yes, fantastic write-up. I should definitely add a roadmap to the Pagefind documentation, as there are quite a few relevant things in our short-term plans.

                            One of the imminent features to release is multilingual support, much of the plumbing for which is already in place. My intention is to take a shortcut on the trigram statistics angle, and make use of the HTML metadata. If output correctly, a static site should have a language attribute on the HTML (or otherwise be detectable through the URL structure). Using this, Pagefind can swap out the stemmer for that page as a whole. The plan is then that the index chunks would be output separately for each language, and in the browser you choose which language you’re searching up front. In our experience, it isn’t common to want to search all languages of a multilingual website at once. This should be out in a few weeks, and would give you multilingual search still without a configuration file.

                            I haven’t documented Pagefind’s result ranking fully, but it currently should be boosting exact matches and down-weighting prefix matches, which is then combined with a rudimentary term frequency. Medium term I plan to add TFIDF — I have a rough plan for how to get the data I need into the browser without a large network request. Unsure on BM25.

                            Query expansion is hard in a bandwidth sense, as most of the information isn’t loaded in. I do want to experiment with loading in a subset of the “dictionary” that was found on the site (likely the high ranking TFIDF words) and providing some spell checking functionality for those words specifically, if I can do it in a reasonable bandwidth footprint.

                            Speed is something to revisit, bandwidth has been the full priority thus far. I would be keen to hear any thoughts you have on shrinking the index size, though — I’m sure you’ve looked into it already but I have exhausted my current avenues of index shrinkage :)

                            1. 4

                              Thanks :)

                              That sounds like a good plan for multilingual. I agree that it isn’t common to search all languages at once. Hopefully the solution you describe can also be integrated into the website’s language selector so that it can remain completely transparent to the user.

                              That’s good to hear. Sorry that I missed the down-weighting of prefix matches when I was reading the code. If you are implementing TFIDF I highly recommend BM25, as it gives better results with mostly only a formula change. But there seems to be no way to get better than BM25 without extra ranking factors or machine learning: http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf

                              I’m assuming that with the extra network weight of TFIDF you’re referring to having to have all document IDs in the metadata so that you can compute the ranking without requesting the full document bodies? For the term frequency part you should be able to just use an extra byte in every position in the postings list, which shouldn’t be much overhead on a per-chunk basis. There’s no point using more than a byte, as knowing that there are more than 255 instances of “the” in a document gives really minuscule diminishing returns.

                              For the dictionary front you could investigate Levenshtein distance. It would allow you to spell check using only the chunks you’ll have already fetched. Typically the first and last letters of a word will be typed correctly and in the middle will either be a transposition, addition, or removal and likely only one such. I haven’t investigated the state of the algorithms to do that though https://en.wikipedia.org/wiki/Levenshtein_distance
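
                              For reference, the classic dynamic-programming version is tiny (a sketch; real spell checkers usually bound the distance or use automata rather than computing it pairwise):

                                  def levenshtein(a, b):
                                      # Row of edit distances from a[:i] to each prefix of b.
                                      prev = list(range(len(b) + 1))
                                      for i, ca in enumerate(a, 1):
                                          cur = [i]
                                          for j, cb in enumerate(b, 1):
                                              cur.append(min(prev[j] + 1,                # deletion
                                                             cur[j - 1] + 1,             # insertion
                                                             prev[j - 1] + (ca != cb)))  # substitution
                                          prev = cur
                                      return prev[-1]

                                  print(levenshtein("comfigure", "configure"))  # 1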

                              Query expansion proper is very hard and normally is done by mining query logs. General purpose thesauri typically give bad results. And domain specific ones are expensive to create. I’m not sure what the solution there is or if it’s worth covering at all. If you did implement it I would imagine a thesaurus at the start of every chunk covering the words included which should be minimal network overhead

                              Are you doing run length encoding for the postings list yet? I didn’t check sorry. Doing that with group varint, vbyte or simple8 compression will save you the most. You might also want to look into Trie structures which would allow you to compress your terms list considerably and still perform prefix search. As a note I wouldn’t recommend B-Tree structures greater than a depth of 2 (which is how you’ve already implemented Pagefind’s index anyway)
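
                              By vbyte I mean something along these lines (a sketch of delta plus variable-byte encoding of a sorted postings list; group varint and Simple-8b are fancier takes on the same idea):

                                  def encode_postings(doc_ids):
                                      # Store gaps between sorted doc IDs, each as a 7-bits-per-byte varint.
                                      out, prev = bytearray(), 0
                                      for doc_id in doc_ids:
                                          gap, prev = doc_id - prev, doc_id
                                          while gap >= 0x80:
                                              out.append((gap & 0x7F) | 0x80)  # high bit: more bytes follow
                                              gap >>= 7
                                          out.append(gap)
                                      return bytes(out)

                                  print(len(encode_postings([3, 7, 11, 900, 905])))  # 6 bytes for 5 doc IDs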

                              For speed, two easy things. 1. Sort the postings lists on length before merging and merge from shortest to longest (sketch below). This allows you to skip as much as possible when advancing the comparison pointers. 2. Have the document parser return an iterator for the indexer to use, so that you’re not allocating and deallocating all the structures required to temporarily hold the document. Not searching to completion would also speed it up, but I’m not sure that it’s a feature for a small site.
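
                              The shortest-first merge from point 1 looks roughly like this (a sketch for AND-style queries; each longer list is probed from a moving lower bound, so most of it gets skipped):

                                  import bisect

                                  def intersect_postings(postings_lists):
                                      # Start with the shortest list; every later list only shrinks the result.
                                      postings_lists = sorted(postings_lists, key=len)
                                      result = postings_lists[0]
                                      for plist in postings_lists[1:]:
                                          merged, lo = [], 0
                                          for doc_id in result:
                                              lo = bisect.bisect_left(plist, doc_id, lo)
                                              if lo < len(plist) and plist[lo] == doc_id:
                                                  merged.append(doc_id)
                                          result = merged
                                          if not result:
                                              break
                                      return result

                                  print(intersect_postings([[2, 4, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 9, 40]]))  # [4, 9]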

                              To improve result accuracy you might also want to consider keeping a second index for the titles of pages and boosting rankings on those. Quite often people are just wanting to find a specific page again when they search.

                              And finally a question. How much improvement does stemming give if you also support prefix search?

                              1. 2

                                Hopefully the solution you describe can also be integrated into the websites language selector so that it can remain completely transparent to the user

                                That’s the goal — one potential path is that the search bundle is output for each language directory, so the site would load /fr-fr/_pagefind/pagefind.js and get a search experience specifically for that language. Some degree of this will need to be done, as the wasm file is language-specific (I’m avoiding loading in every stemmer)

                                Thanks for the tips on term ranking — also funny that you link to an Otago University paper, that’s where my CS degree is from :) (though I didn’t study information retrieval). The extra byte plan sounds like a good strategy.

                                It would allow you to spell check using only the chunks you’ll have already fetched

                                The reason I have been investigating spellcheck with an extra index is that the chunks as they exist now are difficult to suggest searches from, since the words are stored stemmed. Many words stem down to words that aren’t valid (configuration -> configur) so that doesn’t give me enough information to show some helper text like “Showing results for comfigure configure”.

                                Thesaurus at the start of each chunk would be alright on the network, but if those words were then incorporated into the search we would need to load the other chunks for those words, which would make the network requests heavier across the board unless they were only used in a “no results” setting.

                                Are you doing run length encoding for the postings list yet?

                                Every index that Pagefind spits out is manually gzipped, and gunzipped in the browser from the Pagefind js. It’s been quite a cheap way to get RLE for “free”. I did some brief experiments early on with being smarter about the data structures, but nothing beat a simple gzip. Doing it manually also means that you aren’t reliant on server support, and the compressed versions happily sit in caches.

                                Great tips on speed — I’ll definitely look into those.

                                To improve result accuracy you might also want to consider keeping a second index for the titles of pages

                                I have some plans here to add some generic weighting support. Ideally I can implement something more configurable, with a data-pagefind-weight="4" style tag that can be wrapped around any content, which would provide the ability to add title ranking. I haven’t done much R&D on this yet, but the loose plan is to investigate adding some marker bytes into the word position indexes that can signify the next n words should be weighted higher / lower, without having to split out separate indexes.

                                And finally a question. How much improvement does stemming give if you also support prefix search?

                                Great question! For partially-typed words, not a lot — the prefix search handles that well. For full words stemming provides a rudimentary thesaurus-like search, in that configuration and configuring will both stem down to configur and match each other. Additionally, storing words against their stem makes for smaller indexes, since we don’t need to allocate every version of configur* in the index.
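
                                (To make this concrete, here’s the same thing with a Python Snowball stemmer, purely for illustration since Pagefind’s own stemmer lives on the Rust side:)

                                    from nltk.stem.snowball import SnowballStemmer  # assumes nltk is installed

                                    stemmer = SnowballStemmer("english")
                                    print(stemmer.stem("configuration"))  # configur
                                    print(stemmer.stem("configuring"))    # configur
                                    # Both collapse to the same (non-word) stem, so they match each other,
                                    # and the index only needs to store "configur" once.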

                                These are great questions and tips, thanks for the detailed dig! I’ve been tackling this from the “I want to build search, let’s learn information retrieval” side, rather than the “I know IR, let’s build search” side, so there are definitely aspects I’m still up-skilling on :)

                                1. 1

                                  That’s the goal — one potential path is that the search bundle is output for each language directory, so the site would load /fr-fr/_pagefind/pagefind.js and get a search experience specifically for that language. Some degree of this will need to be done, as the wasm file is language-specific (I’m avoiding loading in every stemmer)

                                  Brilliant. That sounds like it’ll be nice and ergonomic

                                  Thanks for the tips on term ranking — also funny that you link to an Otago University paper, that’s where my CS degree is from :) (though I didn’t study information retrieval). The extra byte plan sounds like a good strategy.

                                  If you look at the literature for performance and compression, its author does very well. There was a fairly recent comparison published for academic open source search engines. A shame you didn’t take the paper, as not many universities teach search engines.

                                  The reason I have been investigating spellcheck with an extra index is that the chunks as they exist now are difficult to suggest searches from, since the words are stored stemmed. Many words stem down to words that aren’t valid (configuration -> configur) so that doesn’t give me enough information to show some helper text like Showing results for comfigure configure.

                                  Seeing as you’re already scanning them with the prefix search, you could store them unstemmed and stem on search. Though the hit to your postings compression might cost as much as doubly storing the words would. Don’t know… would have to test.

                                  Or you could stem the misspelled word, then fix, and silently add the fixed stemmed version to the query. As you’re already doing prefix search you’re gonna get a bunch of results that are good quality but don’t match the query literally anyway

                                  Thesaurus at the start of each chunk would be alright on the network, but if those words were then incorporated into the search we would need to load the other chunks for those words, which would make the network requests heavier across the board unless they were only used in a “no results” setting.

                                  You’d want to include the words/postings found from the thesaurus in the same chunk as the original term as you’re adding them to the query anyway. But yeah not worth talking too deeply about a feature which won’t be worth implementing

                                  Every index that Pagefind spits out is manually gzipped, and gunzipped in the browser from the Pagefind js. It’s been quite a cheap way to get RLE for “free”. I did some brief experiments early on with being smarter about the data structures, but nothing beat a simple gzip. Doing it manually also means that you aren’t reliant on server support, and the compressed versions happily sit in caches.

                                  You’ll find that small integer compression on top of RLE will compress a lot better than GZIP even including the weight of the decompressor. GZIP is a decent general purpose compressor but it can’t beat something that’s specialised

                                  I have some plans here to add some generic weighting support. Ideally I can implement something more configurable, with a data-pagefind-weight=“4” style tag that can be wrapped around any content, which would provide the ability to add title ranking. I haven’t done much R&D on this yet, but the loose plan is to investigate adding some marker bytes into the word position indexes that can signify the next n words should be weighted higher / lower, without having to split out separate indexes.

                                  Sounds like a neat solution. I haven’t experimented with position indexes myself, but bigram chaining is another implementation of phrase searching and may compress better (or worse). Worth being aware of if you weren’t already

                                  Great question! For partially-typed words, not a lot — the prefix search handles that well. For full words stemming provides a rudimentary thesaurus-like search, in that configuration and configuring will both stem down to configur and match each other. Additionally, storing words against their stem makes for smaller indexes, since we don’t need to allocate every version of configur* in the index.

                                  I’ve found in my experience that a lot of words have a stem which is also a word (except when you use the snowball stemmers of course) which can often be the form that users enter in to the search box. And that most articles which talk about configuration will also talk about configuring. But I’m also working on web search and not a product for individual sites so there’s a different precision/recall tradeoff

                                  Another interesting thing about Trie structures is you can use their branching factors to find stems. I haven’t tested this in a search engine context though so I’m not sure if it’s better or worse than snowball. But might be worth playing with https://github.com/takuyaa/yada

                                  I’ve been tackling this from the “I want to build search, let’s learn information retrieval” side, rather than the “I know IR, let’s build search” side, so there are definitely aspects I’m still up-skilling on :)

                                  It’s a fun journey and it’s always good to see more people on it :)

                                  1. 1

                                    Amazing resources, thanks. I’ll definitely be revisiting these comments in the future.

                                    Cheers for the great discussion :)

                                    1. 1

                                      No problem! I very much enjoyed it as well :)

                          2. 3

                            My impression is that it will use less bandwidth compared to Stork.

                            1. 2

                              Yes, bandwidth is the leading differentiator here. If you look at Stork’s test site here, the index for 500 pages is nearly 2MB after compression, and the wasm itself is 350KB.

                              The Pagefind XKCD demo has 2,500 pages. The exact bandwidth depends on how much you search, but a simple search can come in at around 100KB including the wasm, js, and indexes.

                            1. 3

                              Over the course of a year, I have visitors from almost every country in the world to https://pythonspeed.com. Even on a daily basis I get a bunch of visitors from countries where it’s pretty expensive to pay bandwidth. I feel a little guilty about the web fonts, though I tried to make them small.

                              All the search solutions I’ve found have either used a bunch of bandwidth, or didn’t even bother talking about it, to the point where I assumed they’d just use lots out of not-caring.

                              But this seems very promising:

                              “For a 10,000 page site, you can expect to perform a single-word search with a total network payload under 300KB — including the Pagefind javascript and webassembly libraries.”

                              1. 7

                                Author here — it’s been a great couple of weeks hearing from people that I’m not alone in the frustrations I have had with picking a search tool. Hopefully it would come in closer to 100KB for you — that 300KB figure is from my testing on a clone of MDN

                              1. 4

                                There’s examples and somewhat more useful explanations in the readme: https://github.com/jakobeha/mini-rx

                                1. 6

                                  https://www.keyvalues.com/culture-queries has an excellent set of questions that are tuned to different people’s needs; different people want different things.

                                  1. 12

                                    Apple calls these “Silicon” CPUs just to throw in a little meaningless terminology confusion

                                    Not to be overly pedantic, but they don’t call them “Silicon”, they call them “Apple Silicon” — as in “silicon made by Apple”, not “Silicon, the Apple product”.

                                    1. 1

                                      Thanks, fixed (and fixed the weird formatting too; I’m still on the fence about Emacs smart parens: sometimes it’s great, sometimes it messes up my markdown).

                                    1. 1

                                      Or you could statically link against musl and run successfully on all those systems.

                                      1. 1

                                        For executables, yeah. In the Python world, extensions are shared libraries, so you need to build a version matching the host system’s libc.

                                      1. 2

                                        Eshell was always frustrating, so I tend to use ansi-term when inside Emacs, but even that one has issues because of how wonky the terminal is with evil-mode for me (difficulties escaping insert mode, or difficulties coming back into focus in insert mode).

                                        Very cool patch and I hope others get use out of it.

                                        1. 10

                                          Try https://github.com/akermu/emacs-libvterm, it’s much better than ansi-term.

                                          1. 4

                                            Makes sense, the interop with evil mode is one of the reasons I use eshell over a different terminal emulator.

                                            This change also speeds up ansi-term because that also uses a PTY to communicate with emacs.

                                            I probably should’ve emphasized that more in the blog post originally, since it’s really improving subprocess output handling across all of emacs.

                                          1. 1

                                            Docker + debian:oldoldstable would be more straightforward, no?

                                            1. 2

                                              Old images are the standard in the Python world, yes (usually CentOS 7-based images). Except this gets to be a pain when you need more modern toolchains. Like, I’m doing cross-language LTO between Rust and C, so I need clang 13, and you can’t get that for really old distros…

                                              (But the linked technique doesn’t seem viable for Rust, so doesn’t actually help me).

                                              1. 1

                                                Yep fair enough, that makes sense.

                                            1. 4

                                              Via Twitter, someone tried a more realistic bit of code and got a nice speedup from 1.2s to 0.8s: https://gist.github.com/llimllib/11d4707e70738db05ee5843a2b81a35c

                                              Not a huge win, true, but if you can make the top-100 Python libraries all run 30-50% faster the impact would be significant overall, given the scale of Python deployments.

                                              1. 6

                                                The more I do threading… the more I like processes.

                                                1. 3

                                                  This is one of the many reasons I love Rust: threading is safe, so you get the benefits of shared address space without the downsides.

                                                  1. 4

                                                    so you get the benefits of shared borrowed address space

                                                    Tiny adjustment to your statement that doesn’t invalidate what you’re saying…

                                                  2. 3

                                                    Likewise, especially when I realised that there’s no way to kill a system thread properly, and only hacks to kill a Python thread, none of which works if your thread is blocking on a system call.

                                                  1. 2

                                                    Dumb question, but why can’t we do something about the GIL if it hurts parallelism? Maybe an option to remove/disable it? I think it must’ve been done somewhere.

                                                    1. 14

                                                      One reason it is technologically hard is that, at the moment, any operation that involves only a single Python bytecode op, OR any call into a C extension which doesn’t release the GIL or re-enter the Python interpreter, is atomic. (Re-entering the Python interpreter may release the GIL.)

                                                      This means all kinds of things are atomic operations in Python. Like dict reads/writes and list.append(), either of which may call malloc or realloc in the middle.

                                                      You can write many data race-y programs in Python that have well-defined (messy, but still well-defined) semantics. I think nobody in the world has an idea of how much code there might be in the wild that (possibly accidentally) abuses this. So making data races undefined behaviour would be quite a large backwards compatibility break, in my opinion.
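
                                                      For example, this kind of lock-free code is everywhere today and has a deterministic result (a sketch; each append is a single C-level call that holds the GIL, even though the loop as a whole can interleave):

                                                          import threading

                                                          items = []

                                                          def worker():
                                                              for i in range(100_000):
                                                                  items.append(i)  # one atomic C call under the GIL

                                                          threads = [threading.Thread(target=worker) for _ in range(4)]
                                                          for t in threads:
                                                              t.start()
                                                          for t in threads:
                                                              t.join()

                                                          print(len(items))  # always 400000, with no lock anywhere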

                                                      You don’t want to “just” slap a mutex on every object because then the lock/unlock calls would kill performance.

                                                      I believe the PyPy developers are/were looking at shipping an STM implementation and the GILectomy fork involves a lot of cleverness of which I can remember no details.

                                                      1. 6

                                                        There have been (more than) a few experiments to remove the GIL in the past 20 years. To my knowledge they end up performing worse or being less safe.

                                                        There’s a new PEP to get a more granular GIL.

                                                        1. 11

                                                          There is an exciting new approach by Sam Gross (https://github.com/colesbury) who has made an extremely good NOGIL version of Python 3.9 (https://github.com/colesbury/nogil). It performs almost without any overhead on my 24-core Mac Pro test machine.

                                                          It is a sensational piece of work, especially as, as you mention, there have been so many other experiments. I know Sam has been approached by the PSF. I am crossing my fingers and hope they will merge his code.

                                                          1. 9

                                                            I’ve been struggling with a Python performance issue today that I suspected might relate to the GIL.

                                                            Your comment here inspired me to try running my code against that nogil fork… and it worked! It fixed my problem! I’m stunned at how far along it is.

                                                            Details here: https://simonwillison.net/2022/Apr/29/nogil/

                                                          2. 6

                                                          They tend to perform worse on single-threaded workloads. Probably not all, but I’m quite sure that several attempts, even rather naive ones, produced multi-threaded speedups, but at the cost of being slower when running on a single thread.

                                                          Even ideas that succeeded in improving multi-threaded performance got shot down, because the core team believes this (slower single-core for faster multi-core) is not an acceptable trade-off.

                                                            1. 4

                                                            IIRC the position was taken fairly early on by Guido that proposals to remove the GIL would not be accepted if they imposed slowdowns on single-threaded Python on the order of… I think a cutoff of about 5% or 10% might have been suggested?

                                                              1. 1

                                                                That’s kind of what I remember too.

                                                          3. 4

                                                            There are experiments underway, e.g. https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6ad0/, and there have been previous attempts that failed.

                                                            1. 3

                                                          Because, allegedly, the gain in safety is greater than the loss in concurrency efficiency.

                                                          It is a reliable, albeit heavy-handed, way of ensuring simple threaded code generally works without headaches. But yes, it does so by eroding the gains of multithreading to the point of questioning whether it should exist at all. Arguably.

                                                          Some async libraries mimic the threading API while resorting to lower-level async primitives. Eventlet and gevent come to mind.

                                                              1. 2

                                                                No, it’s about performance and a little bit about compatibility.

                                                            Most Python programs are single-threaded, and removing the GIL would not cause most of those to want to become multi-threaded, since the average Python program’s workload is not something that benefits from being multi-threaded. And basically every GIL removal attempt has caused a performance regression for single-threaded Python programs. This has been declared unacceptable.

                                                                Secondarily, there would be a compatibility issue for things which relied on the GIL and can’t handle having the acquire/release turned into no-ops, but the performance issue is the big one.

                                                                1. 2

                                                                  And basically every GIL removal attempt has caused performance regression for single-threaded Python programs. This has been declared unacceptable.

                                                                  Why does this happen?

                                                                  1. 5

                                                                    Most of the time when a GIL removal slows down single-threaded code, it’s because of the GC. Right now Python has a reference-counting GC that relies on the GIL to make incref/decref effectively atomic. Without a GIL they would have to be replaced by more cumbersome actually-atomic operations, and those operations would have to be used all the time, even in single-threaded programs.

                                                                    Swapping for another form of GC is also difficult because of the amount of existing extension code in C that already is built for the current reference-counting Python GC.
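
                                                                You can see how pervasive those counter updates are from pure Python (a sketch; the exact numbers vary by Python version):

                                                                    import sys

                                                                    x = object()
                                                                    print(sys.getrefcount(x))  # typically 2: x itself plus the call's argument

                                                                    y = x                      # a plain assignment does an incref...
                                                                    print(sys.getrefcount(x))  # ...so this is one higher

                                                                    del y                      # ...and del does a decref
                                                                    print(sys.getrefcount(x))  # back to the original count

                                                                    # Every one of these incref/decref pairs would have to become a real
                                                                    # atomic operation without the GIL, which is where the single-threaded
                                                                    # slowdown comes from.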

                                                              2. 2

                                                                Because significant portions of the Python ecosystem are built with a GIL in mind, and would probably break the moment that GIL is removed. You’d essentially end up with another case of Python 2 vs Python 3, except now it’s a lot more difficult to change/debug everything.

                                                                1. 2

                                                                  A heavy-handed approach is to use multiprocessing instead of multithreading. Then each subprocess gets its own independent GIL, although that creates a new problem of communicating across process boundaries.
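
                                                                A minimal sketch of that approach (each pool worker is a separate interpreter with its own GIL, so CPU-bound work actually runs in parallel, at the cost of pickling arguments and results across the process boundary):

                                                                    from multiprocessing import Pool

                                                                    def cpu_bound(n):
                                                                        return sum(i * i for i in range(n))

                                                                    if __name__ == "__main__":
                                                                        with Pool(processes=4) as pool:
                                                                            print(pool.map(cpu_bound, [10_000_000] * 4))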

                                                                1. 3

                                                                  I would like a convention-over-configuration framework for Rust.

                                                                  Also, I don’t have time to take a real look at it right now.

                                                                  But wow, that name is super cute.

                                                                  1. 4

                                                                    The blog post is posted and locked in time now. Time continues forward always. This article diverges starting now.

                                                                    Frameworks sometimes have upgrade guides going from version to version. Unless someone packages this blog post as a tool, with generators and CLIs, it’s copy-and-paste, which is where forking and bit-rot start.

                                                                    Even generators have a tricky problem of revisiting vs heirloom configs. If I generate a project using the FooFramework on 1.0, then follow the upgrade path from 2.0 -> 3.0 -> 4.0, what do I expect to happen? I generated 1.0. Now the world is on 4.0. Who tells me where I’m at with my mix of libraries and decisions? Deprecation warnings along the way? Even the most battle-hardened frameworks, docs and communities have bit-rotted comments in this situation.

                                                                    I would not want to be running “stuff from a blog post 0.0.0”. However, that doesn’t mean this blog post is bad. It’s got the steps and recommendations (which are valuable). Next step, make a script or a template for people to use. But now it’s heading towards a framework.

                                                                    It’s like script iteration:

                                                                    • Write down the steps, curate things, collect knowledge
                                                                    • Put those steps in scripts
                                                                    • Polish the scripts into programs
                                                                    • Apply software rigor etc

                                                                    I like all the Rails copycats. It’s good for everyone. Next, Redwood, Blitz, Remix are all very familiar. Would be great to see a low level language pull this off. My current thoughts are that it can’t be done for whatever reason or I’m wrong and it just hasn’t been done yet. Very hard to do tippy-top abstractions in assembly. Rocket and Buffalo (Go) are the nearest I’ve tried.

                                                                    Someone on twitter said something like

                                                                    rails for rust? be careful what you wish for

                                                                    I think if the idea is solid, implementations will converge.

                                                                    1. 5

                                                                      As an aside, there is one generator/template-builder/cookiecutter alternative that actually supports updates: https://copier.readthedocs.io/en/stable/updating/

                                                                      1. 1

                                                                        Very cool. Bookmarked.

                                                                        When updating, Copier will do its best to respect your project evolution by using the answers you provided when copied last time. However, sometimes it’s impossible for Copier to know what to do with a diff code hunk. In those cases, you will find *.rej files that contain the unresolved diffs. You should review those manually before committing.

                                                                        Mac does this on updates. It makes a directory on the desktop called Relocated Items full of files it doesn’t know what to do with. Redhat/CentOS does .rpmsave (iirc), debian/ubuntu does .dpkg-old.

                                                                        Note that this is only a problem with code comments (not annotations) that don’t execute. There’s a way to verify you don’t have regressions on upgrades: testing on many levels. And deprecation warnings (annotations). But not docs and comments (without some hoops).

                                                                        I tried to update my PC bios recently (excuse the example) and I tried to go from A -> C. It said my .zip was corrupted. I had to go from A -> B -> C. And it worked. Same file. It can’t even error cleanly. A has no idea what the file C from the future is. B had a breaking change.

                                                                        1. 1

                                                                          Interesting. I have my own Cookiecutter clone. Maybe I should start dumping answers to interactive questions to the user cache directory so that it can be rerun easily.

                                                                      2. 3

                                                                        I tried building one years back, but wasn’t able to find collaborators. Had a good name though: gerust.