1. 74
  2. 29

    This is a great post. I’m now going to be unfair and focus on one tiny part of it to criticize.

    …a mass insertion of b'' prefixes everywhere… would require developers to think about whether a type was a bytes or str…

    In addition, …the added b characters would cause a lot of lines to grow beyond our length limits and we’d have to reformat code.

    This is a great example of the kind of brain damage fostered by style guides: we start to think concerns of style are as important as concerns of semantics (the first sentence above). Would it really be so bad to wait until the migration is done to perform a final reformatting? Just increase the line size limit until then in any linting hooks.

    I don’t mean to criticize just this post. We all make penny-wise, pound-foolish decisions like this every day. Myself included. Every rule we introduce has a cost, in loss of discretion and atrophying of individual judgement.

    1. 5

      The problem is that especially in large organizations there really is value in simply picking a set of style guidelines and then requiring that any checked in code adhere to them.

      This isn’t about bike shedding either. When I started at my current job we had HUGE bodies of Python code written by folks who decidedly weren’t Python programmers. It used inconsistent tab spacing, didn’t follow variable naming conventions and generally ignored Python’s style conventions entirely.

      The net result was that the code was MUCH harder for the long-time Pythonistas on the team to read and maintain, and nearly impossible for newbies to modify.

      1. 2

        I agree with all that in isolation, but I don’t think it counters anything I said about balancing syntax and semantics.

        Yes, if you leave style totally flexible, really large teams will have a really bad time. But there’s a wide spectrum here. There is always some value, but there are also costs to balance with other priorities. You don’t have to be perfectly rigid 100% of the time.

        1. 1

          You don’t have to be perfectly rigid 100% of the time.

          Anybody who is perfectly rigid with any style guide is Just Plain Doing it Wrong :)

          1. 3

            Orwell’s final rule: “Break any of these rules sooner than say anything outright barbarous”.

      2. 2

        This is why I use a 79 character line limit.

        That’s a joke, obviously, the real reason is much worse.

        1. 1

          I’m confused. Are there no auto-formatting tools for python that can wrap the lines for you? That seems the ideal fix.

        2. 12

          Guido’s retrospective mirrors many of the points in your article. While this won’t fix past mistakes, we can at least be reasonably confident that the same mistakes won’t be repeated with “Python 4”.

          Regarding the Unicode changes, as a user I’ve seen many cases of “UnicodeDecodeError: \xef out of range” (or whatever it was) errors, and as a developer I fixed many of them in my own programs as well. Python 3 really does make things easier here. I appreciate it’s not useful for Mercurial specifically, and that the current stdlib usage may introduce some problems, but it also solved a lot of them. And it seems to me that the stdlib problems are fixable(?) Or are the Python maintainers unwilling to do so?

          Personally, I rather like Go, and use it for most places where I previously used Python. It has somewhat similar (though not identical) design ethics: most of The Zen of Python applies equally well to Go, perhaps sometimes even more so than Python. Rust, on the other hand, seems more similar to Ruby’s design ethics, which is not necessarily a bad thing; I worked with Ruby for several years and liked it. It’s just different.
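For readers who have not hit the error mentioned above, here is a minimal sketch of the Python 3 behaviour: bytes and str never mix implicitly, so decoding is an explicit step and the failure surfaces at the decode call. The sample bytes and the `try_ascii` helper are made up for illustration.

```python
# Hypothetical input: a UTF-8 byte-order mark followed by ASCII text.
raw = b"\xef\xbb\xbfhello"


def try_ascii(data: bytes) -> str:
    """Decode as ASCII, returning a short message instead of raising."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason} at byte {exc.start}"


ascii_result = try_ascii(raw)          # the \xef byte is outside ASCII
utf8_result = raw.decode("utf-8-sig")  # explicit codec that also strips the BOM

mixing_fails = False
try:
    "prefix: " + raw                   # str + bytes raises TypeError on Python 3
except TypeError:
    mixing_fails = True

print(ascii_result)
print(utf8_result)
print(mixing_fails)
```

The point of the sketch is only that the error appears where the bytes are decoded, not somewhere later when the two types happen to meet.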

          1. 3

            I’ve thought for a while that Go is to Python as Rust is to Ruby. It’s nice to see someone else say it.

            1. 2

              I agree. I also think Go is to Java as Rust is to C++, and it’s funny that both analogies work.

            2. 2

              I just wish that Python had as good a binary packaging story as Go has. If I could build a system-specific Python binary the way I can with Go, I don’t think I would really be tempted to switch. But that packaging story makes me write a lot of infrastructure tools in Go that I would traditionally have written in Python, because carrying Python around is such a chore.

              1. 2

                Could you use http://www.pyinstaller.org/ ? I haven’t used it myself, but I did use py2exe back in the day to ship a bunch of internal python utilities.

                1. 2

                  The article addresses this. Mercurial will be adopting PyOxidizer for distribution, and you can too.

                2. 1

                  Personally, I rather like Go, and use it for most places where I previously used Python. It has somewhat similar (though not identical) design ethics: most of The Zen of Python applies equally well to Go, perhaps sometimes even more so than Python. Rust, on the other hand, seems more similar to Ruby’s design ethics, which is not necessarily a bad thing; I worked with Ruby for several years and liked it. It’s just different.

                  I see this a lot, and we’re into the realm of the inherently subjective and personal here, but Go feels so different to me that I find it hard to grasp the comparison.

                  Go’s level of abstraction is much closer to what C feels like to me. I’m back to worrying about making errors in code I must write myself, code that Python’s batteries-included philosophy would otherwise handle.

                  I’m glad Go makes you happy, and I hope one day to feel the love, but I’m not nearly there yet and still find Python to be far and away my language of choice for day to day work.

                3. 9

                  I find that the Python 3 language changes were courageous (not in the Apple way), and the core Python group did listen to the community and reintroduced features that eased the transition in 3.3 and 3.5. Meanwhile, a lot has happened in the language, from list and iterator comprehensions, to a pretty good async foundation layer and supporting syntax, and now best-in-class typing annotations. I mean, look at Ruby in comparison. It’s fair to say that the Py3K transition has taken a long time, but I think it actually happened quite well, with good communication between the community and the maintainers.

                  1. 7

                    I had a laugh reading this sentence:

                    When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.

                    1. 9

                      Wow I really sympathize with all of this. I can see how poor a fit Python 3 is for Mercurial, having ported Oil from Python 2 to 3, and then back to 2 again, mainly for the reason of strings. (That was early in the project’s life and it didn’t have users, so it was easy.)

                      I agree with all this:

                      the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications

                      We effectively sludged through mud for several years only to wind up in a state that feels strictly worse than where we started

                      I think we talked about this before, and you mentioned PyOxidizer, which I again saw in the post.

                      But after reading about all this exhausting effort, I’m still left thinking that it would have been less effort and you would have a better result if Mercurial had bundled a Python interpreter.

                      I feel like that’s 10x less work than I read in the post, and it would have taken 10x less time, and you would have the better model of UTF-8 strings.

                      It doesn’t matter if distros don’t package Python 2, because it can be in the Mercurial tarball. (People keep invoking “security” but I think that’s a naive view of security. If someone asks I’ll dig up my previous comment on that. Also I don’t think the Python 2.7 codebase is that hard to maintain and you can get rid of > 50% of it.)

                      I guess it doesn’t matter now, but honestly reading the post confirmed all the feelings I had and I personally would have abandoned such an effort years in advance. Despite my love for Python, and using it for decades, it’s not a stable enough abstraction for certain applications, including a shell and a version control system (e.g. ask me about my EINTR backports). To be fair, very few languages are suitable for writing a POSIX-compatible shell – e.g. I claim Go can’t and won’t be used for this task because of its threaded runtime.

                      In fact I used to be a Mercurial user and I remember getting ImportError because of some “fighting” over the PYTHONPATH that distros and the numerous layers of package managers do. I still use distutils and tarballs for important applications because it’s stable. I use virtualenv reluctantly, and try to avoid pip / setuptools (and I roll my eyes whenever I hear about a Python package manager that adds layers rather than rethinking them). The stack is really tremendously bad and unstable, and it makes software on top of it unstable.

                      It’s not something I want my VCS touching. So avoiding all of that and the resulting increase in stability is a huge reason to embed the Python interpreter IMO. (BTW Oil is getting rid of the Python interpreter altogether, but Python was hugely helpful in figuring out the algorithms, data structures, and architecture. That’s what I like Python for.)

                      1. 7

                        Thank you for the thoughtful comments.

                        The subject of bundling a Python interpreter is complex. I’m a proponent of bundling the Python interpreter with Mercurial (or most Python applications for that matter) because it reduces the surface area of variability. If that were the exclusive mode of distribution, we could distribute the latest, greatest Python interpreter and drop support for the older version quickly after new versions are released. Wouldn’t that be nice!

                        Of course, distributing your own interpreter means now you are responsible for shipping security updates in Python (or potentially any of its dependencies), effectively meaning you need to be prepared to release at any time.

                        Then there’s the pesky problem of actually executing on application distribution. It’s a hard problem space that requires resources. It has historically not been prioritized outside of Windows in the Mercurial project because of lack of time/resources/expertise. (Windows installers exist likely because the alternative is nobody uses your software otherwise since they can’t install it!)

                        There’s also the problem of Linux distributions, which treat Python distribution fundamentally differently from how they treat compiled languages. Distributions would insist on unbundling the Python interpreter from Mercurial as well as 3rd party libraries which they have alternate means of distributing. This creates a myriad of problems and slows everyone down. I recommend reading https://fy.blackhats.net.au/blog/html/2019/12/18/packaging_vendoring_and_how_it_s_changing.html and generally agree with its premise that packaging should revolve more around applications, as it would be more user friendly for application developers and end-users alike.

                        1. 2

                          I agree you might get some eyebrows raised from distro maintainers, but you wouldn’t be alone. I thought Blender also embedded Python but maybe I’m wrong.

                          Based on my limited experience with Oil, I think it would be a minor issue but not a blocking one. CPython is plain C code without dependencies, so it doesn’t cause many problems for distros.

                          https://github.com/oilshell/oil/wiki/Oil-Deployments

                          To me, the Windows example just shows that it’s possible and a known amount of work. It sounds like much less work than the years-long migration that you described.

                          The only reasons I see for migrating are “memes”, misplaced social pressure, and vague fears about security. They don’t seem particularly solid, especially when compared with the downsides of the alternative. Like the possibility for data corruption in a VCS. I understand that it was the “accepted” thing to do, but I’m explicitly questioning the accepted wisdom.

                          I agree there’s room for a CPython post-mortem. But what you described in your post is nothing less than a disaster that’s not over yet! So that might warrant a Mercurial post-mortem as well! I hope that doesn’t come off as rude because it’s not meant to be. I really appreciate you writing this up and I think it got a lot of deserved attention, including from the CPython core team.

                          1. 2

                            Another problem you are likely to encounter is the sheer size of your distributable artifact. As far as I know, there’s no good way to eliminate dead code in Python, so you’ll need to ship all of the code that is imported (including transitively), even in conditional imports. Additionally, the interpreter and many libraries (including standard libraries) depend on shared object libraries, so do you also include those in the bundle? I wouldn’t be surprised at all if any nontrivial application bundle was gigabytes in size, even compressed.

                            1. 2

                              No, that’s not a problem. Oil ships 1.1 MB of code from the CPython binary (under GCC, 1.0 MB under Clang).

                              http://www.oilshell.org/release/0.7.pre11/benchmarks.wwz/ovm-build/

                              I’m pretty sure that’s less than the size of a “hello world” HTTP server for languages like Rust or Go.

                              I removed dead code with the process described in Dev Log #7: Hollowing Out the Python Interpreter, but it was never more than 1.5 MB done naively, so you don’t even need to do that.

                              I imagine mercurial is pretty much like a shell – it reads and writes the file system, and does tons of byte string manipulation.

                              In Python you can generate a module_init table to statically link the libraries you care about. It’s never going to reach gigabytes in any case, or if it was then the equivalent C program would be gigabytes.

                              There are some downsides to what I did, for sure. But what I’m comparing them to is the multiple person-years of work described in the post, and the pages full of downsides.

                              It’s not only doing all that migration work – it’s that the end result is actually worse. He says he anticipates “a long tail of bugs for years” and I would suspect the same. The source code would be in much better shape due to needing to support just one Python version. The effort from maintainers could have gone elsewhere, e.g. to improving the program’s functionality and fixing other bugs.

                              I’m sorry they were in this situation. It sounds like nothing less than a disaster, and being faced with a disaster should motivate unconventional solutions (which this really isn’t, because plenty of apps embed the Python interpreter).

                              1. 2

                                Sorry, my dead code elimination comment was directed at Python libraries, not the Python interpreter. I believe our 1-year-old Python application bundle (using pex) is on the order of 500 MB compressed, and that’s not including the CPython interpreter, standard libraries, etc; just the application code and third-party dependencies. Most of that is certainly dead code in third-party dependencies. I’m assuming more mature applications are quite a lot larger.

                                I completely agree with your analysis of the unfortunate situation Mercurial found itself in.

                          2. 2

                            I found the previous thread where I commented on security:

                            https://lobste.rs/s/3vkmm8/why_i_can_t_remove_python_2_from_my_systems#c_uxwpzg

                            The tl;dr is that I’m wondering why bundling/embedding wasn’t considered as the FIRST solution, or at least after 2 of the 10 years of struggles with Python 3.

                            My suspicion is that it’s because it feels “wrong” somehow, and because there was some social pressure to abandon Python 2.

                            But as far as Mercurial is concerned, I think that solution is better in literally every dimension of engineering – less short term effort, less long term effort, more stable result, etc.

                            1. 5

                              The tl;dr is that I’m wondering why bundling/embedding wasn’t considered as the FIRST solution, or at least after 2 of the 10 years of struggles with Python 3.

                              Wouldn’t desire to support third-party extensions (written in Python) make this problematic? You’d essentially be creating a Mercurial dialect of Python that drifts from the mainline Python everyone knows over time. Oil’s case is different since Python isn’t part of the exposed API surface.

                              1. 2

                                Yes that’s a good point, I forgot Mercurial had Python plugins.

                                But I would say that you’re breaking the plugins anyway by moving from Python 2 to 3. So that would be an opportunity to make it more language agnostic – and that even has the benefit that you could keep plugins in Python 2 while Mercurial uses Python 3!

                                I’m not sure exactly what the plugins do, but IPC and interchange formats are more robust and less prone to breakage. For example I looked at pandoc recently and it gives you a big JSON structure to manipulate in any language rather than a Haskell API (which would have been a lot easier for them to code). I never used it but I’ve seen a lot of systems like this. Git hooks also use textual formats.

                                I have a lot of experience with Python as a plugin language, and while I’d say it’s better than some alternatives, it’s not really a great fit and often ends up getting replaced with something else. The Python version is an issue, even though Python is more stable than many languages.
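To make the IPC idea above concrete, here is a rough sketch of a host handing a JSON document to a plugin process over stdin and reading JSON back from stdout. Everything in it is hypothetical: the `call_plugin` helper, the `event`/`files` payload, and the inline plugin are invented for illustration and are not Mercurial's or pandoc's actual interface.

```python
import json
import subprocess
import sys

# Hypothetical plugin: reads one JSON document on stdin, writes one on stdout.
# It could be written in any language; Python is used here only for brevity.
PLUGIN_SOURCE = r"""
import json, sys
request = json.load(sys.stdin)
request["files"] = sorted(name.upper() for name in request["files"])
json.dump(request, sys.stdout)
"""


def call_plugin(payload: dict) -> dict:
    """Run the plugin in a separate process and exchange JSON with it."""
    proc = subprocess.run(
        [sys.executable, "-c", PLUGIN_SOURCE],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise if the plugin exits with a non-zero status
    )
    return json.loads(proc.stdout)


reply = call_plugin({"event": "pre-commit", "files": ["b.txt", "a.txt"]})
print(reply)
```

Because the only contract is the JSON shape, the plugin's interpreter version is independent of the host's, which is the property the comment is after.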

                          3. 4

                            He mentions one technique that I can confirm is truly excellent for any major refactoring (especially when other programmers are still pouring in code while you’re working…)

                            A stop-gap solution to the b'' everywhere issue came in July 2016, when I introduced a custom Python module importer that rewrote source code as part of import when running on Python 3. (I have previously blogged about this hack.) What this did was transparently add b'' prefixes to all un-prefixed string literals as well as modify how a few common functions were called so that we wouldn’t need to modify source code so things would run natively on Python 3. The source transformer allowed us to have the benefits of progressing in our Python 3 port without having to rewrite tens of thousands of lines of source code. The solution was hacky. But it enabled us to make significant progress on the Python 3 port without externalizing a lot of cost onto others.

                            I thought the source transformer would be relatively short-lived and would be removed shortly after the project inevitably decided to go all in on Python 3. To my surprise, others built additional transforms over the years and the source transformer persisted all the way until October 2019, when I removed it just before the first non-alpha Python 3 compatible version of Mercurial was released.

                            i.e. split the problem into a Pareto 80/20 solution. Automate away the 80% that’s automatable, and keep a flock of patches that manually fix the 20% that’s hard to automate.

                            Only on the flag day, once everything is working, do you run the transformer for the last time, apply the patches and push to mainline.
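As a rough illustration of the automatable part, the sketch below uses the standard `tokenize` module to add a `b` prefix to string literals that have no prefix. It covers only the rewriting step, not the import hook the post describes, and the `add_bytes_prefixes` name is made up here. It is also deliberately naive: it would turn docstrings into bytes literals too, which a real transformer would need to handle.

```python
import io
import tokenize


def add_bytes_prefixes(source: str) -> str:
    """Return source with a b prefix on every un-prefixed string literal."""
    rewritten = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # A STRING token that starts with a quote has no prefix (no b, u, r, f).
        if tok.type == tokenize.STRING and tok.string[0] in "'\"":
            tok = tok._replace(string="b" + tok.string)
        rewritten.append(tok)
    return tokenize.untokenize(rewritten)


sample = "name = 'hg'\nraw = b'x'\nuni = u'y'\n"
transformed = add_bytes_prefixes(sample)
print(transformed)
```

Literals that already carry a prefix are left alone, so code that has been ported by hand keeps working while the rest is rewritten on the fly.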

                            1. 3

                              One hell of a writeup. Found myself reliving various porting difficulties from a past role while reading this.

                              1. 4

                                It seems like a lot of these issues could have been solved by starting much earlier.

                                This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn’t have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

                                They did need to think about whether a type was a string or a byte array, because to not think about that is simply nonsensical. That’s like not thinking about whether a type is an int or a float. They’re fundamentally completely different things! They’re less alike than int and float, IMO.

                                Then there’s filename handling, where Python assumes the existence of a global encoding for filenames and uses this encoding to convert between str and bytes. And it does this despite POSIX filesystem paths being a bag of bytes where the only rules are that \0 terminates the filename and / is special.

                                Given you can pass bytes to open, it’s not really clear what the issue is meant to be here? I must be misunderstanding something, I guess.
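A small sketch of both points in this comment, with made-up sample data: how differently str and bytes behave on Python 3, and that `open` accepts a bytes path directly.

```python
import os
import tempfile

text = "na\u00efve"              # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: a sequence of integers in 0..255

lengths = (len(text), len(data))     # code points vs encoded bytes
first_items = (text[0], data[0])     # a one-character str vs an int
equal = (text == data)               # a str never equals a bytes object

# open() accepts a bytes path, so this filename is never decoded by our code.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(os.fsencode(tmp), b"example.txt")
    with open(path, "wb") as fh:
        fh.write(data)
    with open(path, "rb") as fh:
        round_trip = fh.read()

print(lengths, first_items, equal, round_trip == data)
```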

                                1. 3

                                  It’s been a while since I stared into this particular abyss, but the issues had to do with things like listing files returning Unicode normalized variants, which made getting the real filename problematic. (I don’t remember if this specific thing was one of the issues; just saying, this class of issues.) There’s a lot more than just open when it comes to file names.
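One slice of this class of issues can be shown without touching normalization: the type of names returned by a directory listing depends on the type of the argument, and on POSIX systems `os.fsdecode` and `os.fsencode` round-trip undecodable bytes. This is a sketch under those assumptions; the file names are invented, and behaviour on Windows differs.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "report.txt"), "w").close()

    as_str = os.listdir(tmp)                 # str argument -> list of str
    as_bytes = os.listdir(os.fsencode(tmp))  # bytes argument -> list of bytes

# A name that is not valid UTF-8. Nothing is written to disk here, because
# some filesystems may reject such names.
odd_name = b"caf\xe9.txt"
decoded = os.fsdecode(odd_name)   # filesystem encoding with surrogateescape on POSIX
restored = os.fsencode(decoded)   # maps back to the original bytes

print(as_str, as_bytes, restored == odd_name)
```

Code that needs the exact on-disk name can therefore stay in bytes end to end, which is presumably why a bytes-centric codebase like Mercurial cares about every API in this family and not just `open`.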

                                  1. 0

                                    They did need to think about whether a type was a string or a byte array, because to not think about that is simply nonsensical.

                                    But if you’re working exclusively on English-language software, you can get away with not thinking about it most of the time. Even now, my understanding of Unicode issues is relatively abstract, because it hardly ever comes up for me.

                                    1. 1

                                      I assume you mean software with an interface in English.

                                      I think any software that deals with human names and doesn’t use Unicode is either broken or of very limited use.

                                      1. 1

                                        Yeah, names don’t really come up, for me, so far.

                                  2. 1

                                    The article mentions the ability to skip blame in commits. That sounds very useful, but is there something for Git?

                                    I’ve also been thinking about a tool that checks the affected lines’ previous commit and stores it in the commit message, but that approach has mostly corner cases.

                                    Anyone have experience in this space outside Mercurial?