1. 2

    PyOxidizer is a Rust application and requires Rust 1.33+ to be installed in order to build binaries.

    Hmm, it would have been nicer if PyOxidizer had been meta, i.e. if it were itself distributed as a self-contained, single-file executable, so that those of us who are not interested in installing yet another language toolchain on our computers could grumble less.

    1. 7

      I acknowledged this in the post:

      It is a bit unfortunate that I force users to install Rust before using PyOxidizer, but in my defense the target audience is technically savvy developers, bootstrapping Rust is easy, and PyOxidizer is young, so I think it is acceptable for now.

      I will almost certainly provide pre-built executables once the project is more mature. Thank you for the feedback.

        1. 1

          That’s great! Always on the lookout for good ways to distribute Python code to the end user. I generally deal with CLI programs, but I’ve created PySide based programs and programs using other toolkits. The other tools I’ve used (PyInstaller, cx_Freeze type things) tend to not do well with some frameworks. Hope this will deal with those too!

          1. 1

            I applaud you for your strategy and tactics! Wonderfully done. I was thinking of similar for a different language. I will really have to deconstruct what you have done here.

            What was your inspiration? Are there similar systems? What is your long term goal?

            What would it take to support PyPy?

            1. 11

              Inspiration was a few things.

              First, I’m a core contributor to Mercurial, which is a massive, systems-level, (mostly) Python application. From a packaging and distribution standpoint, my personal belief is that Python hinders Mercurial. We can’t aggressively adopt modern Python versions, there’s a lot of variance in Python execution environments that creates a long tail of bugs, etc. On the performance front, Python’s startup overhead is pretty poor. This prevents things like hg status from being perceived as instantaneous. It also slows down the test suite by literally minutes on some machines. And packaging Mercurial for multiple platforms takes a lot of effort because this isn’t a solved problem for the Python ecosystem. While there are a lot of good things about Mercurial being implemented in Python (and I don’t want to be perceived as advocating for porting Mercurial away from Python - because I don’t), it feels like Mercurial is constantly testing the limits of Python on a few fronts. This isn’t good for Mercurial. It isn’t a good sign for the longevity of Python if it can’t “support” large, mature projects like Mercurial.

              So a big source of inspiration was… frustration, specifically around how it felt that Python was limiting Mercurial’s potential.

              Another source of inspiration was my general attitude of not accepting the status quo. I’m always thinking about why things have to be the way they are. A large part of PyOxidizer was me questioning “why can’t I have a single file executable that runs Python?” I helped maintain Firefox’s build system for several years and knew enough about the low-level build system bits to understand what the general problems with binary distribution were. I knew enough about CPython’s internals (from developing Python C extensions) that I had confidence to dive really deep to be able to implement PyOxidizer. I felt like I knew enough about some relatively esoteric systems (notably build systems and CPython internals) to realize that others who had ventured into the Python application packaging space were attempting to solve this problem within constraints given to them by how CPython is commonly compiled. I realized I possessed the knowledge to change the underlying system and to coerce it to do what I wanted (namely produce self-contained executables containing Python). In other words, I changed the rules and created a new opportunity for PyOxidizer to do something that nobody had ever done in the public domain (Google has produced self-contained Python executables for years using similar but sufficiently different techniques).

              If you want to learn more about the technical journey, I suggest reading https://gregoryszorc.com/blog/2018/12/18/distributing-standalone-python-applications/.

              As for similar systems, other than WASM, I’m not aware of any other “interpreted/scripting languages” (like Python) that have solutions that do what PyOxidizer does. I’m sure they exist. But Python is the only language in this space that I’ve used significantly in the past ~8 years, so I’m not sure what other languages have implemented. Obviously you can do these single-executable tricks in compiled languages like Go and Rust.

              My long term goal is to make Python application distribution painless. A secondary goal is to make Python applications developed with PyOxidizer “better” than normal Python applications. This can be done through features like an integrated command server and providing Rust and Python access to each other’s capabilities. I want Python application maintainers to focus on building great applications, not worry about packaging/distribution.

              PyPy could theoretically be supported if someone produces a Python distribution conforming to the format documented by https://github.com/indygreg/python-build-standalone. In theory, PyOxidizer is agnostic about the flavor of the Python distribution it uses as long as that Python distribution provides object files that can be relinked and implements aspects of Python’s C API for interpreter control. There would likely need to be a few changes in PyOxidizer, such as to how the extension modules C array is defined. There’s probably a way to express this in python-build-standalone’s distribution descriptor JSON document such that it can be abstracted across distributions. I would very much like to support PyPy and I envision I will cross this bridge eventually. I think there are more important features to work on first, such as compiling C extensions and actually making distribution easy.
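
              To sketch what I mean about abstracting the extension modules piece via the descriptor: something along these lines could work. This is purely illustrative; the field names below are hypothetical and are not python-build-standalone’s actual schema.

                  import json

                  # Hypothetical sketch: "extension_modules", "name" and "init_function"
                  # are illustrative field names, not the real descriptor schema.
                  def extension_init_entries(descriptor_path):
                      """Derive (module name, C init function) pairs that an embedding
                      tool would bake into the interpreter's built-in extension module
                      table for the given distribution."""
                      with open(descriptor_path, "r", encoding="utf-8") as f:
                          descriptor = json.load(f)

                      entries = []
                      for ext in descriptor.get("extension_modules", []):
                          name = ext["name"]  # e.g. "_sqlite3"
                          init_fn = ext.get("init_function", "PyInit_" + name)
                          entries.append((name, init_fn))
                      return entries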

              1. 1

                Thank you for such a detailed response.

                I love that you have standardized an interface contract for Python runtimes.

                This looks like it could give organizations confidence in their deployment runtime while no longer being tied to specific distros, and let them start using more niche and esoteric libraries that might be difficult to install. This is a form of containerization for Python applications.

                What I am really interested in, because you own both sides of the system, is streamlining the bidirectional call boundary. Having control over the shell that runs on the host, the Rust wrapper, and the VM, there is an opportunity to short-circuit some of the expense of calling into C, or of how data is laid out in memory. In a quick ripgrep through the code, I couldn’t find any reference to cffi. Do you plan on supporting cffi or is it already handled? I am really curious to learn what your integration plans look like. Great work.

          2. 1

            That’s an awful lot of heavy lifting you’re asking from a tool maintainer.

            And, I mean, ‘brew/apt/yum install rust’ isn’t generally a particularly big ask of you, the end user :)

            1. 4

              But, but, this is solving pip install x … don’t you think at least the irony should be acknowledged?

          1. 11

            As the developer of a version control tool (Mercurial) and a (former) maintainer of a large build system (Firefox), I too have often asked myself how - not if - version control and build systems will merge - or at least become much more tightly integrated. And I also throw filesystems and distributed execution / CI into the mix for good measure because version control is a specialized filesystem and CI tends to evolve into a distributed build system. There’s a lot of inefficiency at scale due to the strong barriers we tend to erect between these components. I think there are compelling opportunities for novel advances in this space. How things will actually materialize, I’m not sure.

            1. 1

              I agree also; there is quite a bit of opportunity for innovation around this. I am thinking from a slightly different angle.

              There is an opportunity for creating a temporally aware file system, revision control, emulation environment, and build system, all linked by the same timeline. A snapshot, yes, but across all of these things.

              Take a look at https://sirix.io/

              Imagining a bit, but it could serve as a ‘file system’ for the emulation environment. It could also enhance the version control system, where the versioning/snapshotting happens at the sirix.io level.

              While I am not working with 100s of developers these days, I am noticing that environment/build control is much easier in our Android development world, because we control a) the emulator and b) the build environment.

              So it is very reproducible: same OS image, same emulator (KVM or Hyper-V), same build through Gradle (we also do not allow wildcards for package versions, only exact versions).

              Working on a backend with, say, C++ (or another language that relies on OS-provided includes/libs) is a very different story: very difficult to replicate without introducing an ‘emulator’ where we can control a standardized OS image for the build/test cycle.

            1. 14

              The post mortem for this should be a good read. But first, let’s hope there’s a resolution soon, because this is a highly disruptive issue for Firefox users and could lead to users abandoning Firefox over it. It’s also a rough situation for the unfortunate Mozilla employees who have to deal with this going into the weekend.

              FWIW one of the Firefox security team members who would be on my short list for “person in charge of renewing this certificate” is currently in the middle of a multi-week vacation. This is pure speculation on my part, but I wouldn’t be surprised if a contributing cause to this incident were that the renewal reminder emails for this certificate were going to the inbox of someone not checking their email while on vacation. But I suspect there wasn’t a single point of failure here because the people who manage these certificates at Mozilla are typically very on top of their game and are some of the best security people I’ve interacted with. I’m quite surprised this occurred and suspect there are multiple contributing causes. We’ll just have to wait for the post mortem to see.

              1. 7

                Also, it takes Mozilla 18-24 hours to push an emergency release (a “chemspill” in Mozilla parlance). So if a new binary needs to be pushed out to users, I wouldn’t expect one until around 00:00 UTC.

                1. 1

                  I don’t think this needs a new release. “just” a new cert, no? Keys aren’t hard-coded afair

              1. 7

                I agree with the premise of the post that Git doesn’t do a good job supporting monorepos. Assuming the scaling problem of large repositories will go away with time, there is still the issue of how clients should interact with a monorepo. E.g. clients often don’t need every file at a particular commit, or don’t want the full history of the repo or of the files being accessed. The feature support and UI for facilitating partial repo access are still horribly lacking.

                Git has the concept of a “sparse checkout” where only a subset of files in a commit are manifested in the working directory. This is a powerful feature for monorepos, as it allows clients to only interact with files relevant to the given operation. Unfortunately, the UI for sparse checkouts in Git is horrible: it requires writing out file patterns to the .git/info/sparse-checkout file and running a sequence of commands in just the right order for it to work. Practically nobody knows how to do this off the top of their head and anyone using sparse checkouts probably has the process abstracted away via a script. In contrast, Mercurial allows you to store a file in the repository containing the patterns that constitute a “sparse profile”. When you do a clone or update, you can specify the path to that file and Mercurial takes care of fetching it and expanding its patterns into the rules used to populate the repository history and working directory. This is vastly more intuitive for users than what Git provides for managing sparse checkouts. Not perfect, but much, much better. I encourage Git to steal this feature.
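
                For reference, the Git-side dance currently looks roughly like this (a sketch of the manual workflow, wrapped in Python purely for illustration; the exact steps can vary between Git versions):

                    import pathlib
                    import subprocess

                    def sparse_checkout(repo, patterns):
                        """Rough sketch of the manual sparse checkout dance: enable the
                        feature, write patterns to .git/info/sparse-checkout, then re-read
                        the index into the working directory. The order matters, which is
                        a big part of the usability problem."""
                        def git(*args):
                            subprocess.run(["git", "-C", repo] + list(args), check=True)

                        git("config", "core.sparseCheckout", "true")
                        sparse_file = pathlib.Path(repo, ".git", "info", "sparse-checkout")
                        sparse_file.write_text("\n".join(patterns) + "\n")
                        git("read-tree", "-mu", "HEAD")

                    # e.g. sparse_checkout("/path/to/clone", ["tools/", "docs/"])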

                Another monorepo feature that is yet unexplored in both Git and Mercurial is partial repository branches and tags. Branches and tags are global to the entire repository. But for monorepos comprised of multiple projects, global branches and tags may not be appropriate. People may want branches and tags that only apply to a subset of the repo. If nothing else this can cut down on “symbol pollution.” This isn’t a radical idea, as per-project branches and tags are supported by version control systems like Subversion and CVS.

                1. 5

                  I agree with you; Git famously was not designed for monorepos.

                  Also agreed: sub-tree checkouts and sub-tree history would be essential for monorepos. Nobody wants to see every file from every obscure project in their repo clones; it would eat up your attention.

                  I would also like to be able to store giant asset files in the repo (without the git-lfs hack), more consistent commands, and some sort of API that compilers and build systems can use to integrate with revision control, etc. Right now, it seems we have more and more tooling on top of Git to make it work in all these conditions, while Git was designed to manage a single text-file-based repo, namely the Linux kernel.

                1. 3

                  It’s worth reminding everyone that PGP keys have expiration times and can be revoked. So if you put PGP signatures into Git, it is possible that signature verification works today but not tomorrow. (GPG and other tools will refuse to verify signatures if they belong to expired or revoked keys.) http://karl.kornel.us/2017/10/welp-there-go-my-git-signatures/ goes into more detail on the problem and http://mikegerwitz.com/papers/git-horror-story is always a terrific read. In my opinion, this is a very nasty limitation and therefore using PGP for signatures in a VCS is extremely brittle and should be done with extreme care.

                  In order to solve this general problem of not being able to validate signatures in the future, the VCS needs to manage keys for you (so you always have access to the key). And you probably don’t want to use PGP because tools enforce expiration and revocation. Key management is of course a hard problem and increases the complexity of the VCS. For what it’s worth, the Monotone VCS has built-in support for managing certificates (which are backed by RSA keys). See https://www.monotone.ca/docs/Certificates.html. https://www.mercurial-scm.org/wiki/CommitSigningPlan captures a lot of context about this general problem.
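
                  As a quick illustration of how verification degrades over time, a sketch like the following lists commits whose signatures do not verify as good right now; note it lumps expired and revoked keys together with bad signatures, missing keys, and unsigned commits:

                      import subprocess

                      def unverifiable_signed_history(repo):
                          """Return (commit, status) pairs whose GPG signature does not
                          verify as good *today*. git's %G? placeholder prints 'G' for a
                          good signature; other letters cover bad signatures, expired or
                          revoked keys, missing keys, and unsigned commits."""
                          out = subprocess.run(
                              ["git", "-C", repo, "log", "--format=%H %G?"],
                              check=True, capture_output=True, text=True,
                          ).stdout
                          return [tuple(line.split()) for line in out.splitlines()
                                  if not line.endswith(" G")]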

                  1. 30

                    A generic solution that doesn’t require Docker is a tool/library called eatmydata: https://github.com/stewartsmith/libeatmydata.

                    Using LD_PRELOAD or a wrapper executable, libeatmydata essentially turns fsync() and other APIs that try to ensure durability into no-ops. If you don’t care about data durability, you can aggressively enable eatmydata to get a substantial speedup for workloads that call into these expensive APIs.
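
                    If you want a feel for how much those durability calls cost (and therefore what turning them into no-ops can buy), a crude micro-benchmark along these lines works; the numbers vary wildly by filesystem and drive:

                        import os
                        import tempfile
                        import time

                        def write_files(n, durable, directory="."):
                            """Write n small files, optionally fsync()ing each one, and return
                            the elapsed seconds. The gap between the two modes is roughly what
                            eatmydata buys you. Run it on the filesystem you actually care
                            about (note that /tmp may be tmpfs on some systems)."""
                            with tempfile.TemporaryDirectory(dir=directory) as d:
                                start = time.perf_counter()
                                for i in range(n):
                                    with open(os.path.join(d, "%d.dat" % i), "wb") as f:
                                        f.write(b"x" * 4096)
                                        if durable:
                                            f.flush()
                                            os.fsync(f.fileno())
                                return time.perf_counter() - start

                        print("no fsync:   %.3fs" % write_files(500, durable=False))
                        print("with fsync: %.3fs" % write_files(500, durable=True))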

                    1. 8

                      eatmydata is also useful when testing other applications that [ab]use fsync, including build systems.

                      fsync() is also the reason why people believe that ramfs is just/always faster than drives. Very often the kernel does a good job of caching data in memory and drives perform as well as ramfs… once you disable fsync.

                      1. 1

                        At least on Linux (where I’ve measured it), tmpfs is in fact significantly faster than persistent filesystems even for cached (purely in-memory) operations.

                        Whether your application is filesystem-intensive enough for it to matter is another question.

                        1. 1

                          https://superuser.com/a/227714/122260

                          …did you disable fsync() in your benchmarks?

                          1. 1

                            My measurements were taken with purpose-built, hand-written microbenchmarks. There was no fsync to “disable”.

                    1. 3

                      This is a fantastic interview. Pablo Santos articulates some features/advantages that Plastic SCM has compared to other tools. His vision for the role and future of version control is eerily similar to what is floating around in my head (I’m a contributor to Mercurial). He even talks a bit about the need for (version control) tools to be fast, get out of the way, and provide better features in order to avoid cognitive overhead, which impacts productivity and velocity. This is a must-listen for anyone who works on version control tools or cares about the larger developer productivity space.

                      1. 2

                        How realistic do you think it is for Git to evolve to support big files?

                        As I understand it the problem boils down to three issues:

                        1. Every file is hashed completely before being stored as a Blob
                        2. A git checkout checks out the whole tree, it’s not possible to checkout a subset
                        3. There is no lazy-fetching of the Git objects. It’s possible to do a shallow fetch but then Git operations are limited.

                        (1) means that even a 1-byte change will create a whole new Blob. I think this could be improved by introducing a new “BlobList” type of object that contains a list of Blobs; an update would then only rewrite the Blob(s) covering the changed bytes rather than the whole file. Blob chunking heuristics could then be developed and used at insertion time (a toy sketch of this is below).

                        (2) means re-thinking a lot of the CLI operations to work on a subset

                        (3) would require re-designing the object database to lazily fetch objects from upstream when they are missing
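
                        To make the chunking idea in (1) concrete, here is a toy sketch of content-defined chunking with a rolling hash. It is definitely not a proposal for Git’s actual on-disk format; the property that matters is that a small edit only re-chunks the bytes around it, so most Blobs in a big file can be reused:

                            def chunk(data, window=48, mask=0x3FF, base=257, mod=1 << 31):
                                """Toy content-defined chunking: roll a Rabin-Karp style hash over
                                the last `window` bytes and cut wherever it hits a boundary condition
                                (average chunk ~1 KiB for mask=0x3FF). A real "BlobList" would use
                                something sturdier (e.g. FastCDC), but the resynchronization property
                                is the same."""
                                if len(data) <= window:
                                    return [bytes(data)]
                                drop = pow(base, window - 1, mod)  # weight of the byte leaving the window
                                h = 0
                                for b in data[:window]:            # hash of the first full window
                                    h = (h * base + b) % mod
                                chunks, start = [], 0
                                for i in range(window, len(data)):
                                    if (h & mask) == mask and i - start >= window:
                                        chunks.append(bytes(data[start:i]))
                                        start = i
                                    h = ((h - data[i - window] * drop) * base + data[i]) % mod
                                chunks.append(bytes(data[start:]))
                                return chunks

                            # b"".join(chunk(blob)) == blob always holds; only the chunks near an
                            # edit change, so unchanged chunks can be stored and transferred once.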

                        1. 1

                          I couldn’t agree more. I always scratched my head wondering why version control, being the “operating system” of software development, did not lead the Agile, DevOps… you name it, modern software development “movement”. In a way, I see that the dominance of Git raised the bar so high that innovation was not required for a long time. On the other hand, being so generalist makes it difficult for Git’s roadmap to cover all of the most innovative edge cases, right?

                          1. 1

                            Hey, Greg, you should get a hat!

                          1. 6

                            Nice article!

                            • Regarding startup time, I heard about the Mercurial command server a long time ago, so this was a good reminder. The idea of the “coprocess protocol” in Dev Log #8: Shell Protocol Designs is to make it easy for EVERY Unix binary to be a command server, no matter what language it’s written in.

                              I have a cool demo that uses some file descriptor tricks to accomplish this with minimal modifications to the code. It won’t work on Windows though, which may be an issue for some tools like Mercurial.

                            • I also embed the Python interpreter and ship it with Oil, which reduces startup time. sys.path has a single entry for Python modules, and every C module is statically linked (a quick sanity check for this sort of build is sketched after this list). I thought about making this reusable, but it’s a pretty messy process.

                              Rewriting Python’s Build System From Scratch

                              Dev Log #7: Hollowing Out the Python Interpreter

                            • Regarding function call overhead, attribute access, and object creation, the idea behind “OPy” is to address those things for Oil, although I haven’t gotten very far along with that work :-)

                              http://www.oilshell.org/blog/tags.html?tag=opy#opy
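
                            Since the second bullet mentions the single-entry sys.path and statically linked C modules, here is a tiny sanity check for that kind of embedded build (just a sketch, Python 3 syntax): it reports how many sys.path entries exist and whether already-imported modules are compiled into the binary or loaded from files.

                                import sys

                                def report_embedding():
                                    """Report the sys.path entries and whether already-imported
                                    modules are compiled into the interpreter or loaded from disk."""
                                    print("sys.path (%d entries): %r" % (len(sys.path), sys.path))
                                    builtin = set(sys.builtin_module_names)
                                    for name, mod in sorted(sys.modules.items()):
                                        origin = ("built-in" if name in builtin
                                                  else getattr(mod, "__file__", None) or "frozen/other")
                                        print("%-24s %s" % (name, origin))

                                report_embedding()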

                            I guess the bottom line is that we’re both stretching Python beyond its limits :-/ It’s a nice and productive language so that tends to happen.

                            1. 1

                              Thank you for the context! It is… eerie that we seem to have gone down similar rabbit holes with embedding/distributing Python!

                              You may be interested in https://github.com/indygreg/python-build-standalone, which is the sister project to PyOxidizer and aims to produce highly portable Python distributions. There’s still a ways to go. But you may find it useful as a mechanism to produce CPython build artifacts (and their dependencies) in such a way that they can easily be recombined into a larger binary, such as Oil.

                              1. 1

                                The general coprocess protocol looks really interesting. It might be nice to surface it somewhere more visible and trackable… GitHub wiki pages are notoriously bad for keeping up with changes to docs like this.