1. 19

    Former Mozilla employee here.

    This wiki page is ancient. There was a strong anti-SQLite phase several years ago. I’m not sure where things are today, but by the time I left, people had warmed back up to SQLite a bit, although many points in the wiki page are still accurate.

    I think a large part of the anti-SQLite mindset in Firefox was driven by a combination of not understanding how to use SQLite optimally and the difficulty of actually achieving an optimal configuration across all of its consumers. In its default configuration, SQLite will fsync() after every transaction, and SQLite implicitly creates a transaction for every statement performing a mutation. This can suck performance like no other. You have to issue a bunch of PRAGMA statements when opening a database to get things to behave a bit better. This sacrifices durability, but many database writes aren’t critical and can be safely lost. e.g. do you care if the visit count for a page in your browser history is off by 1?
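
    For illustration, here’s roughly what that kind of tuning looks like through Python’s sqlite3 module. This is only a sketch, not Firefox’s actual configuration; the table and PRAGMA values are illustrative:

        import sqlite3

        # Relax durability so SQLite doesn't fsync() on every commit (illustrative values).
        conn = sqlite3.connect("example.db")
        conn.execute("PRAGMA journal_mode=WAL")    # write-ahead logging
        conn.execute("PRAGMA synchronous=NORMAL")  # no fsync() at every transaction commit

        conn.execute("CREATE TABLE IF NOT EXISTS visits (url TEXT PRIMARY KEY, count INTEGER)")

        # Batch mutations into one explicit transaction instead of paying an
        # implicit transaction (and potentially an fsync) per statement.
        with conn:
            for i in range(1000):
                conn.execute("INSERT INTO visits VALUES (?, 1)", ("https://example.com/%d" % i,))
        conn.close()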

    The performance impact of continuous SQLite writes with high durability guarantees is demonstrated in my favorite commit which I authored while at Mozilla: https://hg.mozilla.org/mozilla-central/rev/6b064f9f6e10b25bf2edc916b1b9a8b78cff0465. tl;dr a test harness loading thousands of pages and triggering writes to the “places” database (which holds site visit information) was incurring ~50 GB of I/O during a single test run.

    Compounding matters, Firefox’s code historically performed I/O on the main thread. So if a read()/write() was performed on the main thread (potentially via SQLite APIs), the UI would freeze until it was serviced. Hopefully I don’t have to explain the problem with this model.

    One of my contributions to Firefox was SQLite.jsm (https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Sqlite.jsm), a JavaScript module providing access to SQLite APIs with batteries included so consumers would get reasonable behavior by default. The introduction of this module and subsequent adoption helped fix a lot of the performance issues attributed to SQLite.

    IMO the biggest issue with Firefox’s data storage was that there were literally dozens of components, each rolling its own storage format and persistence model. These ranged from SQLite databases to JSON files, plain text files, bespoke file formats, etc. This represented 10+ years of accumulated technical debt. There have been various efforts to unify storage, but I’m unsure where they are these days.

    1. 12

      FWIW GitLab uses a gRPC based solution (Gitaly) for all Git repository interactions, including Git wire protocol traffic (the Git protocol data is treated as a raw stream). See the protocol definition at https://gitlab.com/gitlab-org/gitaly/tree/master/proto. This allows GitLab to abstract the storage of Git repositories behind a gRPC interface.

      Fun fact: this is how Heptapod (https://foss.heptapod.net/heptapod/heptapod) - a fork of GitLab that supports Mercurial - works: they’ve taught Mercurial to answer the gRPC queries that Gitaly defines. GitLab issues gRPC requests and they are answered by Mercurial instead of Git. Most functionality “just works” and doesn’t care that Mercurial - not Git - is providing data. Abstractions and interfaces can be very powerful…

      1. 2

        That’s interesting! I’ve looked at doing remote Git operations by creating an abstraction over the object store. The benefit is that the interface is much smaller than what you linked to. I guess the downside is higher latency for operations that need several round-trips. Do you know if that has been explored?

        1. 5

          GitLab’s/Gitaly’s RPC protocol is massive. My understanding is they pretty much invent a specialized RPC method for every use case they have so they can avoid fragmented transactions, excessive round trips, etc. The RPC is completely internal to GitLab and doesn’t need to be backwards or forwards compatible over a long time horizon. So they can get away with high amounts of churn and experimentation. It’s a terrific solution for an internal RPC. That approach to protocol design won’t work for Git itself, however.

      1. 9

        Wow, I really sympathize with all of this. I can see how poor a fit Python 3 is for Mercurial, having ported Oil from Python 2 to 3 and then back to 2 again, mainly because of strings. (That was early in the project’s life and it didn’t have users, so it was easy.)

        I agree with all this:

        the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications

        We effectively sludged through mud for several years only to wind up in a state that feels strictly worse than where we started

        I think we talked about this before, and you mentioned PyOxidizer, which I again saw in the post.

        But after reading about all this exhausting effort, I’m still left thinking that it would have been less effort, and you would have a better result, if Mercurial had bundled a Python interpreter.

        I feel like that’s 10x less work than what I read about in the post, it would have taken 10x less time, and you would have ended up with the better model of UTF-8 strings.

        It doesn’t matter if distros don’t package Python 2 – because it can be in the Mercurial tarball. (People keep invoking “security” but I think that’s a naive view of security. If someone asks I’ll dig up my previous comment on that. Also, I don’t think the Python 2.7 codebase is that hard to maintain, and you can get rid of > 50% of it.)

        I guess it doesn’t matter now, but honestly reading the post confirmed all the feelings I had and I personally would have abandoned such an effort years in advance. Despite my love for Python, and using it for decades, it’s not a stable enough abstraction for certain applications, including a shell and a version control system (e.g. ask me about my EINTR backports). To be fair, very few languages are suitable for writing a POSIX-compatible shell – e.g. I claim Go can’t and won’t be used for this task because of its threaded runtime.

        In fact I used to be a Mercurial user and I remember getting ImportError because of some “fighting” over PYTHONPATH between distros and the numerous layers of package managers. I still use distutils and tarballs for important applications because they’re stable. I use virtualenv reluctantly, and try to avoid pip / setuptools (and I roll my eyes whenever I hear about a Python package manager that adds layers rather than rethinking them). The stack is really tremendously bad and unstable, and it makes software on top of it unstable.

        It’s not something I want my VCS touching. So avoiding all of that and the resulting increase in stability is a huge reason to embed the Python interpreter IMO. (BTW Oil is getting rid of the Python interpreter altogether, but Python was hugely helpful in figuring out the algorithms, data structures, and architecture. That’s what I like Python for.)

        1. 7

          Thank you for the thoughtful comments.

          The subject of bundling a Python interpreter is complex. I’m a proponent of bundling the Python interpreter with Mercurial (or most Python applications for that matter) because it reduces the surface area of variability. If that were the exclusive mode of distribution, we could distribute the latest, greatest Python interpreter and drop support for older versions quickly after new versions are released. Wouldn’t that be nice!

          Of course, distributing your own interpreter means now you are responsible for shipping security updates in Python (or potentially any of its dependencies), effectively meaning you need to be prepared to release at any time.

          Then there’s the pesky problem of actually executing on application distribution. It’s a hard problem space that requires resources. It has historically not been prioritized outside of Windows in the Mercurial project because of a lack of time/resources/expertise. (Windows installers likely exist because the alternative is that nobody uses your software, since they can’t install it!)

          There’s also the problem of Linux distributions, which treat Python distribution fundamentally differently from how they treat compiled languages. Distributions would insist on unbundling the Python interpreter from Mercurial, as well as 3rd party libraries which they have alternate means of distributing. This creates a myriad of problems and slows everyone down. I recommend reading https://fy.blackhats.net.au/blog/html/2019/12/18/packaging_vendoring_and_how_it_s_changing.html and generally agree with its premise that packaging should revolve more around applications, as that would be more user friendly for application developers and end-users alike.

          1. 2

            I agree you might get some eyebrows raised from distro maintainers, but you wouldn’t be alone. I thought Blender also embedded Python but maybe I’m wrong.

            Based on my limited experience with Oil, I think it would be a minor issue but not a blocking one. CPython is plain C code without dependencies, so it doesn’t cause many problems for distros.

            https://github.com/oilshell/oil/wiki/Oil-Deployments

            To me, the Windows example just shows that it’s possible and a known amount of work. It sounds like much less work than the years-long migration that you described.

            The only reasons I see for migrating are “memes”, misplaced social pressure, and vague fears about security. They don’t seem particularly solid, especially when weighed against the downsides of migrating, like the possibility of data corruption in a VCS. I understand that it was the “accepted” thing to do, but I’m explicitly questioning the accepted wisdom.

            I agree there’s room for a CPython post-mortem. But what you described in your post is nothing less than a disaster that’s not over yet! So that might warrant a Mercurial post-mortem as well! I hope that doesn’t come off as rude because it’s not meant to be. I really appreciate you writing this up and I think it got a lot of deserved attention, including from the CPython core team.

            1. 2

              Another problem you are likely to encounter is the sheer size of your distributable artifact. As far as I know, there’s no good way to eliminate dead code in Python, so you’ll need to ship all of the code that is imported (including transitively), even in conditional imports. Additionally, the interpreter and many libraries (including standard libraries) depend on shared object libraries, so do you also include those in the bundle? I wouldn’t be surprised at all if any nontrivial application bundle was gigabytes in size, even compressed.

              1. 2

                No, that’s not a problem. Oil ships 1.1 MB of code from the CPython binary (under GCC, 1.0 MB under Clang).

                http://www.oilshell.org/release/0.7.pre11/benchmarks.wwz/ovm-build/

                I’m pretty sure that’s less than the size of a “hello world” HTTP server for languages like Rust or Go.

                I removed dead code with the process described in Dev Log #7: Hollowing Out the Python Interpreter, but even done naively it was never more than 1.5 MB, so you don’t even need to do that.

                I imagine Mercurial is pretty much like a shell – it reads and writes the file system, and does tons of byte string manipulation.

                In Python you can generate a module_init table to statically link the extension modules you care about. It’s never going to reach gigabytes in any case, and if it did, the equivalent C program would be gigabytes too.
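
                As a rough illustration of what “statically linked” ends up meaning at the Python level (the inittab itself is defined in C), you can check whether a given extension module was compiled into the binary or is loaded from a shared object:

                    import importlib.util
                    import sys

                    # Modules compiled into the interpreter binary show up here...
                    print(sorted(sys.builtin_module_names))

                    # ...and report 'built-in' as their origin instead of a .so path.
                    for name in ("math", "zlib", "_ssl"):
                        spec = importlib.util.find_spec(name)
                        print(name, "->", spec.origin if spec else "not available")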

                There are some downsides to what I did, for sure. But what I’m comparing them to is the multiple person-years of work described in the post, and the pages full of downsides.

                It’s not only doing all that migration work – it’s that the end result is actually worse. He says he anticipates “a long tail of bugs for years” and I would suspect the same. With a bundled interpreter, the source code would also be in much better shape, because it would only need to support one Python version. The effort from maintainers could have gone elsewhere, e.g. to improving the program’s functionality and fixing other bugs.

                I’m sorry they were in this situation. It sounds like nothing less than a disaster, and when you’re faced with a disaster, that should motivate unconventional solutions (which this really isn’t, because plenty of apps embed the Python interpreter).

                1. 2

                  Sorry, my dead code elimination comment was directed at Python libraries, not the Python interpreter. I believe our 1-year-old Python application bundle (using pex) is on the order of 500 MB compressed, and that’s not including the CPython interpreter, standard libraries, etc; just the application code and third-party dependencies. Most of that is certainly dead code in third-party dependencies. I’m assuming more mature applications are quite a lot larger.

                  I completely agree with your analysis of the unfortunate situation Mercurial found itself in.

            2. 2

              I found the previous thread where I commented on security:

              https://lobste.rs/s/3vkmm8/why_i_can_t_remove_python_2_from_my_systems#c_uxwpzg

              The tl;dr is that I’m wondering why bundling/embedding wasn’t considered as the FIRST solution, or at least after 2 of the 10 years of struggles with Python 3.

              My suspicion is that it’s because it feels “wrong” somehow, and because there was some social pressure to abandon Python 2.

              But as far as Mercurial is concerned, I think that solution is better in literally every dimension of engineering – less short term effort, less long term effort, more stable result, etc.

              1. 5

                The tl;dr is that I’m wondering why bundling/embedding wasn’t considered as the FIRST solution, or at least after 2 of the 10 years of struggles with Python 3.

                Wouldn’t desire to support third-party extensions (written in Python) make this problematic? You’d essentially be creating a Mercurial dialect of Python that drifts from the mainline Python everyone knows over time. Oil’s case is different since Python isn’t part of the exposed API surface.

                1. 2

                  Yes that’s a good point, I forgot Mercurial had Python plugins.

                  But I would say that you’re breaking the plugins anyway by moving from Python 2 to 3. So that would be an opportunity to make it more language agnostic – and that even has the benefit that you could keep plugins in Python 2 while Mercurial uses Python 3!

                  I’m not sure exactly what the plugins do, but IPC and interchange formats are more robust and less prone to breakage. For example, I looked at pandoc recently and it gives you a big JSON structure to manipulate in any language rather than a Haskell API (which would have been a lot easier for them to code). I never used it, but I’ve seen a lot of systems like this. Git hooks also use textual formats.
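
                  A toy sketch of that interchange-format style (hypothetical; not how pandoc, Git hooks, or Mercurial actually do it): the host serializes its data to JSON, pipes it through a plugin written in any language, and reads the transformed JSON back:

                      import json
                      import subprocess
                      import sys

                      # Host side: hand the document to an external plugin process over a pipe.
                      def run_plugin(document, plugin_cmd):
                          proc = subprocess.run(plugin_cmd, input=json.dumps(document).encode(),
                                                stdout=subprocess.PIPE, check=True)
                          return json.loads(proc.stdout)

                      # Plugin side: read JSON on stdin, write transformed JSON on stdout.
                      if __name__ == "__main__" and "--as-plugin" in sys.argv:
                          doc = json.load(sys.stdin)
                          doc.setdefault("metadata", {})["touched_by_plugin"] = True
                          json.dump(doc, sys.stdout)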

                  I have a lot of experience with Python as a plugin language, and while I’d say it’s better than some alternatives, it’s not really a great fit and often ends up getting replaced with something else. The Python version is an issue, even though Python is more stable than many languages.

            1. 2

              PyOxidizer is a Rust application and requires Rust 1.33+ to be installed in order to build binaries.

              Hmm, it would have been nicer if PyOxidizer had been meta, i.e. if it itself had a version that was a self-contained, single-file executable, so that those of us who are not interested in installing yet another language toolchain on our computers could grumble less.

              1. 7

                I acknowledged this in the post:

                It is a bit unfortunate that I force users to install Rust before using PyOxidizer, but in my defense the target audience is technically savvy developers, bootstrapping Rust is easy, and PyOxidizer is young, so I think it is acceptable for now.

                I will almost certainly provide pre-built executables once the project is more mature. Thank you for the feedback.

                  1. 1

                    That’s great! Always on the lookout for good ways to distribute Python code to the end user. I generally deal with CLI programs, but I’ve created PySide based programs and programs using other toolkits. The other tools I’ve used (PyInstaller, cx_Freeze type things) tend to not do well with some frameworks. Hope this will deal with those too!

                    1. 1

                      I applaud you for your strategy and tactics! Wonderfully done. I was thinking of doing something similar for a different language. I will really have to deconstruct what you have done here.

                      What was your inspiration? Are there similar systems? What is your long term goal?

                      What would it take to support PyPy?

                      1. 11

                        Inspiration was a few things.

                        First, I’m a core contributor to Mercurial, which is a massive, systems-level (mostly) Python application. From a packaging and distribution standpoint, my personal belief is that Python hinders Mercurial. We can’t aggressively adopt modern Python versions, there’s a lot of variance in Python execution environments that create a long tail of bugs, etc. On the performance front, Python’s startup overhead is pretty poor. This prevents things like hg status from being perceived as instantaneous. It also slows down the test suite by literally minutes on some machines. And packaging Mercurial for multiple platforms takes a lot of effort because this isn’t a solved problem for the Python ecosystem. While there are a lot of good things about Mercurial being implemented in Python (and I don’t want to be perceived as advocating for porting Mercurial away from Python - because I don’t), it feels like Mercurial is constantly testing the limits of Python on a few fronts. This isn’t good for Mercurial. It isn’t a good sign for the longevity of Python if it can’t “support” large, mature projects like Mercurial.

                        So a big source of inspiration was… frustration, specifically around how it felt that Python was limiting Mercurial’s potential.

                        Another source of inspiration was my general attitude of not accepting the status quo. I’m always thinking about why things have to be the way they are. A large part of PyOxidizer was me questioning “why can’t I have a single file executable that runs Python?” I helped maintain Firefox’s build system for several years and knew enough about the low-level build system bits to understand what the general problems with binary distribution were. I knew enough about CPython’s internals (from developing Python C extensions) that I had confidence to dive really deep to be able to implement PyOxidizer. I felt like I knew enough about some relatively esoteric systems (notably build systems and CPython internals) to realize that others who had ventured into the Python application packaging space were attempting to solve this problem within constraints given to them by how CPython is commonly compiled. I realized I possessed the knowledge to change the underlying system and to coerce it to do what I wanted (namely produce self-contained executables containing Python). In other words, I changed the rules and created a new opportunity for PyOxidizer to do something that nobody had ever done in the public domain (Google has produced self-contained Python executables for years using similar but sufficiently different techniques).

                        If you want to learn more about the technical journey, I suggest reading https://gregoryszorc.com/blog/2018/12/18/distributing-standalone-python-applications/.

                        As for similar systems, other than WASM, I’m not aware of other “interpreted/scripting languages” (like Python) that have solutions that do what PyOxidizer does. I’m sure they exist. But Python is the only language in this language space that I’ve used significantly in the past ~8 years and I’m not sure what other languages have implemented. Obviously you can do these single executable tricks in compiled languages like Go and Rust.

                        My long term goal is to make Python application distribution painless. A secondary goal is to make Python applications developed with PyOxidizer “better” than normal Python applications. This can be done through features like an integrated command server and providing Rust and Python access to each other’s capabilities. I want Python application maintainers to focus on building great applications, not worry about packaging/distribution.

                        PyPy could theoretically be supported if someone produces a Python distribution conforming to the format documented by https://github.com/indygreg/python-build-standalone. In theory, PyOxidizer is agnostic about the flavor of the Python distribution it uses, as long as that Python distribution provides object files that can be relinked and implements the aspects of Python’s C API needed for interpreter control. There would likely need to be a few changes in PyOxidizer, such as how the extension modules C array is defined. There’s probably a way to express this in python-build-standalone’s distribution descriptor JSON document such that it can be abstracted across distributions. I would very much like to support PyPy and I envision I will cross this bridge eventually. I think there are more important features to work on first, such as compiling C extensions and actually making distribution easy.

                        1. 1

                          Thank you for such a detailed response.

                          I love that you have standardized an interface contract for Python runtimes.

                          This looks like it could give organizations confidence in their deployment runtime while no longer being tied to specific distros, and let them start using more niche and esoteric libraries that might be difficult to install. This is a form of containerization for Python applications.

                          What I am really interested in, because you own both sides of the system, is streamlining the bidirectional call boundary. Having control over the shell that runs on the host, the Rust wrapper, and the VM, there is an opportunity to short-circuit some of the expense of calling into C, or of how data is laid out in memory. In a quick ripgrep through the code, I couldn’t find any reference to cffi. Do you plan on supporting cffi or is it already handled? I am really curious to learn about what your integration plans look like. Great work.

                    2. 1

                      That’s an awful lot of heavy lifting you’re asking from a tool maintainer.

                      And, I mean, ‘brew/apt/yum install rust’ isn’t generally a particularly big ask of you, the end user :)

                      1. 4

                        But, but, this is solving pip install x … don’t you think at least the irony should be acknowledged?

                    1. 11

                      As the developer of a version control tool (Mercurial) and a (former) maintainer of a large build system (Firefox), I too have often asked myself how - not if - version control and build systems will merge - or at least become much more tightly integrated. And I also throw filesystems and distributed execution / CI into the mix for good measure because version control is a specialized filesystem and CI tends to evolve into a distributed build system. There’s a lot of inefficiency at scale due to the strong barriers we tend to erect between these components. I think there are compelling opportunities for novel advances in this space. How things will actually materialize, I’m not sure.

                      1. 1

                        I agree also, there is quite a bit of opportunity for innovation around this. I am thinking about it from a slightly different angle.

                        There is an opportunity to create a temporally aware file system, revision control, emulation environment, and build system, all linked by the same timeline. A snapshot, yes, but across all of these things.

                        take a look at https://sirix.io/

                        Imagining a bit, but it could serve as a “file system” for the emulation environment. It could also enhance a version control system, where the versioning/snapshotting happens at the sirix.io level.

                        While I am not working with 100s of developers these days, I am noticing that environment/build control is much easier in our Android development world – because we control a) the emulator and b) the build environment.

                        So it is very reproducible: same OS image, same emulator (KVM or Hyper-V), same build through Gradle (we also do not allow wildcards for package versions, only exact versions).

                        Working on a backend with, say, C++ (or another language that relies on OS-provided includes/libs) is a very different story – very difficult to replicate without introducing an “emulator” (where we can control a standardized OS image for the build/test cycle).

                      1. 14

                        The post mortem for this should be a good read. But first, let’s hope there’s a resolution soon, because this is a highly disruptive issue for Firefox users and could lead to users abandoning Firefox over it. It’s also a rough situation for the unfortunate Mozilla employees who have to deal with this going into the weekend.

                        FWIW one of the Firefox security team members who would be on my short list for “person in charge of renewing this certificate” is currently in the middle of a multi-week vacation. This is pure speculation on my part, but I wouldn’t be surprised if a contributing cause to this incident were that the renewal reminder emails for this certificate were going to the inbox of someone not checking their email while on vacation. But I suspect there wasn’t a single point of failure here because the people who manage these certificates at Mozilla are typically very on top of their game and are some of the best security people I’ve interacted with. I’m quite surprised this occurred and suspect there are multiple contributing causes. We’ll just have to wait for the post mortem to see.

                        1. 7

                          Also, it takes Mozilla 18-24 hours to push an emergency release (a “chemspill” in Mozilla parlance). So if a new binary needs to be pushed out to users, I wouldn’t expect one until around 00:00 UTC.

                          1. 1

                            I don’t think this needs a new release. “Just” a new cert, no? Keys aren’t hard-coded, AFAIR.

                        1. 7

                          I agree with the premise of the post that Git doesn’t do a good job supporting monorepos. Even assuming the scaling problems of large repositories go away with time, there is still the issue of how clients should interact with a monorepo. e.g. clients often don’t need every file at a particular commit, nor do they want the full history of the repo or of the files being accessed. The feature support and UI for facilitating partial repo access are still horribly lacking.

                          Git has the concept of a “sparse checkout” where only a subset of files in a commit are manifested in the working directory. This is a powerful feature for monorepos, as it allows clients to only interact with files relevant to the given operation. Unfortunately, the UI for sparse checkouts in Git is horrible: it requires writing out file patterns to the .git/info/sparse-checkout file and running a sequence of commands in just the right order for it to work. Practically nobody knows how to do this off the top of their head, and anyone using sparse checkouts probably has the process abstracted away via a script. In contrast, I will point out that Mercurial allows you to store a file in the repository containing the patterns that constitute the “sparse profile.” When you do a clone or update, you can specify the path to that file, and Mercurial takes care of fetching it and expanding its patterns into rules to populate the repository history and working directory. This is vastly more intuitive for users than what Git provides for managing sparse checkouts. Not perfect, but much, much better. I encourage Git to steal this feature.
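
                          For the curious, the Git side usually ends up wrapped in a script along these lines (a sketch of the classic workflow described above; the repo path and patterns are made-up examples):

                              import pathlib
                              import subprocess

                              def enable_sparse_checkout(repo, patterns):
                                  def git(*args):
                                      subprocess.run(["git", "-C", repo] + list(args), check=True)

                                  git("config", "core.sparseCheckout", "true")
                                  sparse = pathlib.Path(repo, ".git", "info", "sparse-checkout")
                                  sparse.write_text("\n".join(patterns) + "\n")
                                  # Re-apply the index so only matching paths remain in the working directory.
                                  git("read-tree", "-mu", "HEAD")

                              enable_sparse_checkout("/path/to/monorepo", ["/tools/", "/services/foo/"])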

                          Another monorepo feature that is yet unexplored in both Git and Mercurial is partial repository branches and tags. Branches and tags are global to the entire repository. But for monorepos comprised of multiple projects, global branches and tags may not be appropriate. People may want branches and tags that only apply to a subset of the repo. If nothing else this can cut down on “symbol pollution.” This isn’t a radical idea, as per-project branches and tags are supported by version control systems like Subversion and CVS.

                          1. 5

                            I agree with you; Git famously was not designed for monorepos.

                            Also agreed: sub-tree checkouts and sub-tree history would be essential for monorepos. Nobody wants to see every file from every obscure project in their repo clones; it would eat up your attention.

                            I would also like to be able to store giant asset files in the repo (without the git-lfs hack), more consistent commands, some sort of API so compilers and build systems can integrate with revision control, etc. Right now, it seems we have more and more tooling on top of Git to make it work in all these conditions, while Git was designed to manage a single repo of text files, namely the Linux kernel.

                          1. 3

                            It’s worth reminding everyone that PGP keys have expiration times and can be revoked. So if you put PGP signatures into Git, it is possible that signature verification works today but not tomorrow. (GPG and other tools will refuse to verify signatures if they belong to expired or revoked keys.) http://karl.kornel.us/2017/10/welp-there-go-my-git-signatures/ goes into more detail on the problem and http://mikegerwitz.com/papers/git-horror-story is always a terrific read. In my opinion, this is a very nasty limitation and therefore using PGP for signatures in a VCS is extremely brittle and should be done with extreme care.

                            In order to solve this general problem of not being able to validate signatures in the future, the VCS needs to manage keys for you (so you always have access to the key). And you probably don’t want to use PGP because tools enforce expiration and revocation. Key management is of course a hard problem and increases the complexity of the VCS. For what it’s worth, the Monotone VCS has built-in support for managing certificates (which are backed by RSA keys). See https://www.monotone.ca/docs/Certificates.html. https://www.mercurial-scm.org/wiki/CommitSigningPlan captures a lot of context about this general problem.

                            1. 30

                              A generic solution that doesn’t require Docker is a tool/library called eatmydata: https://github.com/stewartsmith/libeatmydata.

                              Using LD_PRELOAD or a wrapper executable, libeatmydata essentially turns fsync() and other APIs that try to ensure durability into no-ops. If you don’t care about data durability, you can aggressively enable eatmydata to get a substantial speedup for workloads that call into these expensive APIs.
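
                              To get a feel for why that matters, here is a rough microbenchmark sketch (absolute numbers will vary wildly by disk, filesystem, and OS caching):

                                  import os
                                  import tempfile
                                  import time

                                  def write_records(n, durable):
                                      fd, path = tempfile.mkstemp()
                                      start = time.perf_counter()
                                      try:
                                          for _ in range(n):
                                              os.write(fd, b"x" * 128)
                                              if durable:
                                                  os.fsync(fd)  # the call eatmydata turns into a no-op
                                      finally:
                                          os.close(fd)
                                          os.unlink(path)
                                      return time.perf_counter() - start

                                  print("without fsync:", write_records(1000, durable=False))
                                  print("with fsync:   ", write_records(1000, durable=True))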

                              1. 8

                                eatmydata is also useful when testing other applications that [ab]use fsync including build systems.

                                fsync() is also the reason why people believe that ramfs is just/always faster than drives. Very often the kernel does a good job of caching data in memory and drives perform as well as ramfs… once you disable fsync.

                                1. 1

                                  At least on Linux (where I’ve measured it), tmpfs is in fact significantly faster than persistent filesystems even for cached (purely in-memory) operations.

                                  Whether your application is filesystem-intensive enough for it to matter is another question.

                                  1. 1

                                    https://superuser.com/a/227714/122260

                                    …did you disable fsync() in your benchmarks?

                                    1. 1

                                      My measurements were taken with purpose-built, hand-written microbenchmarks. There was no fsync to “disable”.

                              1. 3

                                This is a fantastic interview. Pablo Santos articulates some features/advantages that Plastic SCM has compared to other tools. His vision for the role and future of version control is eerily similar to what is floating around in my head (I’m a contributor to Mercurial). He even talks a bit about the need for (version control) tools to be fast, get out of the way, and provide better features in order to avoid cognitive overhead, which impacts productivity and velocity. This is a must-listen for anyone who works on version control tools or cares about the larger developer productivity space.

                                1. 2

                                  How realistic do you think it is for Git to evolve to support big files?

                                  As I understand it the problem boils down to three issues:

                                  1. Every file is hashed completely before being stored as a Blob
                                  2. A git checkout checks out the whole tree; it’s not possible to check out a subset
                                  3. There is no lazy-fetching of the Git objects. It’s possible to do a shallow fetch but then Git operations are limited.

                                  (1) means that even a 1-byte change will create a whole new Blob. I think that it could be improved by introducing a new “BlobList” type of object that can contain a list of Blobs; then an update would only cost on the order of the size of the affected Blob. Blob chunking heuristics could then be developed and used at insertion time (a toy chunking sketch follows below).

                                  (2) means re-thinking a lot of the CLI operations to work on a subset

                                  (3) would require re-designing the database to lazily fetch objects from upstream when they are missing
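
                                  To make (1) concrete, here is a toy sketch of content-defined chunking (purely illustrative; not an actual Git proposal). Boundaries are chosen by a rolling hash over the last few bytes, so a small edit only changes the chunk(s) it touches, and only those would need new Blobs under a hypothetical “BlobList”:

                                      import hashlib
                                      import os

                                      def chunk(data, window=48, mask=0x3FF):
                                          chunks, start, h = [], 0, 0
                                          for i, b in enumerate(data):
                                              h += b
                                              if i >= window:
                                                  h -= data[i - window]  # rolling sum of the last `window` bytes
                                              if i - start + 1 >= window and (h & mask) == mask:
                                                  chunks.append(data[start:i + 1])
                                                  start = i + 1
                                          if start < len(data):
                                              chunks.append(data[start:])
                                          return chunks

                                      def blob_list(data):
                                          return [hashlib.sha1(c).hexdigest() for c in chunk(data)]

                                      # A one-byte insertion only disturbs the chunks around the edit.
                                      a = os.urandom(200_000)
                                      b = a[:100_000] + b"!" + a[100_000:]
                                      print(len(set(blob_list(a)) ^ set(blob_list(b))), "chunk hashes differ")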

                                  1. 1

                                    I couldn’t agree more. I always scratched my head wondering why version control, being the “operating system” of software development, did not lead the Agile, DevOps… you name it, modern software development “movement”. In a way, I see that the dominance of Git raised the bar so high that innovation was not required for a long time. On the other hand, being so generalist makes it difficult for Git’s roadmap to cover all the most innovative edge cases, right?

                                    1. 1

                                      Hey, Greg, you should get a hat!

                                    1. 6

                                      Nice article!

                                      • Regarding startup time, I heard about the Mercurial command server a long time ago, so this was a good reminder. The idea of the “coprocess protocol” in Dev Log #8: Shell Protocol Designs is to make it easy for EVERY Unix binary to be a command server, no matter what language it’s written in.

                                        I have a cool demo that uses some file descriptor tricks to accomplish this with minimal modifications to the code. It won’t work on Windows though, which may be an issue for some tools like Mercurial. (A generic toy sketch of the command-server idea is at the end of this comment.)

                                      • I also embed the Python interpreter and ship it with Oil, which reduces startup time. sys.path has a single entry for Python modules, and every C module is statically linked. I thought about making this reusable, but it’s a pretty messy process.

                                        Rewriting Python’s Build System From Scratch

                                        Dev Log #7: Hollowing Out the Python Interpreter

                                      • Regarding function call overhead, attribute access, and object creation, the idea behind “OPy” is to address those things for Oil, although I haven’t gotten very far along with that work :-)

                                        http://www.oilshell.org/blog/tags.html?tag=opy#opy

                                      I guess the bottom line is that we’re both stretching Python beyond its limits :-/ It’s a nice and productive language so that tends to happen.
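
                                      For anyone unfamiliar with the command-server idea mentioned above, here is a generic toy sketch (not Mercurial’s or Oil’s actual protocol): keep one interpreter process alive and feed it commands over a pipe, so the interpreter startup cost is paid only once:

                                          import subprocess
                                          import sys
                                          import textwrap

                                          # Code for the long-lived server process; dedent() strips the
                                          # indentation of this example before handing it to `python -c`.
                                          SERVER = textwrap.dedent(r'''
                                              import sys
                                              for line in sys.stdin:  # one request per line
                                                  cmd = line.strip()
                                                  if cmd == "quit":
                                                      break
                                                  # a real server would dispatch to command implementations here
                                                  sys.stdout.write("ran: " + cmd + "\n")
                                                  sys.stdout.flush()
                                          ''')

                                          proc = subprocess.Popen([sys.executable, "-c", SERVER],
                                                                  stdin=subprocess.PIPE,
                                                                  stdout=subprocess.PIPE, text=True)
                                          for cmd in ["status", "log -l 5", "quit"]:
                                              proc.stdin.write(cmd + "\n")
                                              proc.stdin.flush()
                                              if cmd != "quit":
                                                  print(proc.stdout.readline(), end="")
                                          proc.wait()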

                                      1. 1

                                        Thank you for the context! It is… eerie that we seem to have gone down similar rabbit holes with embedding/distributing Python!

                                        You may be interested in https://github.com/indygreg/python-build-standalone, which is the sister project to PyOxidizer and aims to produce highly portable Python distributions. There’s still a ways to go. But you may find it useful as a mechanism to produce CPython build artifacts (and their dependencies) in such a way that they can easily be recombined into a larger binary, such as Oil.

                                        1. 1

                                          The general coprocess protocol looks really interesting. It might be nice to surface it somewhere more visible and trackable… GitHub wiki pages are notoriously bad for keeping up with changes to docs like this.