1. 17
  1.  

  2. 11

    As the developer of a version control tool (Mercurial) and a (former) maintainer of a large build system (Firefox), I too have often asked myself how - not if - version control and build systems will merge - or at least become much more tightly integrated. And I also throw filesystems and distributed execution / CI into the mix for good measure because version control is a specialized filesystem and CI tends to evolve into a distributed build system. There’s a lot of inefficiency at scale due to the strong barriers we tend to erect between these components. I think there are compelling opportunities for novel advances in this space. How things will actually materialize, I’m not sure.

    1. 1

      I also agree; there is quite a bit of opportunity for innovation here. I am thinking about it from a slightly different angle.

      There is an opportunity to create a temporally aware file system, revision control, emulation environment, and build system, all linked by the same timeline. A snapshot, yes, but across all of these things.

      Take a look at https://sirix.io/

      Imagining a bit, but it could serve as a ‘file system’ for the emulation environment. It could also enhance a version control system, where the versioning/snapshotting happens at the sirix.io level.

      While I am not working with hundreds of developers these days, I am noticing that environment/build control is much easier in our Android development world, because we control (a) the emulator and (b) the build environment.

      So it is very reproducible: same OS image, same emulator (KVM or Hyper-V), same build through gradle (we also do not allow wildcards for package versions, only exact versions).

      Working on a backend in, say, C++ (or another language that relies on OS-provided includes/libs) is a very different story: it is very difficult to replicate without introducing an ‘emulator’ (where we can control a standardized OS image for the build/test cycle).

    2. 8

      This has happened in “big commercial software development” but it’s not clear (to me) how it would work in open source:

      https://en.wikipedia.org/wiki/Rational_ClearCase

      A distinguishing feature of ClearCase is the MultiVersion File System (MVFS), a proprietary networked filesystem which can mount VOBs as a virtual file system through a dynamic view, selecting a consistent set of versions and enabling the production of derived objects.

      Caching Function Calls Using Precise Dependencies (about the Vesta SCM / build system): http://www.vestasys.org/doc/pubs/pldi-00-04-20.pdf

      We consider the problem of implementing a pure functional language in which some function calls are expected to be extremely costly. This problem arises in the context of the Vesta software configuration management system, a system for managing and building potentially large-scale software [6, 13]. As an integrated version control and build system, Vesta provides several advantages over traditional version control systems like RCS and CVS, and over the build program Make.

      Vesta’s build language is a full-fledged programming language that is functional (that is, side-effect free), modular, dynamically typed, and lexically scoped. Its value space contains booleans, integers, text strings, lists, closures, and bindings.
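
      To make the “precise dependencies” idea concrete, here is a minimal Python sketch (not Vesta’s language, and not its actual algorithm): a costly call is cached against only the inputs it actually read, so a change to an input it never touched still hits the cache. All names (`TrackingEnv`, `PreciseCache`, `compile_unit`) are hypothetical.

      ```python
      _MISSING = object()


      class TrackingEnv:
          """Wraps a dict of inputs and records which keys a function reads."""

          def __init__(self, values):
              self._values = values
              self.reads = {}

          def __getitem__(self, key):
              value = self._values[key]
              self.reads[key] = value
              return value


      class PreciseCache:
          """Caches results keyed on the inputs a call actually read."""

          def __init__(self, func):
              self.func = func
              self.entries = []  # list of (reads, result) pairs
              self.misses = 0

          def __call__(self, values):
              for reads, result in self.entries:
                  # Hit if every input the earlier call read is unchanged.
                  if all(values.get(k, _MISSING) == v for k, v in reads.items()):
                      return result
              self.misses += 1
              env = TrackingEnv(values)
              result = self.func(env)
              self.entries.append((dict(env.reads), result))
              return result


      def compile_unit(env):
          # Hypothetical "expensive" step: reads main.c and util.h, never README.
          return "obj(" + env["main.c"] + "+" + env["util.h"] + ")"


      build = PreciseCache(compile_unit)
      first = build({"main.c": "v1", "util.h": "v1", "README": "v1"})
      second = build({"main.c": "v1", "util.h": "v1", "README": "v2"})  # cache hit
      third = build({"main.c": "v2", "util.h": "v1", "README": "v2"})   # rebuild
      ```

      Only the first and third calls run `compile_unit`; the README change alone does not invalidate anything.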

      Since you mentioned Bazel, FWIW the people who developed it and the accompanying VFS (that worked with immutable content-addressed files) were on my team at Google way back in 2005-2006. There was a joke from some of our older coworkers that we were “re-inventing ClearCase”, and the Vesta system was pointed out several times, since many early employees at Google worked at DEC (acquired by Compaq).


      FWIW I do agree that some kind of distributed content-addressable store (which sounds like IPFS or DVC) is a prerequisite for making it happen in the open source world.
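
      For readers unfamiliar with the term, a content-addressable store can be sketched in a few lines of Python. This is an in-memory toy under my own assumptions (the `ContentStore` name and SHA-256 as the address function are illustrative choices), not how IPFS or DVC are implemented.

      ```python
      import hashlib


      class ContentStore:
          """In-memory content-addressed store: the key is the SHA-256 of the bytes."""

          def __init__(self):
              self._blobs = {}

          def put(self, data: bytes) -> str:
              digest = hashlib.sha256(data).hexdigest()
              self._blobs[digest] = data  # identical content is stored only once
              return digest

          def get(self, digest: str) -> bytes:
              data = self._blobs[digest]
              # Re-hash on read so a corrupted blob is detected.
              if hashlib.sha256(data).hexdigest() != digest:
                  raise ValueError("blob does not match its address")
              return data

          def __len__(self):
              return len(self._blobs)


      store = ContentStore()
      a = store.put(b"int main() { return 0; }\n")
      b = store.put(b"int main() { return 0; }\n")  # same bytes, same address
      ```

      Verifying the hash only proves that the bytes match their address; it says nothing about who produced them.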

      But I think one problem is making it portable across dev environments / OSes, which are pretty homogeneous in big companies but heterogeneous among git / Mercurial users. And the fact that caches can’t be trusted, at least without auth, and auth is a big hairy problem.

      Mostly I avoid the issue by avoiding building huge C++ projects in the open source world :-/ That’s where these integrated VCS / build systems really shine. And that might not be a bad thing. Open source projects can be and generally should be smaller and more decoupled than corporate projects.

      1. 1

        Ha, yes, ClearCase is what I always think about- as crufty as it became, it still seems to encompass pretty much all the key use cases.

        1. 1

          Some quick googling does not turn up the shortcomings of ClearCase. I found a few rants but most of them are more about their setup or project management than the tool.

          One issue with ClearCase seems to be that branching/versioning is file-based instead of the snapshot-based way everybody loves since Subversion.

          1. 3

            Directory versioning can cause exciting moments, let’s just put it that way.

            Clearcase is slooow, and the versions I worked with did not have atomic commit of a change set. So you might commit 50 files, but only 49 went through. Hopefully you don’t have a conflict!

            Clearcase also has the notion of exclusive checkouts. This causes all kinds of usability issues when someone has a checkout, but hasn’t gotten around to unchecking it out.

            Clearcase also was Extremely Unix, and working with it on Windows machines was suboptimal. So that collides heavily with a lot of professional embedded tools that we were using.

            Clearcase was also very much designed in the mid 80s with mid 80s assumptions about networks, etc.

            It was eventually bought by IBM via an M&A chain and largely left to moulder.

            Clearcase did do binaries very nicely. Clearmake allowed building manifests for what exactly went into a build. Generated binaries are tied into the version control system as artifacts.

            A rebuild of ClearCase using modern source control concepts (changesets, 3 way diff), would be fricking fantastic for enterprise level work, it’d obsolete git at the enterprise level. TBH I suspect Google has, essentially, but they haven’t released it.

            I went deep into clearcase years and years ago, I have forgotten a few of the terms, but they’d come rushing back. Happy to answer further questions.

            1. 2

              Interesting.

              I also believe that it would be interesting at the enterprise level. However, not so much for Open Source projects, so why develop an Open Source tool for this case? The audience would be big projects like Apache, Mozilla, and Android, and such projects are hard to migrate. One could probably make a startup out of this. Maybe others like PlasticSCM are already ahead there.

              How does ClearCase interact with the normal file system? Their website suggests that you would primarily use a web interface or a virtual file system integrated in your IDE. Eventually, some parts of a repository have to become normal files though. Is it like Subversion where you check out subdirectories to work only on parts of a repo?

              How does ClearCase interact with tools like compilers? These tools require normal files as input. Does it just break if you forget to checkout some relevant parts of the repo? How do the build manifests track the tooling? For example, if you update your compiler, how do you update the build artifacts? The modern monorepo build tools like Bazel require the compiler to be part of the repo. Did they already do it like that in the 80s?

              1. 4

                Sorry for the delay, it was a busy family-filled weekend.

                Yes, this is something that would be appropriate for enterprises or mega OSS projects. A startup would work, if it could survive the enterprise sales cycle. I’d probably look at doing an Open Core business model. The centralization would ideologically offend many in the modern OSS world. ::shrug::

                How does ClearCase interact with the normal file system? Their website suggests that you would primarily use a web interface or a virtual file system integrated in your IDE. Eventually, some parts of a repository have to become normal files though. Is it like Subversion where you check out subdirectories to work only on parts of a repo?

                ClearCase mounts volumes on Linux and Windows. Based on a specific “config spec”, the commands select the versions you want from a specific VOB, which then appear as file-system lookalikes. This is a strict superset of Subversion’s approach.

                Note, if I was reimplementing this, I’d also go the config spec route and check out subsets of the repository, and have our modern knowledge of patch algebras address the versioning of the checkouts and how they interact with the centralized system. Certainly in large systems we’ve reimplemented these ideas in half-baked ways (git submodules, hg subrepos, etc.; I did an hg extension called guestrepos that had a very config-speccy idea years ago).
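
                As a rough illustration of the config spec idea, here is a simplified Python sketch: an ordered list of rules, where the first rule that matches a path and names an existing version decides what the view shows. The rule format, labels, and `REPO` contents are all hypothetical; real config specs have their own syntax and many more features.

                ```python
                from fnmatch import fnmatch

                # Hypothetical rules: (path pattern, version label). First match wins.
                CONFIG_SPEC = [
                    ("lib/*", "REL_1.0"),
                    ("*", "LATEST"),
                ]

                # Hypothetical repository: path -> {label: content}.
                REPO = {
                    "lib/util.c": {"REL_1.0": "util v1", "LATEST": "util v3"},
                    "app/main.c": {"REL_1.0": "main v1", "LATEST": "main v2"},
                    "app/new.c": {"LATEST": "new v1"},
                }


                def select_view(repo, spec):
                    """Map each path to the content chosen by its first applicable rule."""
                    view = {}
                    for path, versions in repo.items():
                        for pattern, label in spec:
                            # A rule applies only if the pattern matches AND the version exists.
                            if fnmatch(path, pattern) and label in versions:
                                view[path] = versions[label]
                                break
                    return view


                view = select_view(REPO, CONFIG_SPEC)
                ```

                Here `lib/` is pinned to the release label while everything else follows the latest version, which is the kind of mixed view a plain subdirectory checkout cannot express.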

                How does ClearCase interact with tools like compilers?

                Files are in the volume. Clearmake hooks into the driver and audits what goes through it.

                These tools require normal files as input. Does it just break if you forget to checkout some relevant parts of the repo?

                IIRC, either (1) yes, or (2) you don’t have a full description of what tools and their versions went into your system.

                How do the build manifests track the tooling? For example, if you update your compiler, how do you update the build artifacts?

                Rerun a build, generate a new derived object type that is logged in the vob.

                The modern monorepo build tools like Bazel require the compiler to be part of the repo. Did they already do it like that in the 80s?

                Well, certainly in the 00s in the Clearcase shop I was in, we did.

                1. 1

                  Thank you. Your comments contain more detailed information than their whole website. :)

      2. 7

        My ADHD-inattentive kicked in about 10 lines in -

        With that bias exposed: what is this article even about? Is there a problem the author is attempting to solve? There is most certainly a gap -

        “So my conclusion is that there is a gap which none of the current Free Software tools can fill.”

        But the gap is actually two, both of his own creation:

        1. “Why don’t we use the same data store for both versioning and building?”

        and, now that we’re living in the author’s world,

        2. “[…] now that we store the source code and the build artifacts in the same storage, why do we use different tools?”

        There doesn’t appear to be a solution, because there never was a problem. Certainly not one - of which I am aware or that the author didn’t just make up out of the blue - upon which I would “invest a decade of work on.”

        Am I missing something? Did I completely miss some bit of irony or sarcasm in the original post?

        1. 6

          You are right, the problem is not well specified. The problem is software development at scale. If you work with a hundred developers and a repository of a few gigabytes, then git reaches its limits. We use submodules, LFS, sparse checkouts, and shallow clones. It is not pretty. Yes, I know that Microsoft made it work somehow, but knowing few of the details, I believe they sacrificed a lot of git.

          For a monorepo approach using only Free Software, I would try Subversion and Bazel, but I have not had much opportunity to pursue that combination yet. Subversion certainly has some well-known problems, and it does not get much attention to improve it. Continuing these thoughts, I wanted to write down one aspect of it. Thus, this blog post.

          1. 1

            Thank you for your reply, and your honesty.

            In continuation of writing down one aspect of a thing to more clearly define a problem/solution: perhaps an easily attainable first step could not be storing binaries or other build artifacts in a version control system?

            1. 1

              I agree with that sentiment, using separate version control systems. Source code and binaries go together like oil and water. In my case, I don’t want to keep every version of a binary, only the versions I’ve tagged as release. In the case of source code, I do want to keep every version I’ve committed so I can compare and revert if needed. In the past, I’ve switched out my individual repository with newer, best-of-class (CVS to Subversion for source code, and Maven Central + local repositories to Artifactory for binaries). I also was able to migrate from Maven to Ant+Ivy for the build process. I feel like if I ever went with a single monolithic build manager, I’d lose the flexibility of these interchangeable pieces. My builds are still reproducible because my code revisions are marked in the release metadata of each binary, and a complete archive of the source code and documentation is also bundled alongside each corresponding binary - that is just the usual Java approach.

              1. 2

                Don’t you sometimes push something to CI and unit tests fail there? Do you debug that by building it again locally?

                1. 2

                  Yes, frequently. Jenkins marks such builds as failed and they show up red on the dashboard. More often, it is an integration test that fails further down the pipeline, and those are easy to spot as well. The unit tests can also be run in the local development environment prior to committing to the repository, but I’ve found that it is faster to let Jenkins handle it. The results of those unit tests and builds can be retrieved back into the local development environment via Mylyn. It is a closed loop.

          2. 2

            If you think about large repositories

            is a pretty big problem in my current life as a developer in a monorepo that is home to hundreds of “micro”services & scratchpad of hundreds of full-time engineers that make hundreds of commits/workday. We use git and pants, and it still /sorta/ holds together, but definitely past obnoxious at this point.

          3. 4

            I use nix for a project and the way I do it:

            • The source IS the artifact, there is no build step.
            • Installing the system onto a machine and building it are the same thing because of the way nix package manager works.
            1. 3

              If you want to go all out with this you could use --pure-eval, which enforces reproducible evaluations.

            2. 3

              As far as I know, this has been done before, namely by Vesta. Unfortunately, the website makes it horribly difficult to figure out how it all works together.

              1. 3

                Yes! I’ve never used it, but once I downloaded the source code because some co-workers mentioned it:

                https://lobste.rs/s/fosip5/should_version_control_build_systems#c_bkq3ve

                IIRC it was like 200K lines of C or C++ … quite a big effort! The paper I linked there gives a description.

                It sounds pretty cool. There is always an argument whether the build language needs to be a “real” programming language and they answered in the affirmative there (correctly IMO). Much better than Make, which started out as a config language and then grew into a really bad Lisp.

                IIRC the GNU Make Standard Library uses something like Peano numbers since it doesn’t have real integers. And Android even used that for several years.
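
                A sketch of what that kind of arithmetic looks like, transliterated into Python rather than Make: a number is a list of `x` tokens, and arithmetic is list manipulation. The function names are mine, and this illustrates the general unary-encoding trick rather than the library’s actual code.

                ```python
                def encode(n):
                    """Represent a non-negative integer as a list of n 'x' tokens."""
                    return ["x"] * n


                def decode(xs):
                    """Counting the tokens recovers the integer."""
                    return len(xs)


                def add(a, b):
                    """Addition is list concatenation."""
                    return a + b


                def multiply(a, b):
                    """Multiplication emits one copy of b for every token of a."""
                    return [x for _ in a for x in b]


                def subtract(a, b):
                    """Subtraction drops as many tokens as b has, flooring at zero."""
                    return a[len(b):]


                five = add(encode(2), encode(3))
                twelve = multiply(encode(3), encode(4))
                three = subtract(encode(5), encode(2))
                ```

                Every operation costs time and memory proportional to the magnitude of the numbers, which is why this only suits the small counters a build script needs.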

              2. 3

                Sort of. What we need is not only repeatable builds and repeatable dependencies but a complete dependency graph so that we can recompile only what actually needs to be recompiled. But this unfortunately will require us to leave the familiar world of interface points that consist of a dead hierarchical tree of blobs without meaning. This concept is something most programmers are so invested and immersed in that I shudder to think of how bad our situation gets before we’re willing to abandon it.

                We waste so much expensive dev time and electricity by building all our software through one-way gates where text goes in, it’s parsed, and eventually opaque blobs come out but all the metadata is thrown away, unable to be used by any other parts of the machine.

                1. 1

                  Machine inefficiency is just not expensive enough for us to care, and the argument you stated comes down to “it’s not smart, we need to get enlightened”. We will get better solutions, but not by saying the past was stupid.

                  1. 1

                    I’m not saying the past was stupid at all. Just that (leaving aside the issue of feeding back compiler insights to programmers), in order to make serious progress in deterministic and reliable builds, we need to provably know what “goes into” any step of the build process, and additionally, which parts of the input taint which parts of the output. And we can never get there while the interface between steps is just a posix-ish filesystem, and the build agents can reach out and fetch information the earlier steps in the pipeline don’t know they’re using.

                    1. 1

                      There isn’t really a reason you can’t dump lots of objects to disk to explain choices. Just because the tooling doesn’t do that doesn’t mean the foundations are bad.

                  2. 1

                    Caching and distributed building have existed for decades. Unfortunately, existing implementations and integrations are quite poor.

                    Reproducible builds are a requirement for better distributed building: https://wiki.debian.org/ReproducibleBuilds

                    1. 1

                      Sure, but a) we’re not caching enough separate steps, and b) we’re indexing/invalidating our caches by what we think is the material going in to produce a result, and we’re wrong most of the time, otherwise nobody would ever need to do a clean build.

                      1. 1

                        I agree, that’s why I wrote that the existing implementations are quite poor. The https://reproducible-builds.org/ project is now making builds deterministic and then signing, sharing and comparing the artifacts. It’s a necessary step to be able to implement more fine-grained caching.

                  3. 3

                    Related to merging VCSes, but not content-addressability: I was at an industry event for IBM i users. I noticed that throughout the event, people were talking about “change management” tools. I had initially taken this to be a reference to VCSes like git, but then I realized that the sessions talking about it and the vendors trying to sell it were referring to something bigger in scope.

                    From what I gathered, in this world, a change management tool isn’t really just a version control tool with auditing, but many of them appear to be application lifecycle management tools, tracking source and other kinds of objects like database schema (which makes sense on i because the DB is integral to the operation of the system and the raison d’être of applications; not so much the Unix world, where things like this are outsourced to ActiveRecord and its ilk), building them, and then deploying them and managing environments. In a way, it seems actually quite “devops”, somewhat in the tradition of the (increasingly nebulous) word. It’s an interesting concept if it is what I think it is.

                    However, it seems these tools aren’t as common as they’d make you believe. Leadership at companies forces harmony in tooling like git and Jenkins, and i developers aren’t used to the “stick a bunch of tools together” model, combined with having to rebuild what these tools offer with that. And then a lot of companies don’t even seem to use any tools for this, just developing on production…

                    I wonder how well this works and how common it really was; if it actually provides structure and process appropriate for a system integrated more tightly than Unix’s grab-bag of tools that we could learn from, or if it’s some bureaucratic nightmare developers actually hate but suits and salesmen love.

                    1. 3

                      As a point of inspiration/reference, Monticello from the Smalltalk world comes close to what you are describing.

                      1. 2

                        There are a lot of assumptions in this blog post: (1) the whole world uses git, (2) no existing build systems are cross-platform. I use Artifactory for artifacts, Ant and Ivy for building and dependency resolution, and Subversion for code version control. Jenkins performs continuous integration, deployment and unit testing. My build system runs on Windows, Linux and Mac without any changes, and it is entirely open source. Of course, my stack is Java, so horses for courses.

                        1. 2

                          Could you elaborate a little more on your setup? How large is the repository? How many people work in it? Do you perform change-based code reviews?

                          1. 2

                            I have a mini write-up here. There are 7 developers from two organizations committing to the code repository, and two other organizations have access to the build artifacts but not the source code repository. Our code reviews are automated through Checkstyle, FindBugs and a few other static analysis plugins that run during integration builds and unit tests.

                        2. 1

                          How about this: pseudo files in the VCS that have “parents” in the files they need. Possibly a metadata attribute to specify how to produce the artefact. It would merge the two functionalities together.
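
                          One way to picture this, as a minimal Python sketch under my own assumptions: the “tree” is a plain dict standing in for a VCS snapshot, and `PseudoFile`, `parents_digest`, and `materialize` are hypothetical names. The pseudo file records its parents and a recipe, and is rebuilt only when a parent’s content changes.

                          ```python
                          import hashlib
                          from dataclasses import dataclass
                          from typing import Callable, Dict, List, Optional


                          @dataclass
                          class PseudoFile:
                              path: str
                              parents: List[str]                      # tracked files it is derived from
                              recipe: Callable[[List[bytes]], bytes]  # metadata: how to produce it
                              built_from: Optional[str] = None        # digest of parents at last build
                              content: Optional[bytes] = None


                          def parents_digest(tree: Dict[str, bytes], parents: List[str]) -> str:
                              """Hash the parent paths together with the hash of each parent's content."""
                              h = hashlib.sha256()
                              for p in parents:
                                  h.update(p.encode())
                                  h.update(b"\0")
                                  h.update(hashlib.sha256(tree[p]).digest())
                              return h.hexdigest()


                          def materialize(tree, pf):
                              """Return (content, rebuilt); rerun the recipe only if a parent changed."""
                              digest = parents_digest(tree, pf.parents)
                              if pf.built_from == digest:
                                  return pf.content, False
                              pf.content = pf.recipe([tree[p] for p in pf.parents])
                              pf.built_from = digest
                              return pf.content, True


                          tree = {"a.txt": b"hello ", "b.txt": b"world"}
                          bundle = PseudoFile("bundle.txt", ["a.txt", "b.txt"], recipe=b"".join)
                          out1, rebuilt1 = materialize(tree, bundle)   # first build
                          out2, rebuilt2 = materialize(tree, bundle)   # parents unchanged: reused
                          tree["b.txt"] = b"there"
                          out3, rebuilt3 = materialize(tree, bundle)   # parent changed: rebuilt
                          ```

                          The sketch ignores the tooling itself (compiler version, flags), which would also have to count as a parent for the result to be trustworthy.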