1. 50

  2. 11

    I’ve found Buildkite to strike the right balance between flexibility and configurability. Unfortunately, it is quite pricey. Everything else is needlessly complex.

    1. 6

      Buildkite has per-user pricing with bring-your-own compute, while most CI providers’ pricing is based on the compute time used by builds.

      For some organizations, including $WORK, that works out much more cheaply.

      We have a very small team working on quite a large app, with a test suite that benefits tremendously from parallelism. Spinning up large spot instances to run tests costs about $0.30 per build and saves 30-40 minutes of developer time versus our previous provider (who provided a small number of slow machines to run our tests and charged much more than we pay Buildkite).

      1. 5

        +1 to Buildkite. We’ve been using it for quite a while at Shopify and the flexibility it provides is great for us.

    2. 10

      I built a CI system a while ago; I haven’t quite finished/released it yet, but the concepts are pretty simple:

      1. GitHub sends a webhook event.
      2. Clone or update the repo.
      3. Set up a sandbox.
      4. Run a script from the repo.
      5. Report back the status to GitHub.

      And that’s pretty much it. Responsibility for almost everything else is in the repo’s script (which can be anything, as long as it’s executable). The CI just takes care of setting up the environment for the script to run in. You can build your own .tar images or pull them from DockerHub.
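
      A rough sketch of steps 2–5 in Python (the run-ci script name is made up; step 1’s webhook receipt and the GitHub status POST are left out):

```python
import os
import subprocess

def sync_repo(repo_url, repo_dir, commit):
    """Step 2: clone on the first run, then fetch and check out the commit."""
    if not os.path.exists(repo_dir):
        subprocess.run(["git", "clone", repo_url, repo_dir], check=True)
    subprocess.run(["git", "-C", repo_dir, "fetch", "origin"], check=True)
    subprocess.run(["git", "-C", repo_dir, "checkout", commit], check=True)

def run_ci_script(repo_dir):
    """Steps 3-5: run the repo's own executable (sandbox setup elided)
    and map its exit code to the status reported back to GitHub."""
    result = subprocess.run(["./run-ci"], cwd=repo_dir)
    return "success" if result.returncode == 0 else "failure"
```

      Everything interesting then lives in the repo’s own script, which can be any executable.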

      Overall there’s very little “magic” involved, and it actually works quite well. Aside from various minor issues, the one big thing I still need to figure out is a plan for adding cross-platform support in the future.

      Perhaps this won’t cover every CI use case – and that’s just fine; not everything needs to cover every last use case – but it probably covers a wide range of common use cases. I just want something that doesn’t require having a Ph.D. in YAML with a master’s in GitHub Actions. I used Travis before, but I ran out of free credits after they changed their pricing strategy a while ago; so I tried using GitHub Actions, but setting up PostgreSQL for the integration tests failed for reasons I couldn’t really figure out, and debugging these kinds of things is a horrible, time-consuming experience: make a minor change, pray it works, push, wait a few minutes for the CI to kick in, deal with that awful web UI where every click takes >2 seconds, discover it failed again for no clear reason, try something else, repeat 20 times, consider giving up your career in IT to become a bus driver instead. Setting up PostgreSQL and running go test ./... really shouldn’t be this hard.

      At any rate, writing the initial version of the above literally took me less time than trying to set up GitHub actions. One of the nice things is that you could write a program to parse and run that GitHub Actions (or Travis, Circle-CI, etc.) YAML file if you want to – it’s really flexible because, as you wrote in the article, the basic idea is to just provide “remote execution as a service”.

      1. 5

        so I tried using GitHub Actions, but setting up PostgreSQL for the integration tests failed for reasons I couldn’t really figure out, and debugging these kinds of things is a horrible, time-consuming experience: make a minor change, pray it works, push, wait a few minutes for the CI to kick in, deal with that awful web UI where every click takes >2 seconds, discover it failed again for no clear reason, try something else, repeat 20 times, consider giving up your career in IT to become a bus driver instead. Setting up PostgreSQL and running go test ./… really shouldn’t be this hard.

        Kinda seems like the point of the article: you should be able to run exactly what’s in the pipelines locally to troubleshoot. But at that point, why not collapse it into the local build system and unify them?

        1. 3

          Run a script from the repo.

          This is how I try to use GitHub actions, by keeping the yaml minimal and only launching the single script there. Here’s a representative example: https://github.com/matklad/once_cell/blob/master/xtask/src/main.rs

          The bit where this annoyingly falls down is specifying the host environment. I can’t say, from within the script, “run this on Windows, Mac, and Linux machines”, so this bit still lives in YAML. The script, if needed, contains a match current_os.
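
          For illustration, that in-script dispatch sketched in Python rather than Rust (the install commands are placeholders), leaving only the machine fan-out to the CI config:

```python
import platform

def install_cmd(os_name=None):
    """Pick the platform-specific step inside the script itself."""
    os_name = os_name or platform.system()  # "Linux", "Darwin", "Windows"
    return {
        "Linux": ["apt-get", "install", "-y", "some-dep"],
        "Darwin": ["brew", "install", "some-dep"],
        "Windows": ["choco", "install", "some-dep"],
    }[os_name]
```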

          A more complex case of this failure is when I need coordination between machines. I have a single example of that (in rust-analyzer’s CI). Release binaries for the three OSes are built by three different builders. Then a single builder needs to collect the three results and upload a single release with multiple artifacts.

          Though the last example arguably points to the problem in a different system. Ideally, I’d just cross-compile from Linux to the three OSes, but, last time I checked, that’s not quite trivial with Rust.

          1. 3

            Ideally, I’d just cross-compile from Linux to the three OSes, but, last time I checked, that’s not quite trivial with Rust.

            Back when I maintained a bunch of FreeBSD ports I regularly had people send me patches to update something or the other, and they never bothered to actually run the program and do at least a few basic tests. Sometimes there were runtime errors or problems – sometimes the app didn’t even start.

            That it compiles doesn’t really guarantee that it also runs, much less runs correctly. Rust gives some harder guarantees about this than for example C or Python does, but if you try to access something like C:\wrong-path-to-user-profile\my-file on startup it can still crash on startup, and you’ll be shipping broken Windows binaries.

            For my Go projects I just set GOOS and hope for the best, but to be honest I have no idea if some of those programs work well on Windows. For example my uni program does some terminal stuff, and I wouldn’t be surprised if this was subtly broken on Windows. Ideally you really want a full Windows/macOS/etc. environment to run the tests, and you might as well build the binary in those environments anyway.

            1. 3

              I do test in different envs. Testing is comparatively easy, as you just fire three completely independent jobs. Releases add a significant complication though.

              You now need a fourth job which depends on the previous three, and you need to ship artifacts from the Linux/Mac/Windows machines to the single machine that does the actual release. This coordination adds up to a substantial amount of YAML, and it is this bit that I’d like to eliminate via cross-compilation.

              1. 1

                Can’t you use the test job to also build the binaries; as in, IF tests succeeded THEN build the binary? Or is there a part that I’m missing?

                1. 1

                  Yeah, I feel like I am failing to explain something here :)

                  Yes, I can, and do(*), use the same builder to test and build release artifacts for a particular platform. That is not the hard problem. The hard problem is making an actual release afterwards. That is, packaging binary artifacts for different platforms into a single thing, and calling that bundle of artifacts “a new release”. Let me give a couple of examples.

                  First, here’s the “yaml overhead” for testing on the three platforms: https://github.com/matklad/once_cell/blob/064d047abd0b76df31b0d3dc88d844c37fc69dd1/.github/workflows/ci.yaml#L5. That’s a single line to specify different builders. Aesthetically, I don’t like that this is specified outside of my CI build process, but, practically, it’s not a big deal. So, if in your CI platform you add an ArpCI.toml to specify just the set of OSes to run the build on, that’d be a totally OK solution for me for cross-platform testing.

                  Second, here’s the additional “yaml overhead” to do release:

                  Effectively, for each individual builder I specify “make these things downloadable”, and for the final builder that makes the release I specify “wait for all these other builders to finish & download the results”. What makes this situation finicky is the requirement for coordination between different builders. I sort-of specify a map-reduce job here, and I need a pile of YAML for that! I don’t see how this could be specified nicely in ArpCI.toml.
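
                  The shape being described is roughly a map-reduce over builders; written as plain code on one machine (a sketch with placeholder build/release functions), the coordination all but disappears:

```python
from concurrent.futures import ThreadPoolExecutor

def build(os_name):
    """'Map': one builder per OS produces an artifact (placeholder)."""
    return f"binary-for-{os_name}"

def release(artifacts):
    """'Reduce': the final builder bundles everything into one release."""
    return sorted(artifacts)

def cut_release(oses):
    # the three builders run independently; the release step waits on all
    with ThreadPoolExecutor() as pool:
        artifacts = list(pool.map(build, oses))
    return release(artifacts)
```

                  The YAML pain comes from having to express exactly this across separate machines instead.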

                  To sum up:

                  • I like “CI just runs this program, no YAML” and successfully eliminated most of the YAML required for tests from my life (a common example here is that people usually write “check formatting” as a separate CI step in YAML, while it can be just a usual test instead)
                  • A small bit of “irreducible” YAML is “run this on three different machines”
                  • A large bit of “irreducible” YAML is “run this on three machines to produce artifacts, and then download artifacts to the fourth machine and run that”.
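
                  For the first bullet, the “check formatting is just a usual test” trick needs nothing more than shelling out from the test suite; a Python sketch (the exact formatter command is an example):

```python
import subprocess

def check_passes(cmd):
    """Run any check command (e.g. a formatter in --check mode) and
    report whether it succeeded, so it becomes just another test."""
    return subprocess.run(cmd).returncode == 0

# e.g. assert check_passes(["cargo", "fmt", "--", "--check"])
```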

                  Hope this helps!

                  (*) a small lie to make the explanation easier. Rather, I use the not-rocket-science rule to ensure that code in the main branch always passes tests, and release branches are always branched off from the main branch, so each commit being released was tested anyway.

                  EDIT: having written this down, I think I now better understand what frustrates me most here. To do a release I need to solve a “communication in a distributed system” problem, but the “distributedness” is mostly accidental complexity: in an ideal (Go? :) ) world, I’d be able to just cross-build everything on a single machine.

          2. 3

            How does this compare to sourcehut’s CI? They use YAML, but you can mostly avoid it. For example here are Oil’s configs, which mostly invoke shell scripts:


            I think you are describing a “remote execution service”, as the blog post calls it. That’s basically what I use sourcehut as.

            I think such services are complementary to what I described in a sibling comment: basically a DAG model (as the OP wants) and an associated language in “user space”, not in the CI system itself. If most of the build is in user space then you can debug it on your own machine.

            1. 2

              I didn’t look too closely at sourcehut as I don’t like sourcehut for various reasons.

              I don’t think you need any sort of DAG. If you want that, then implement it in your repo’s build system/script. The entire thing is essentially just “run a binary from a git repo”.

              1. 3

                I didn’t look too closely at sourcehut as I don’t like sourcehut for various reasons.

                Hi, Martin. May I ask what those reasons are?

                1. 2

                  I don’t really care much for the sourcehut workflow, and I care even less for the author and his attitude. I don’t really want to expand on that here as it’s pretty off-topic, but if you really want to know you can send me a DM on Twitter or something.

                  1. 1

                    but if you really want to know you can send me a DM on Twitter or something.

                    I will even do it before you reply :)

                2. 2

                  Yes that’s compatible with what I’m doing. Both my Travis CI and sourcehut builds just run shell scripts out of the git repo. And then they upload HTML to my own server at: http://travis-ci.oilshell.org/ . So it could probably run on your CI system.

                  I want parallelism, so I ported some of the build to Ninja, and I plan to port all of it. Ninja runs just fine inside the CI. So I guess we’re in agreement that the CI system itself can just be a dumb executor.

                  Although, going off on a tangent – I think it’s silly for a continuous build to re-clone the git repo every time, re-install Debian packages, PyPI packages, etc.

                  So I think the CI system should have some way to keep ephemeral state. Basically I want to use an existing container image if it already exists, or build it from scratch if it doesn’t. The container image doesn’t change very often – the git repo does.

                  Travis CI has a flaky cache: mechanism for this, but sourcehut has nothing as far as I can tell. That makes builds slower than they need to be.

                  1. 3

                    Although, going off on a tangent – I think it’s silly for a continuous build to re-clone the git repo every time, re-install Debian packages, PyPI packages, etc.

                    So I think the CI system should have some way to keep ephemeral state. Basically I want to use an existing container image if it already exists, or build it from scratch if it doesn’t. The container image doesn’t change very often – the git repo does.

                    Yeah, the way it works is that you’re expected to set up your own image. In my case this is just a simple script which runs xbps-install --rootdir [pkgs], frobs with a few things, and tars the result. You can also use DockerHub if you want, golang:1.16 or whatnot, which should be fine for a lot of simpler use cases.

                    You can then focus on just running the build. The nice thing is that you can run ./run-ci from your desktop as well, or run it on {Debian,Ubuntu,CentOS,Fedora,macOS,…}, or use mick run . to run it in the CI.

                    Setting these things up locally is so much easier as well; but it does assume you kind of know what you’re doing. I think that’s a big reason for all these YAML CI systems: a lot of devs aren’t very familiar with all of this, so some amount of abstraction makes it easier for them. “Copy/paste this in your YAML”. Unfortunately, this is a bit of a double-edged sword as it also makes things harder if you do know what you’re doing and/or if things break, like my PostgreSQL not working in GitHub (and besides, you can probably abstract all of the above too if you want, there’s no reason you can’t have an easy-image-builder program).

                    Splitting out these concerns also makes a lot of sense organisationally; at my last job I set up much of the Travis integration for our Go projects, which wasn’t strictly my job as I was “just” a dev, but it was a mess before and someone had to do it. Then, after the company got larger, a dedicated sysadmin was hired who would take care of these kinds of things. But sysadmins aren’t necessarily familiar with your application’s build requirements, or even Go in general, so their mucking about with the build environment would regularly and silently break the CI runs. Part of the problem here was that the person doing all of this was extremely hard to work with, but it’s a tricky thing as it requires expertise in two areas. I suppose that this is what “devops” is all about, but in reality I find that a lot of devops folk are either mostly dev or mostly ops, with only limited skills in the other area.

                    When this is split out, the ops people just have to worry about calling run-ci and making sure it runs cleanly, and the dev people only need to worry about making sure their run-ci works for their program.

                    Anyway, I should really work on finishing all of this 😅

                    1. 1

                      That makes sense, but can you build the image itself on the CI system?

                      That’s a natural desire and natural functionality IMO. And now you have a dependency: from run-ci to the task that builds the image that run-ci runs on! :) From there it is easy to get a DAG.

                      Oil has the use case discussed above too: you build on, say, a Debian image, but you want to test on an Alpine image, a FreeBSD image, an OS X image, etc. And those images need to be built/configured – they’re not necessarily stock images.

                      That sounds like a DAG too.

                      So I think there is something like an “inner platform” effect here. If you build a “simple” CI system, and lots of people use it, it will turn into a DAG job scheduler. And if you’re not careful, it might have the unfortunate property of only being debuggable in the cloud, which is bad.

                      I have noticed a similar design issue with cluster managers. A lot of times people end up building a cluster manager to run their cluster manager: to distribute the binaries for it, authenticate who can do so, to run jobs that maintain the cluster itself, etc.

                      So a CI is supposed to run build systems, but then it turns into a build system itself. I think a lot of the ones people complain about started out small, with tiny configurations (just like sourcehut), and then they grew DAGs and programming languages in YAML :-/ If sourcehut wants to satisfy a lot of use cases, it’s probably going to run into that problem.

                      1. 1

                        but can you build the image itself on the CI system?

                        Sure, there’s nothing really preventing you from doing that.

                        I suppose you could see it as a DAG; your oil repo depends on oil-images, which builds the images, which depends on a base Alpine/FreeBSD image as a bootstrap. Personally I think that’s shoehorning things a little bit; it’s a very “linear” graph: oil → oil-images → alpine|freebsd|macOS, and probably not really worth thinking about in terms of a DAG IMHO.

                        At the end of the day I think that no matter what you do, if your requirements are somewhat complex then your solution will be too. There’s tons of CI systems out there, and while I feel many are a bit lost in the zeitgeist of YAML programming, most are probably built by teams which include people smarter than me and if they haven’t found a good way to solve complex use cases then I probably won’t either. So the best we (or rather, I) can do is let you solve your own complex use case without interfering too much, which will usually be easier than applying a generic solution to a complex use case.

                        1. 1

                          Yeah, I’m not sure what the right solution is; just nodding my head at the porous line between CI systems and build systems. Neither Travis CI nor sourcehut has a DAG, so I think for all practical purposes it should be kept in “user space”: outside the CI system, in the build system.

                          I do think the “ephemeral state” problem is related and real. Travis CI has cache: but it’s flaky in practice. I’m toying around with the idea that images should be stored in something like git annex: https://news.ycombinator.com/item?id=26704946

                          So it would be cool if the CI system can pull from git annex, and then a CI job can also STORE a resulting image there, for a future job. I’m not sure if any existing systems work that way. I think they mostly have CANNED images – certainly sourcehut does, and I think Travis CI does too.

                          So in that way maybe you can build a DAG on top, without actually having the DAG in the CI system. If you can “reify” the image as part of the CI system itself.

                          1. 2

                            The way I do caching now is to mount /cache, which is shared across the repo, and you can do with that as you wish. It’s extremely simple (perhaps even simplistic), but it gives people a lot of flexibility to implement their own cache system, based on git annex for example.

              2. 1

                nektos/act lets you test GitHub Actions locally.

                There’s even a (quite large) Ubuntu image available that mirrors the Actions environment.

              3. 8

                We gave up on CI complexity and said enough is enough. Our job system already has all the complexity we need. We run Nomad, our CI system just executes nomad job files. So the CI system is basically as simple as ‘nomad run build.nomad’ on every commit.

                Then Nomad does all the actual work, and we monitor the nomad jobs, because we already have monitoring in place for nomad.

                1. 1

                  Interesting, I was debating that in my head while reading the article and what the author was laying out (without specifying K8s). Presumably your monitoring for failed jobs catches failed test runs, etc., then?

                  1. 2

                    Right, I mean we use Nomad (but the same could be done with k8s or whatever). Since we already use it for stuff we care about (read: our internal services, our external services, etc.), we already care about jobs that fail, etc.

                    The same amount of caring is also true of our CI/CD jobs, i.e. WE CARE. but we don’t need to care differently, really. The people involved in the caring might be different, but that’s just the boring bits.. :)

                    You could (and we did for a while) have your CI/CD system execute nomad run build.nomad and then run nomadwatch <allocationid> (where the allocation ID is output from the nomad run command)

                    and nomadwatch was just some crappy wrapper script that would watch a job allocation go through its paces and tail the stdout and stderr of the job run. This way the CI/CD UI could continue to be used…

                    We eventually stopped that too, because it broke once and we didn’t care enough to fix it. We already have Prometheus & Grafana and whatever else watching for Nomad job failures, and Prometheus Alertmanager can yell at us in our chat room(s) on said failures.
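
                    For reference, such a nomadwatch wrapper essentially reduces to a couple of nomad alloc commands (the subcommands are real; this helper is a sketch):

```python
def nomadwatch_cmds(alloc_id):
    """The commands the wrapper effectively runs: follow the
    allocation's stdout and stderr, then check how the job ended."""
    return [
        ["nomad", "alloc", "logs", "-f", alloc_id],             # tail stdout
        ["nomad", "alloc", "logs", "-f", "-stderr", alloc_id],  # tail stderr
        ["nomad", "alloc", "status", alloc_id],                 # final status
    ]
```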

                2. 6

                  If you haven’t already read this research paper, you’ll find it amazing.

                  The most academically interesting breakthrough was to realise that build systems can be split into something that decides what to rebuild, and something that orders the rebuilding, putting build systems in a two-dimensional table.
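
                  As a toy model (not the paper’s formulation), that two-part split might look like:

```python
import graphlib  # stdlib since Python 3.9

def schedule(deps):
    """Scheduler: order targets so every dependency comes first.
    deps maps each target to the targets it depends on."""
    return list(graphlib.TopologicalSorter(deps).static_order())

def rebuild(deps, dirty):
    """Rebuilder: rebuild a target if it, or anything it depends on,
    changed; dirtiness propagates along the schedule."""
    stale, plan = set(dirty), []
    for target in schedule(deps):
        if target in stale or any(d in stale for d in deps.get(target, ())):
            plan.append(target)
            stale.add(target)
    return plan
```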

                  1. 5

                    I very much agree with the thesis that CI systems have evolved into awkward build systems configured with YAML.

                    Build systems should be written in a “real” language, and the CI should simply be dag-build-system //target (I think Ninja is a good candidate; see below). You shouldn’t have 2 different configurations written in 2 different amalgamations of languages. It should be more like a program written in one language.

                    FWIW I think our thinking is pretty aligned, since I used Bazel for many years (and even contributed the Python build rules a long time ago, when my teammates were developing “Blaze”)

                    I find it ironic that everyone complains about what a bad language shell is, but YAML is the new “cloud” shell language, in that it coordinates processes/tasks.

                    Because YAML is way worse than shell! It embeds shell, so it has all of shell’s problems, and it adds a whole bunch of its own (the “Norway problem”, etc.).

                    Related thread: 2021 will be the year of shell and YAML :-(

                    @indygreg I think there is a way to get where you want to be. I don’t think we should have to wait for Github/Microsoft to implement something like this.

                    I see two better strategies:

                    1. Fork something like sourcehut’s continuous build to add DAG-like functionality. (And eventually contribute it back I would hope.) I’m using it, and it’s nice although minimal. It’s a good starting point. I have seen people talk about running their own sr.ht build nodes, but I haven’t followed it closely.
                    2. (my preferred option) Develop a system that can run on multiple clouds. This way you don’t have to spin up your own remote execution service, which is a lot of work and maintenance.

                    Oil’s continuous build already does this! It runs on Travis CI and sourcehut. It’s a big shell script with “manual” parallelism: http://travis-ci.oilshell.org/

                    I purposely do not use all the bells and whistles of each system, to avoid lock-in. The only thing I use is package installation and the build images, and I believe those can be replaced by user space / rootless containers easily (notes below).

                    Originally I wanted to add my own DAG engine to Oil, but I think that’s a bit too much work right now. So I started using Ninja recently and it turned out great. Ninja isn’t perfect, since it works on timestamps rather than being content-based, but there’s a cool hack in the link below with OSTree.

                    The way I ported to Ninja was to separate all my shell scripts into “build-actions.sh” and then a Python program that generates the DAG. I would like to use Oil as a configuration language, but Python is good enough for bootstrapping. (And of course Starlark was literally Python to begin with.)
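
                    The shape of that generator might look like this (the build-actions.sh name and rule layout are illustrative, not Oil’s actual files): the Python side only emits the Ninja DAG, and every command bottoms out in the shell script:

```python
def ninja_file(tasks):
    """tasks: {output: (action_name, [inputs])} -> build.ninja text.
    Every build edge shells out to the same actions script."""
    lines = [
        "rule action",
        "  command = ./build-actions.sh $name $out $in",
    ]
    for out, (name, inputs) in tasks.items():
        lines.append(f"build {out}: action {' '.join(inputs)}")
        lines.append(f"  name = {name}")
    return "\n".join(lines) + "\n"
```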

                    So basically the strategy I have is (1) to get something to work with shell, Python, and Ninja on MULTIPLE clouds – that’s been proven to work already. And then (2) make it a little nicer and more integrated with the Oil language. Oil will have the declarative part for the DAG, and the imperative part for the build actions.

                    Basically I’m using Oil’s own continuous build as a motivating case for the Oil language. In my mind, a CI is a parallel shell script – it coordinates disparate tools that you didn’t write, on multiple machines.

                    The other part that’s missing is something like bubblewrap (user space container tool) for enforcing dependencies. Bazel can already use Linux containers (optionally), but I think bubblewrap will work nicely. I wrote something like it 5-6 years ago, but at this point I’d want to reuse the work of others.
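
                    A sketch of how that could look (bwrap and these flags are real; the wrapper itself is hypothetical): only declared inputs get bind-mounted, so an undeclared dependency simply isn’t visible to the build action:

```python
def bwrap_cmd(cmd, inputs, workdir):
    """Build a bubblewrap invocation exposing only declared inputs."""
    args = ["bwrap", "--unshare-all", "--die-with-parent",
            "--proc", "/proc", "--dev", "/dev",
            "--bind", workdir, workdir]     # the output dir is writable
    for path in inputs:                     # declared deps, read-only
        args += ["--ro-bind", path, path]
    return args + cmd
```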

                    Comments/questions on this are welcome! It is a lot of work, but I think everything mostly exists, and there’s some code running already. I could use help! The Oil build is big and a very concrete use case. It has dozens of tasks and more than 10K lines of shell.


                    • I commented about this a few months ago on this Travis CI shutdown thread, but unfortunately the blog post and coding on it got derailed. However I do expect to work on this simply because Oil needs a continuous build! That requirement is never going away, so it will happen “eventually”.
                    • Ninja and OSTree – I liked this strategy, and it appears to be deployed in a company for a related use case: building embedded device images. It’s less work than building something like Bazel, which is huge.

                    Bazel also has the deficiency that it encourages you to rewrite your whole build system in its language. This works in a corporate setting but it doesn’t work in open source.

                    Bazel is too fine-grained; I think we want something coarse grained based around containers instead. Two reasons for that claim: (1) you don’t want computing build metadata to be slower than doing a build from scratch, and (2) fine-grained parallelism works better over LAN than WAN. I want something that will work across multiple clouds and slower networks, not something that assumes a tightly coupled cluster.

                    1. 5

                      CI was always a build system for me, nothing else. It’s just old wine in a new bottle.

                      1. 5

                        I appreciate that the author described CI platforms as “remote code execution as a service”, and I generally feel like CI and build systems both fall into the bucket of tasks that reduce to a job scheduling system. I’ve seen at least one company abuse Jenkins in this fashion, building web UIs for build, CI, cron, and many other tasks that just submitted Jenkins jobs. And I expect to eventually find some CI system that’s just a UI on top of the Kubernetes Job API.

                        A corollary to this is that any sufficiently generic job scheduling service will eventually be used (or abused) for CI.

                        In my case, working in scientific computing, I’ve frequently observed developers running their CI jobs on actual supercomputers simply because they had an easy-to-use API for submitting parallel jobs. Completely failing to use the expensive high-speed low-latency network or the many-petabyte storage system, and ignoring all the scientific simulation jobs getting queued because they wanted to trigger a new 10-minute CI run on every commit… 😉

                        1. 4

                          I don’t get it, the article starts by saying CI systems are too complex, and then goes on to recommend Taskcluster, which sounds even more complex.

                          1. 1

                            I read that differently: the author goes on to say that it’s clearly for power users and they wouldn’t recommend it for most people.

                          2. 4

                            I’ve had 2 ideas around CI systems:

                             1. Make a super-opinionated system based around Bazel. Basically your code builds in Bazel, and the CI just runs the build and run targets. Caching is pretty much solved out of the box because of Bazel, and multi-language is supported to a degree. This is similar to the post.
                             2. Create a flexible system using Lua as config. All these YAML languages end up being a PITA, so why not just use an actual language that can be sandboxed? Handling a DAG of tasks is just running some coroutines. Obviously it’s not super simple, because you also want visualization and debugging, but at least you have actual IDE support.
                            1. 2

                              (1) is how Google’s presubmit works. It’s been effectively configuration-free in my experience.

                              The insight here is that the build system, whether Bazel or something else, already has a dependency graph. If the CI can reuse this dependency graph, and if you add test targets as well as build targets, you get very granular caching and can also skip unaffected tests on each run.

                               The thing that makes this work is the exhaustive declaration of all dependencies in Bazel (“hermeticity”). I’m not sure that this would work with something like Make, where incorrect dependencies lead to “nothing to be done for …” rather than an immediate build failure.
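
                               For example, the “skip unaffected tests” part can be phrased as a reverse-dependency query over the build graph; a sketch (rdeps and kind are real Bazel query functions; the exact pattern here is illustrative):

```python
def affected_tests(changed_files):
    """Build a Bazel query selecting test targets that transitively
    depend on the changed source files (labels are hypothetical)."""
    files = " + ".join(changed_files)
    return f'kind(".*_test rule", rdeps(//..., {files}))'
```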

                              (2) sounds sort of like xmake. I haven’t tried it myself though.

                              1. 1

                                 (1) Yeah, that’s similar to how it worked at Amazon as well with brazil ws --dry-run, which also ran automatically when you submitted a code review. It would just run the default build and test targets. The part where this gets a bit trickier with Bazel is handling integration tests that require external services, e.g. Postgres or Redis. You still need some way of defining that, whether it’s a docker-compose config, a NixOS server base, or something else entirely. That also breaks the hermetic nature of the tests, since you can e.g. forget to clean up the DB in your tests.

                                 (2) Oh, that looks interesting, I’ll have to take a look. Now that I’ve thought about it a bit more, I’m wondering how to avoid it ending up like Gradle or Jenkins. Both have too big an API surface area, and Jenkins in particular suffers from being difficult to reproduce locally due to plugins and manual configuration. The other big issue there is plugin conflicts due to the Java classpath. I think Lua avoids some of these problems, since it can be embedded in a binary and requires explicit imports of other files. I think some other problems can be avoided to some extent by ensuring it can only be configured via code and being more batteries-included.

                              2. 1

                                I mentioned a couple deficiencies of Bazel in a sibling comment here.

                                 It works pretty well for a tightly coupled corporate codebase, and particularly C++, but I don’t think it works that well in open source. Or even for most small/medium companies, which tend to have more heterogeneous codebases that “delegate” to many tools/libraries.

                                 For example, many people will want to just use NPM in a container. Last I checked (and it’s been a while), if you want NPM-like things in Bazel you’ll be in for a rude awakening. Most package managers conflict heavily with its model.

                                1. 2

                                   Yeah, when I envisioned my Bazel-based CI it was specifically for Java server applications, which have the best Bazel support besides C++. For Java servers, you’re just taking dependencies on other projects and libs rather than being depended on directly. This idea was partially due to my frustrations with Gradle, which makes my head spin every time I look at the API docs.

                                   I think the other important piece is that when you focus on a single language, you can more easily do what’s mentioned elsewhere in the thread, where you have a super tight end-to-end integration. You can have code coverage, linting, format checking, and deployment (to some extent) working without needing to set it up yourself.

                              3. 3

                                creates vendor lock-in as users develop a dependence on platform-proprietary

                                Aaaaand that’s why

                                 It felt like drone.io was a step towards having something that could possibly be a standard, though I haven’t closely followed it. Open source, interoperability, and a config that’s a sub/superset of docker-compose all seem good.

                                 Still requires a “PhD in YAML” though, as another poster mentions.

                                 It seems to have been commercialized, but I’m not familiar with the company.

                                1. 3

                                   The problem I have with modern CI systems personally is deciphering them. If you just want to understand how a project is built and tested, working back from its unique CI config is often complex and confusing. In some ways I would much rather see a build.sh or build.bat file that runs both on CI and on my own machine. For example, setting environment variables for config via a build matrix is not terrible for configuring CI, but it could equally well be done in a shell script which I can then run either on CI or locally.
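                                   A sketch of what that could look like: a single entry point where the CI “matrix” is just environment variables with local defaults (the variable names, modes, and flags are hypothetical).

                                   ```shell
                                   #!/bin/sh
                                   # build.sh -- the same entry point for CI and local runs.
                                   # The CI matrix sets BUILD_MODE / DB_HOST; locally the defaults apply.
                                   set -eu

                                   BUILD_MODE="${BUILD_MODE:-debug}"
                                   DB_HOST="${DB_HOST:-localhost}"

                                   echo "building in $BUILD_MODE mode against db at $DB_HOST"

                                   case "$BUILD_MODE" in
                                     debug)   FLAGS="-race" ;;
                                     release) FLAGS="-trimpath" ;;
                                     *)       echo "unknown BUILD_MODE: $BUILD_MODE" >&2; exit 1 ;;
                                   esac

                                   # The actual build/test commands would go here, e.g.:
                                   # go build $FLAGS ./... && go test ./...
                                   echo "flags: $FLAGS"
                                   ```

                                   The CI config then shrinks to “for each matrix entry, export the variables and run ./build.sh”, and a failing matrix entry can be reproduced locally with one export.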

                                  1. 2

                                    I posit that a sufficiently complex CI system becomes indistinguishable from a build system

                                    And a follow-on to this is that eventually the secrets of how to build software become sprinkled across disparate systems so fixes/adjustments necessitate changes in multiple locations/systems with the resultant combinatorial increase in testing all the things.

                                    1. 1

                                       Honestly, I never knew that a build system and a CI system were supposed to be different things.

                                      Isn’t continuous integration just about having a good enough build system to easily re-run everything?

                                      1. 1

                                         I used to work on codeship.com and came to a very similar conclusion. I wanted a system which just executed DAGs of arbitrary work, where a task specified a runtime required to execute it (like a handler) and worker nodes could join to execute these tasks. Nodes would only claim tasks whose runtime they could support, and would cooperate to ensure the constraints of the DAG were upheld in terms of execution order and the results of dependent tasks.
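                                         A toy sketch of that scheme, assuming only POSIX tools (the task names and runtimes are made up): tsort turns the edge list into a dependency-respecting order, and a worker only claims tasks whose runtime it supports.

                                         ```shell
                                         #!/bin/sh
                                         # Toy model: DAG edges ("X Y" = X before Y), a runtime per task,
                                         # and one worker that only claims tasks it can execute.
                                         set -eu

                                         SUPPORTED_RUNTIMES=" shell docker "   # what this worker offers

                                         runtime_of() {                        # hypothetical task -> runtime table
                                           case "$1" in
                                             clone|report) echo shell ;;
                                             build|test)   echo docker ;;
                                           esac
                                         }

                                         # tsort yields a valid execution order for the DAG.
                                         ORDER=$(printf '%s\n' "clone build" "build test" "test report" | tsort)

                                         for task in $ORDER; do
                                           rt=$(runtime_of "$task")
                                           case "$SUPPORTED_RUNTIMES" in
                                             *" $rt "*) echo "claimed $task ($rt)" ;;
                                             *)         echo "left $task for another worker" ;;
                                           esac
                                         done
                                         ```

                                         The real problem is of course the coordination between many workers (claiming atomically, propagating results, retries), which is where a project like adagio earns its keep.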

                                         My feeling was someone could solve their CI needs by bringing their own nodes with the task “runtimes” they desired, to resolve their CI graph.

                                         This became my little hack project https://github.com/georgemac/adagio (like all good projects, I burnt time before writing code giving it a stupid punny name: a DAG io).

                                         I lost a bit of steam with it recently, but I still think about it every now and again. I got a bunch of the features I wanted implemented, like the ability to describe retries in the face of different error scenarios. Nice to see this post go by and get me thinking about it again.

                                        1. 1

                                          Just throwing this out here as it may interest some of you: https://github.com/kristapsdz/minci

                                          1. 1

                                             My company’s product is available for pretty much all major Linux distributions as well as Windows. It is not a cloud-native application and consists of a server and a client component. So for testing the base functionality we basically have to spin up a complete virtual machine; containers are not sufficient because we also need access to raw LUNs and iSCSI stuff (VTL tape).

                                             At the moment I’m running 6 systems based on CentOS/libvirt/KVM, using a dead simple setup with Vagrant which, once the build for one distribution has finished, spins up the required virtual machines to install the latest build and executes the test suite.

                                             The virtual machines are pre-configured using regular shell scripts (via Vagrant provisioning), and the test suite is implemented in Python (PyUnit), currently weighing in at roughly 8000 lines of Python and executing around ~180 test cases for each setup. For some virtual machines we create virtual tape libraries on the fly and attach them to the machine, using quadstorvtl in the backend.

                                             The boxes are spun up using Jenkins and some scripting, with the virtual machine configuration (the Vagrant config) being part of a Git repository. I’m using Sphinx with the docstrings plugin to give the developers a neat link to each test’s documentation and source in case a test run fails. The log files of the test run are saved to a log repository for each build.
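                                             As a rough illustration of that kind of setup (the box name, resources, and script path are hypothetical), the per-distribution Vagrant config could look something like:

                                             ```ruby
                                             # Hypothetical Vagrantfile for one test VM: a roboxes base box on
                                             # libvirt/KVM, provisioned by a plain shell script that installs the
                                             # latest build and prepares the test suite.
                                             Vagrant.configure("2") do |config|
                                               config.vm.box = "generic/centos8"   # roboxes base box (example)

                                               config.vm.provider :libvirt do |lv|
                                                 lv.memory = 4096
                                                 lv.cpus   = 2
                                               end

                                               # Regular shell-script provisioning, as described above.
                                               config.vm.provision "shell", path: "provision/install-latest-build.sh"
                                             end
                                             ```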

                                             As a side project I provide a web-based frontend using Python/Flask which uses the same Vagrant configurations as the CI virtual machines, so the developers can easily spin up a virtual machine with the latest build installed to reproduce any failing tests.

                                             I really have to give big kudos to the roboxes project and Petr Ruzicka for providing such excellent pre-configured Vagrant boxes for all major distributions and especially Windows systems!

                                            https://github.com/ruzickap/packer-templates https://github.com/lavabit/robox