1. 73

  2. 24

    I suggested doing this several times at multiple jobs I had, always to be met with a negative reaction from coworkers who wanted their bloated build servers because “that’s how it should be”.

    Since I first saw Jenkins more than a decade ago, featuring forms with hundreds of inputs, it has puzzled me, and people hate the question: so it’s running a script on commit?

    Thank you for doing this.

    1. 10

      This was fun to come up with and I believe it’s a classic case of demystifying by building it yourself. Of course this is not a Jenkins replacement, but it makes it very tangible how to get something that’s good enough. All in all, the CI discussion nowadays seems to revolve more around people/companies wanting to pay someone to manage their servers and services rather than doing it themselves.

      1. 7

        I think this is pretty neat, and something I may use for my own projects, but to be honest I’d probably have a negative reaction if someone would suggest using this at $dayjob, especially if there are more than a few developers.

        In my experience, a lot of devs just aren’t that familiar with this kind of devops-y stuff, aren’t very good at really “grokking” shell scripting, and kind of hate dealing with this kind of thing.

        It’s just another server to maintain, and if something goes wrong another script to debug. On the face of it, there are probably a few things I can think of that can go wrong in the current version of this script, leading to inconsistent states. And what if you want/need one of the more advanced features Travis offers, like testing against multiple Go/Python/Ruby/… versions or running multiple builds in parallel? Also, this script isn’t “stateless” and some sort of state on the server left over from a previous build (or modified manually!) could influence the build in all sorts of ways.

        I guess what I’m mostly missing is something that sits in-between this and Jenkins and the like. I looked at GitHub actions a while ago, and the entire thing just seemed like “shell scripting with YAML syntax” to me, which didn’t exactly impress me. Not that I think shell scripting is wonderful, but this is IMO worse.

        1. 10

          I think a huge contributor to this problem is that, to be honest, ops tooling is awful. People are rarely writing rigorous testing around their stuff (orders of magnitude less rigor than what you see in application development), there’s a huge amount of hand waving and mysticism, and so many tools are amazingly painful to integrate with other tools.

          There’s a huge space available for “rails but for ops” (or maybe “python but for ops”?), and it sure isn’t k8s.

          I personally wish this stuff were easier to manage and maintain. k8s/docker is totally “wrap the ball of mud in a box”, which is great if you accept you have to have a ball of mud. But I feel like we can do better.

          1. 2

            Isn’t that basically what Heroku and Dokku are meant to be? I’ve never used either so I don’t know how it ends up in practice.

          2. 3

            Then they want GitHub/GitLab integration and radiator views and all of this ends up reimplemented in Python until it converges toward BuildBot…

            I hate Jenkins as much as the next guy, but agree that this is likely too simplistic for big use. For personal projects, a cool experiment, though running tests locally may just be enough.

            (The worst thing about Jenkins is its tendency to run all Git branches against a newly-added job. I’m pretty sure there’s a 12-year-old-or-so bug report about it.)

            Having said that, I loved reading this piece as an LFS-type “this is where it all starts” thing!

            1. 5

              Well, from a startup / small team perspective, you could implement this as a quick way to get started with CI until you grow out of it. There’s nothing wrong with writing bespoke software for a specific need if it takes less time than wrangling some “off the shelf” solution. As long as you’re willing to throw it away later when you’ve outgrown its limitations, rather than digging in and reimplementing the world.

              1. 2

                There’s nothing wrong with writing bespoke software for a specific need if it takes less time than wrangling some “off the shelf” solution.

                It almost always does. Set up one of those off-the-shelf build servers, and it will take you a couple of weeks before people start asking for whole days, sometimes weeks, to set up feature-of-the-day on build-server-du-jour.

                It’s just that those who sink time into their fancy CI admin panel until they get arthritis in their index finger lack the awareness that actually and properly learning shell scripting would have a much greater time ROI. Because they end up never doing it.

                Notice how pretty much all the criticism of this solution present on this very story here on Lobsters could easily be addressed with the obvious solutions listed at the end of the article.

        2. 11

          Neat! You could easily eliminate the Redis dependency by using files.

          The post-commit hook could add a build job by creating e.g. jobs/pending/UUID.txt with the job spec. Each build runner would watch that directory for changes, and claim the job with the earliest last-modified time by moving it to e.g. jobs/running/UUID.txt. It would run the job spec, and finish the job by moving the file to jobs/{success,failed}/UUID.txt.

          If you’re reasonably careful, this would support arbitrarily many build runners. Build status would be find jobs/ -name UUID.txt. Jobs in the pending directory have a queue wait time of the delta between the last-modified time and the current time. You’d get easy job output streaming: the build runner would tee -a UUID.txt the output of all of its commands, and watchers would tail -F UUID.txt.

          Simple job success/fail notifications could be implemented via a trap in the job spec. More sophisticated notifications could be implemented by watching the job output stream, and triggering events on e.g. regexp matches.

          A reaper could watch for any jobs in the running directory that hadn’t been modified in awhile, and weren’t open by any active process, and move them to the failed directory, or maybe back into the queue.
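
          The claim/finish cycle above can be sketched in a few lines of portable shell (directory names as in this comment, everything else illustrative). The key property is that mv(1) within a single filesystem is atomic, so two runners racing for the same job file see exactly one winner:

```shell
# Set up the state directories.
mkdir -p jobs/pending jobs/running jobs/success jobs/failed

# Claim the pending job with the oldest mtime by moving it into running/.
# If two runners race, one mv succeeds and the other fails harmlessly and
# tries the next job. (Assumes simple job file names like UUID.txt.)
claim_job() {
    for job in $(ls -tr jobs/pending 2>/dev/null); do
        if mv "jobs/pending/$job" "jobs/running/$job" 2>/dev/null; then
            printf '%s\n' "$job"
            return 0
        fi
    done
    return 1   # nothing pending
}

# Finish a job by moving it into success/ or failed/.
finish_job() {
    mv "jobs/running/$1" "jobs/$2/$1"
}
```

          A runner would loop: claim_job, run the job spec while appending its output to the job file with tee -a, then finish_job with success or failed depending on the exit status.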

          1. 8

            My first thought was “why not use files?” as well, but the biggest concern I’d have with it is that file locking and atomicity are kind of hard on unix systems, and something very few people fully understand in all its nuance (I sure don’t). You need to be very careful with this, especially once you start implementing some of the things in the “Possible further improvements” list.

            All in all, it seems that Redis is simple and light-weight enough, and it eliminates a source of bugs that’s usually confusing and hard to track down. I’m all for avoiding dependencies, but not to the point of making things stupid light.

            1. 7

              What I suggested doesn’t require locking, just atomic mv, so keep the jobs/ tree on a single volume and you’re golden.

              Redis is lightweight compared to other stateful servers, but it’s extremely heavyweight compared to a filesystem, and opens the system up to entirely new classes of risk.

              1. 1

                Off-topic but I appreciated your “stupid light” link. I’m moving to Colorado tomorrow and it helped me think through / validate some gear choices I’m making.

              2. 3

                Indeed, file-per-job was my first approach, alluded to in “strictly speaking this can be achieved without additional software”. But Redis felt like a very reasonable choice, especially in terms of thinking about scaling this to more than one worker machine, or just building on a machine that’s distinct from the git server.

                Your design is clean and straightforward, of course. And the point about build output streaming is very good, as is the point about adding a reaper for stale builds. I guess it’s just a fun problem space to think about, so many interesting things to solve.

                1. 5

                  Build farms would be easy: have the runners on the git server run all their job specs via ssh to a random host in the farm. Everything else works transparently.
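
                  Under that assumption (hostnames and the job-spec path below are invented), the fan-out really is a couple of lines: pick a host, then stream the job spec to a shell on it over ssh, so output streaming keeps working transparently:

```shell
# Hypothetical list of build hosts in the farm.
HOSTS="build1 build2 build3"

# Pick one host pseudo-randomly; good enough for spreading load.
pick_host() {
    set -- $HOSTS
    shift $(( $(date +%s) % $# ))
    printf '%s\n' "$1"
}

# Stream a local job spec file to a shell on the chosen host. Stdout and
# stderr come back over the ssh connection, so tee/tail still work.
run_remote() {
    ssh "$(pick_host)" sh -s < "$1"
}
```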

                  It’s always good to avoid dependencies, if you can. Especially runtime dependencies. In this case Redis is serving as a source of truth for key-value data with a few simple access patterns, which is more or less the definition of a filesystem ;)

                  1. 5

                    I like your ssh idea, simple and elegant.

                    Instead of plain files, which others have pointed out can have surprising edge cases or race conditions, why not SQLite? After all, SQLite competes with fopen(). And it’s relatively easy to use from shell scripts, much like the redis-cli examples from the original article.

                    1. 6

                      SQLite would be a good choice in terms of taking care of any locking / data race issues and because some sort of results backend is needed anyway.
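
                      As a sketch of how that could look from a shell (schema invented for illustration; the UPDATE … RETURNING form needs SQLite 3.35 or newer, and real code would bind parameters rather than interpolating into the SQL):

```shell
DB=ci.db

# One row per job; the DEFAULTs give us pending state and a timestamp.
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS jobs (
    id      TEXT PRIMARY KEY,
    state   TEXT NOT NULL DEFAULT 'pending',
    updated INTEGER NOT NULL DEFAULT (strftime('%s','now'))
);"

enqueue() { sqlite3 "$DB" "INSERT INTO jobs (id) VALUES ('$1');"; }

# Claim the oldest pending job in a single atomic statement.
claim() {
    sqlite3 "$DB" "UPDATE jobs SET state = 'running'
                   WHERE id = (SELECT id FROM jobs WHERE state = 'pending'
                               ORDER BY updated LIMIT 1)
                   RETURNING id;"
}

finish() { sqlite3 "$DB" "UPDATE jobs SET state = '$2' WHERE id = '$1';"; }
```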

                      1. 1

                        Why introduce a large dependency if you can avoid it? There are no edge cases or race conditions in the system I’ve described.

                        1. 4

                          You would need to take extreme care to never create a job file outside of the initial push hook, only move and append to an existing file. Accidentally duplicating a job into multiple states would be bad. And guarantee a single writer at all times. Interleaving output from multiple workers would corrupt your state files.

                          Shells don’t make these things easy to accomplish in scripts, not with usual output redirection anyway. Even if you’re not using a shell, but a full featured programming language, you’d always need to be careful about such consistency details whenever interacting with your file tree.

                          SQLite has primary keys, foreign keys, and transactions, i.e. robust tools for managing consistency. It’s hardly a large dependency, and does not require a persistent server.

                          1. 1

                            perhaps managing the job files in their own git repo would address this…

                          2. 2

                            Not really an edge case or race condition, but if your script does the BLPOP and then does not complete (machine dies, script crashes, network glitches, etc.), that job is lost forever - there’s no way for another runner to pick it up as incomplete, which there would be with a more persistent store like SQLite.
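
                            Worth noting that Redis documents a mitigation for exactly this, the reliable queue pattern: BLMOVE (BRPOPLPUSH before Redis 6.2) atomically moves the job into a per-worker processing list instead of popping it into thin air, and a reaper can re-queue anything stranded there by a dead worker. A sketch, with illustrative list names and a hypothetical run_build:

```shell
# Pop-and-park: the job always lives in some list, never only in flight.
process_one() {
    job=$(redis-cli BLMOVE jobs jobs:processing LEFT RIGHT 0) || return 1
    if run_build "$job"; then
        # Success: drop the job from the processing list.
        redis-cli LREM jobs:processing 1 "$job"
    fi
    # On failure (or a crash before this point) the job stays in
    # jobs:processing, where a reaper can find and re-queue it.
}
```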

                            1. 2

                              Couldn’t you catch this by inspecting the contents of jobs/running? It’s not clear to me how SQLite would give you more information than queued/running/completed.

                              I agree SQLite would be a reasonable way to store this information though. It might be more portable than relying on filesystem metadata and would certainly make it easier to run analytics.

                              1. 1

                                My suggestion involves no persistent processes like Redis.

                                1. 1

                                  Yeah, I mis-replied.

                          3. 1

                            I once built something similar, file-based, relying on NFS for distribution. I don’t remember anymore what happened to it. The Kerberos token timeout was ugly, but it worked.

                        2. 3

                          Another approach I’ve used in a small project with infrequent pushes is to use the batch command to queue the job to run later, when load is low. An upside is built-in notification through the standard mail facility.
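
                          For anyone who hasn’t used it: batch(1) comes with the at package and runs the queued job once the load average drops below a threshold, mailing the job’s output to the submitting user. Wrapped for a hook (the repo path in the example is illustrative):

```shell
# Defer a build until system load is low; batch(1) mails the job's
# output to the submitting user when it finishes.
queue_build() {
    printf '%s\n' "$1" | batch
}

# e.g. from a post-receive hook:
# queue_build "cd /srv/myrepo && ./run-tests.sh"
```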

                          1. 1

                            Very nice! Thanks for mentioning it, this totally fits the pattern I was going for.

                          2. 3

                            IMO a better architecture would be to use a named pipe; the post-commit hook writes to it, and the job server reads from it and spawns a job in response.
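
                            The server side of that can be sketched as follows (FIFO path and handler are illustrative). One caveat: opening a FIFO for writing blocks until a reader exists, so the job server has to be running before the hook fires:

```shell
# Read job ids line-by-line from a FIFO and hand each one to a handler.
# The read loop sees EOF once all writers close their end, so a real
# server would wrap serve_jobs in an outer `while true` loop.
serve_jobs() {
    pipe=$1 handler=$2
    while IFS= read -r job; do
        "$handler" "$job"
    done < "$pipe"
}

# The hook side is then just: echo "$new_sha" > /path/to/ci.fifo
```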

                            1. 1

                              You wouldn’t want the hook to block waiting for a runner; can you make a named pipe with a buffer, so you could write without a reader?

                              1. 1

                                You could use systemd socket activation with Accept=yes on a TCP socket. This would automatically handle all incoming connections and provide proper isolation primitives (through proper configuration of the service). Seems like a win-win to me, with just 3 files needed to handle everything (ci.socket, ci@.service, and a shell script for running tasks).
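
                                For concreteness, that might look roughly like this (port, paths, and hardening options invented for illustration; with Accept=yes the service unit must be a template, ci@.service, and each connection arrives on the spawned instance’s stdin/stdout):

```ini
# ci.socket -- one ci@.service instance is spawned per connection
[Socket]
ListenStream=127.0.0.1:9090
Accept=yes

[Install]
WantedBy=sockets.target

# ci@.service -- the connection is passed as stdin/stdout
[Service]
ExecStart=/usr/local/bin/ci-run-job.sh
StandardInput=socket
DynamicUser=yes
ProtectSystem=strict
```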

                            2. 1

                              Unix fifos would be even cooler here.

                            3. 2

                              I really enjoyed reading this and it’s gotten me thinking quite a bit. Thanks!

                              1. 2

                                I’m glad. Thanks for your feedback!

                              2. 2

                                I’m not convinced this is a good solution. It still doesn’t provide the visualizations that something like Jenkins offers out of the box.

                                1. 7

                                  This is not meant to be a complete solution, but more of a pattern. I like to take things apart and show that at their core they are not that complicated. Obviously this has a very poor UI, because I reckon it is not likely you can rewrite Jenkins in 20 lines of bash. Some people don’t need Jenkins, but would rather have an elegant way to self-host a repository and have some tests run against it when they push, maybe trigger a deployment when the tests succeed. It does have advantages as well: it’s relatively easy to understand and modify. The constraints are also quite clear. And then you are scripting this in your programming language of choice.

                                2. 1

                                  This inspired me to throw together something to publish notifications on a pubsub system; this one is bash, publishes to NATS using the nats CLI tool, needs jq and is running under Gitolite; for other systems, you’ll need a different way to get $GL_REPO (effectively, the repo slug) (and also $GL_USER):


                                  Should be trivial to adapt to MQTT or whatever; if you need authentication, add it to the NATS_ARGS array. This was a quick hack which I’m now running at home, having put it into my gitolite admin repo’s local/hooks/ area (which is a disabled-by-default feature with security implications if you don’t trust the writers to that repo with shell access).
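
                                  The script itself isn’t reproduced here, but a hypothetical sketch of such a hook could look like this (subject name invented; gitolite provides $GL_REPO and $GL_USER, and a post-receive hook gets one “old new ref” line per updated ref on stdin):

```shell
# NATS_ARGS was a bash array in the original; a plain string here for
# portability. e.g. NATS_ARGS="--creds /path/to/user.creds"
NATS_ARGS=""

publish_pushes() {
    while read -r old new ref; do
        # Wrap the push as a compact JSON object with jq.
        msg=$(jq -cn --arg repo "$GL_REPO" --arg user "$GL_USER" \
                     --arg old "$old" --arg new "$new" --arg ref "$ref" \
                     '{repo: $repo, user: $user, old: $old, new: $new, ref: $ref}')
        # Word-splitting of $NATS_ARGS is intentional here.
        nats $NATS_ARGS pub "git.push.$GL_REPO" "$msg"
    done
}

# Hook body: stdin is the list of updated refs.
# publish_pushes
```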

                                  1. 1

                                    Nice idea and pretty straightforward, but I’ve really loved the concept of running CI in containers since I first used Drone, and I think containers might still add so much complexity here that you might as well just use a bigger solution.

                                    1. 1

                                      I do mention containers in the sandboxing section at the end. I’m not sure it would be such a big overhead, it’s certainly achievable.

                                      While Drone is great, I’m a bit uneasy about them having been acquired recently.

                                      1. 1

                                        Walking the line between “cloning inside the container”, “mounting the repo inside the container” and whatever else is not so easy. There’s a reason Gitlab CI has some quite complicated caching logic for that.