1. 18

I don’t know whether to laugh or cry. I’m amazed, on the one hand, at what people use Jenkins for, but appalled, on the other, that this horrible, awful tool has not yet been replaced. It’s a deeply flawed design, with plugins that break on every minor release, but I guess as a general-purpose hammer it does the job. Sigh.

  2. 10

    This made me wish I could find a suckless.org take on the task Jenkins does: just a web interface for reading the output from and kicking off makefile jobs that can be hooked into cron, without the crazy Java web application complexity frenzy.

    1. 5

      Make is a recipe for complexity, just less well managed. Cron I trust about as far as I can throw it (e.g. its error reporting is rather fragile in practice).

      I think there are pretty much two halves to how Jenkins is used:

      • Executing a script on a schedule or according to some trigger. Rundeck can replace this part (there are probably plenty of similar options, but if your server infrastructure is geared towards “deploy this .war” then Rundeck fits in nicely) and is much simpler.
      • Building jobs while understanding the relationships between them (in particular, “build a snapshot of this project and all downstream projects on every commit”). Sadly, I don’t think anything other than Jenkins has Maven integration that’s as good.
      1. 2

        What do you use instead of cron?

        Edit: Corrected auto-correct correction

        1. 1

          What’s from?

          1. 2

            Stupid auto-correct!

            I meant to say “What do you use instead of cron?”

            1. 2

              Rundeck

    2. 8

      I’ve gotten a lot of mileage out of just treating Jenkins as a distributed shell script executor. With Jenkins Job Builder, actually creating the jobs is really easy, and kicking them off is a curl away. I’m not sure how well it scales, but for what I’ve done it’s quite good. The biggest source of issues I’ve run into is not Jenkins per se, but people who want Jenkins to be more than it is and pile on all the plugins, which makes it nearly impossible to understand what is going on. So if you use Jenkins, avoid plugins unless you really, really need them.
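
      For reference, kicking a job off remotely really is just one HTTP POST. Here’s a minimal Python sketch of the “a curl away” part; the server URL, job name, and credentials are placeholders, and authentication is assumed to use a Jenkins user API token:

          # Queue a build on a Jenkins job via the remote API.
          # URL, job name, and credentials below are placeholders.
          import requests

          JENKINS_URL = "https://jenkins.example.com"
          JOB_NAME = "run-backup-script"        # e.g. a job defined with Jenkins Job Builder
          AUTH = ("ci-user", "api-token")       # username + API token

          # Jenkins answers 201 Created and puts the queue item URL in the
          # Location header.
          resp = requests.post("%s/job/%s/build" % (JENKINS_URL, JOB_NAME), auth=AUTH)
          resp.raise_for_status()
          print("Queued:", resp.headers.get("Location"))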

      Jenkins is still terrible: it’s a huge mess, and it consumes way more resources than is reasonable considering what it does. But, used in moderation, it can be quite effective.

      1. 5

        Emphasizing really strongly that you should keep using what’s already working: if you’re truly just using Jenkins as a shell script executor, then something like Buildbot might be worth a look. The downside is the lack of a nice GUI for configuration; the upside is that the configuration format is much easier to read and edit by hand, and quite extensible. Mozilla and Apple are both big Buildbot users, as is Mercurial, which is how I encountered it. (I also personally find writing Buildbot extensions easier, both because it’s written in Python and because the extension mechanism makes more sense to me than Jenkins plugins, but my Python background may be biasing me really badly there.)
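
        To give a flavor of the hand-editable configuration, here’s a minimal sketch of a Buildbot master.cfg using the newer plugin-style API (the 0.9.x line mentioned elsewhere in this thread); the worker name, password, repository URL, and builder name are placeholders:

            # master.cfg -- minimal sketch: one worker, one builder that runs
            # "make test" on every commit to master. Names and URLs are placeholders.
            from buildbot.plugins import worker, util, steps, schedulers

            c = BuildmasterConfig = {}

            c['workers'] = [worker.Worker("example-worker", "s3cret")]
            c['protocols'] = {'pb': {'port': 9989}}

            build = util.BuildFactory()
            build.addStep(steps.Git(repourl="https://example.com/project.git", mode="incremental"))
            build.addStep(steps.ShellCommand(command=["make", "test"]))

            c['builders'] = [
                util.BuilderConfig(name="tests", workernames=["example-worker"], factory=build),
            ]
            c['schedulers'] = [
                schedulers.SingleBranchScheduler(
                    name="on-commit",
                    change_filter=util.ChangeFilter(branch="master"),
                    treeStableTimer=60,
                    builderNames=["tests"]),
            ]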

      2. 4

        As someone whose entire job revolves around undoing the sins performed with buildbot (a Python CI service similar to Jenkins), this post terrifies me.

        1. 4

          I set up a buildbot a couple years ago, and I’ve frankly been quite pleased with it. Almost all of the maintenance has been on the build slaves, which is what I wanted. I’ve barely touched the build master, but now I’ve come to a point where I need to give it more serious attention.

          Given your experience, what do people get wrong when they implement a buildbot? I certainly have some constructs in my build master I now find baffling.

          1. 11

            (This became way longer than a typical lobsters post. I apologize for the lack of brevity, but hopefully the production insights make up for it.)

            I work on the Chromium CI system, which consists of 84 public masters at the moment. This number is growing, but we are actively building our way toward a buildbot-less system. Most of the pains of buildbot revolve around processing and configuration in the master, the implications of a single-process Python architecture, and a single long-lived TCP connection. From what I understand, some of these issues have been addressed in buildbot 0.9.x, but we outgrew buildbot itself long ago.

            Part one: processing and configuration in the master

            Buildbot typically encourages you to specify your configuration steps in the master config as part of a factory. As the Chromium factory grew in complexity, our team ran into several problems.

            One is that doing anything required a restart of the master process. We had relied on buildbot’s “reconfig” functionality for a bit, but discovered that some of our code would hold pointers to pre-reconfig objects and that generally caused havoc. So adding a compiler flag, adding a builder, and adding or removing a bot all required a complete restart of the master process. It also meant you couldn’t turn things off immediately during an emergency situation.

            Another is that it made configuration changes nearly impossible to test. In order to verify that anything worked, you had to start up an entire frigging master (which we do on presubmit) and simulate a build (which we also do on presubmit) just to see whether the master would work properly. If it doesn’t, reverting takes another restart…

            Finally, if you write any logic inside the master, it becomes impossible for a developer to reproduce without kicking off some kind of build. We used to have things like performance result uploads and test output parsing done directly on the master, and that made them tough to run locally (or to debug new ones being developed). The worst offenders were custom build steps, which guarantee you will not be able to diagnose, reproduce, or debug them in the wild.

            We’ve written a master-side parser called the annotator, which parses annotations directly from the target process’s stdio. These let us add steps to the buildbot waterfall on the fly, and we have built an entirely buildbot-free system called recipes, which (a) run locally with no problem and (b) emit annotations so steps show up on the buildbot. Since these live in the source checkout, they can be updated at any time without a master restart (among many other benefits).
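
            As an illustration (not the exact Chromium vocabulary, which may differ), a recipe-style script can add steps to the waterfall just by printing @@@...@@@ markers to stdout for the master-side annotator to pick up:

                # Hedged sketch of a build script emitting annotator-style markers.
                # Treat the exact marker strings as an assumption.
                import subprocess
                import sys

                def run_step(name, cmd):
                    print('@@@BUILD_STEP %s@@@' % name)   # start a new step on the waterfall
                    sys.stdout.flush()
                    rc = subprocess.call(cmd)
                    if rc != 0:
                        print('@@@STEP_FAILURE@@@')        # mark the current step as failed
                        sys.stdout.flush()
                    return rc

                run_step('compile', ['make', '-j8'])
                run_step('unit tests', ['make', 'test'])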

            Part two: single process architecture

            Because buildbot is written in Python, it can only run one real thread due to the GIL (no matter how much Twisted magic you can muster). This means that the poller, build scheduler, emailer, web interface, everything runs on a single thread. Specifically, the streaming text logs and the root JSON endpoint have the ability to bring a master to its knees. It may be difficult to tell from the low-granularity latency graph, but that master’s response to an (effectively) null HTTP query jumps from 500ms to 3s during the weekday just because of the logs it is shuttling back and forth. Occasionally we have had unsuspecting developers melt our systems by dumping a 250MB Android logcat straight into stdio. With 50+ bots doing that every hour, plus several hundred other bots providing their own logs, it can bring the entire system to a halt.

            Logs are just part of the problem; web access can be tricky too. The root JSON can be quite costly to calculate if you have many builders/bots, and buildbot’s default force-eviction SyncLRUCache has bad properties under certain build access patterns. I’ve personally modified buildbot’s root JSON calculation and cache access patterns so that the master wouldn’t get destroyed under our everyday load.

            Finally, this single-threadedness can affect management of buildbot itself. Admin and monitoring pages become inaccessible, so if the system goes down you’re unable to determine why. Before we moved build-notification emails out of the master process, it would take one of our masters 15 minutes just to email everyone about their downed builds before it would finish shutting down. The frequent need for restarts combined with waiting 15 minutes at a console really adds up.

            Part three: single long-lived TCP connection

            Buildbot uses a single, long-lived TCP connection to talk to its bots. Further, once that TCP connection is closed, the master has no way of reconnecting to a bot running a build. This means the build is effectively lost; the bot will terminate the build and attempt to reconnect to the master. This build termination, combined with the frequent need for restarts described in part one, is the real nightmare: it means you can’t make any system changes until everyone is done using the build server. Our builders typically get around 40 build requests per hour at peak (those are “try” jobs and can’t be merged), so we usually wait until 7pm Pacific to do any maintenance.

            The long-lived TCP connection also means we can’t do intelligent load balancing, failover, or canarying. There is a way to have multiple buildbot masters query off the same build database, but the complexity of routing to each one (and managing them, keeping them in sync) wasn’t worth the cost for us. Finite, terminating HTTP requests go a long way here (and most modern load-balanced systems work this way).

            Final thoughts and recommendations

            I want to point out that I think the buildbot people are fine engineers; we have simply outgrown their product. If you’re on a medium-sized team it does a lot of stuff for you pretty nicely, but if you scale beyond that, this is how it will hurt you.

            That said, some recommendations:

            • Keep your build configuration in source as much as possible.
            • Keep as much out of the master as you can.
            • Keep as much web traffic off the master as you can. We built chromium-build.appspot.com to be a frontend/aggregator, and that has significantly reduced the web load on our masters.
            • Try to upload logs to S3/Google Storage/whatever and print a link to stdio instead of dumping all your logs there. The more master I/O you conserve, the better. (A sketch of this follows the list.)
            • Don’t write custom build step classes.
            • Dump environment and configuration information at the top of stdio so a developer can reproduce a build or step as closely as possible.
            • If you find yourself restarting masters a lot, try to make it as painless as possible. I wrote a declarative master restart system that can schedule and execute master restarts with a single command. It relies on a lot of systems we’ve built, but perhaps you can take its ideas and apply them to your own setup: master manager and master manager launcher.
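
            As a sketch of the log-upload and environment-dump recommendations above: print a short summary of the environment to stdio, push the bulky log to object storage, and emit only a link. The bucket name and the use of boto3/S3 are assumptions; any object store works the same way.

                # Sketch: keep master i/o small by uploading the full log elsewhere
                # and printing just a link to stdio. Bucket/key names are placeholders.
                import os
                import platform
                import boto3

                def dump_environment():
                    print('host: %s' % platform.node())
                    print('python: %s' % platform.python_version())
                    for key in ('PATH', 'CC', 'CXX'):
                        print('%s=%s' % (key, os.environ.get(key, '')))

                def upload_log_and_print_link(path, bucket='example-ci-logs'):
                    key = 'builds/%s' % os.path.basename(path)
                    boto3.client('s3').upload_file(path, bucket, key)
                    print('full log: https://%s.s3.amazonaws.com/%s' % (bucket, key))

                dump_environment()
                upload_log_and_print_link('test_output.log')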

            Hope this was helpful.

            1. 2

              Wow, “helpful” doesn’t do your comment justice. I appreciate you taking the time to write that up. Some responses and things I’ve learned:

              When we started, our build master ran on our server, but our build slaves ran on our laptops. The design where the slaves initiate a connection to the master was life-saving here, as we could shut down and move our machines around and things would reconnect afterward. I became unhappy very quickly with the lack of encryption between the master and slave, which was one of the earliest pieces of work I did on our buildbot. Now, of course, our build slaves are running on servers too, so we’ve outgrown that feature. We never got build teardown working properly, and just added a garbage collection step to the beginning of our build to clean out any cancelled builds.

              This design, along with the threading issues in Python, is also present in SaltStack; really, a lot of what you wrote also applies to that tool. There I still use the slave-initiated handshake, though, for minions (i.e., SaltStack slave machines) that are behind NAT. But I see all the same scalability issues there too, for the same design reasons: it’s difficult to test the daemon, I/O transfer bogs down the system, and minions are randomly unresponsive.

              As to your recommendations, I’ve been very pleased with how little build configuration we put in the master; your recommendation matches my experience. The first bit of master refactoring I did reduced the number of build steps by removing things the build master wasn’t absolutely required to accomplish.

              Your need to stand up a build master during the build is not something I’ve run into, but I have that problem broadly in my infrastructure: system testing is too painful, with bringing up and shutting down daemons being a significant pain point. I’ll certainly look at the patterns you’re using.

              The web traffic recommendation is new to me, but it answers a question I had. My buildbot build logs are OK, but I want to improve them; in my case I need some additional cross-referencing to see where one job ends and another begins, so I can better summarize the build. I had wondered whether to try to make buildbot do that. Since my build assets already use non-buildbot I/O, I’ll follow your recommendation and do the same for build artifacts.

              1. 2

                Two things that may be helpful after reading your post:

                • For web traffic, it’s helpful to log each request as it comes in and as it is processed, to better see what is happening.
                • You’re right that having the bot connect to the master is a better model than the other way around. We took that to its limit with our new swarming architecture. It is a lightweight task queue that works as a stricter and more controllable version of buildbot (but without a lot of its other features, such as polling or a unified build view). One nice thing is that it uses isolate to give each task a content-addressed SHA. This means the entire test can be reproduced exactly, and that swarming can de-duplicate binaries it has already run (a generic sketch of the idea follows this list). Swarming/isolate are the subject of another huge post, and I’m happy to go into detail elsewhere. We use this system inside buildbot for now (using buildbot’s poller and build view, but swarming for task execution), and it’s a good middle step toward getting off buildbot entirely.
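
                A generic sketch of the content-addressing idea (not the actual isolate/swarming code): hash everything that defines a task and use the digest as a cache key, so a task whose inputs haven’t changed can reuse a previous result instead of running again.

                    # Generic content-addressed de-duplication sketch; names are made up.
                    import hashlib
                    import json

                    _results = {}  # digest -> cached result

                    def task_digest(files, command):
                        h = hashlib.sha1()
                        for path in sorted(files):
                            h.update(path.encode())
                            with open(path, 'rb') as f:
                                h.update(f.read())
                        h.update(json.dumps(command).encode())
                        return h.hexdigest()

                    def run_once(files, command, run):
                        digest = task_digest(files, command)
                        if digest not in _results:      # skip tasks we've already run
                            _results[digest] = run(command)
                        return _results[digest]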

                If you want to know more you can reach the infra team at infra-dev@chromium.org!

        2. 1

          I agree with the Jenkins sentiments here – has anyone tried http://concourse.ci/?

          1. 1

            Last time I looked at it (~4 months ago), I just couldn’t grok how it worked or how to set it up. It seemed fairly tied to Cloud Foundry – or at least to how CF is set up (BOSH, et al.) – but I admit to not spending much time on it. I need to take another look at it, though, since I like the thinking behind it.