1. 77

  2. 17

    Fantastic article. Computers are indeed really fast, and if you structure the system around the problem while minimizing assumptions then, as the article says, you can do innovative things.

    Now I’m inspired to write about the autoscaling CI system we built at Zenefits, running on AWS spot instances, that would grind through 20k+ tests for every pull request.

    1. 3

      As someone who is currently reworking a CI system, I would love to read something like that!

      1. 4

        I’ll look into moving it up the post queue, but I’m not making any promises.

        What are the constraints and requirements for the work you’re doing?

        1. 3

          Nothing nearly as complicated. It’s a Node.js monorepo with an old Jenkins instance that is very slow and crash-prone. Mostly I just want to make it reliable (and maybe a bit faster) and add pull request building.

          The bit about the AWS spot instances is what caught my eye. I have been looking into them to allow us to have beefier build agents without spending way too much money.

          1. 5

            Makes sense. I think Jenkins has a plugin or two for controlling spot instances. Something to look into if you’re considering spot instances for cost and performance reasons (assuming you haven’t yet).

            One thing we did at Zenefits that was good from a performance perspective was pre-baking the build agent images with snapshots of the code and all the other required bits and pieces, so as soon as an agent came online it could do a git pull and be off to the races (rough sketch below).

            The other thing was using LXC containers for isolation so we could pack more build agents per VM. The controller for the build agents was Buildkite. I’m not saying you have to use Buildkite, but I can’t recommend those folks enough; they saved us the hassle of building our own agent/workflow controller.
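
            Here’s a minimal sketch of what that pre-baked-image bootstrap could look like. The repo path and branch are made-up examples and the token handling is simplified; this isn’t the actual Zenefits setup, just an illustration of “checkout already baked in, fetch the delta, hand off to the agent.”

            ```python
            #!/usr/bin/env python3
            """Boot-time warm-up for a pre-baked CI agent image (illustrative sketch).

            Assumes the image was baked with a git checkout at REPO_DIR and the
            buildkite-agent binary already installed; the path and branch are
            assumptions, not the real configuration.
            """
            import os
            import subprocess

            REPO_DIR = "/opt/ci/monorepo"  # snapshot baked into the image (assumed path)

            # The checkout already exists, so only the delta since the image was baked
            # has to cross the network -- this is what makes agent startup fast.
            subprocess.run(["git", "-C", REPO_DIR, "fetch", "--prune", "origin"], check=True)
            subprocess.run(["git", "-C", REPO_DIR, "reset", "--hard", "origin/main"], check=True)

            # Hand off to the agent; the registration token is expected in the
            # environment (BUILDKITE_AGENT_TOKEN) or the agent's own config file.
            os.execvp("buildkite-agent", ["buildkite-agent", "start"])
            ```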

      2. 2

        As someone else who works in CI/CD, I too would love to hear some more detailed info about this.

      3. 4

        > So we let it all out. Our devices produced an average of 50 MB of (uncompressed) logs per day, each. For the baseline 100,000 devices that we discussed above, that’s about 5TB of logs per day. Ignoring compression, how much does it cost to store, say, 60 days of logs in S3 at 5TB per day? “Who cares,” that’s how much. You’re amortizing it over 100,000 devices. Heck, a lot of those devices were DVRs, each with 2TB of storage. With 100,000 DVRs, that’s 200,000 TB of storage. Another 300 is literally a rounding error (like, smaller than if I can’t remember if it’s really 2TB or 2TiB or what).

        I don’t follow this at all. How is the storage on the endpoints relevant? Were the centralized logs somehow reflected back out?

        1. 19

          As I understand it, the argument is that if you’re already buying 200,000 TB of storage for the client devices, you’re operating at a scale where you can easily afford another 300 TB of storage for your logs, so you shouldn’t be worrying about the cost of log storage.
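
          To make the back-of-the-envelope numbers concrete (the S3 price below is an assumed ballpark, not a figure from the article):

          ```python
          # Rough check of the numbers in the quoted passage.
          devices = 100_000
          log_mb_per_device_per_day = 50
          retention_days = 60

          logs_tb_per_day = devices * log_mb_per_device_per_day / 1_000_000  # ~5 TB/day
          logs_tb_retained = logs_tb_per_day * retention_days                # ~300 TB

          fleet_storage_tb = devices * 2_000           # 100k DVRs x 2 TB each = 200,000 TB
          print(logs_tb_retained / fleet_storage_tb)   # 0.0015 -> logs are ~0.15% of fleet disk

          # Assumed ~$0.023/GB-month for S3 standard storage: roughly $7k/month for 300 TB,
          # which is small change when amortized over 100,000 devices.
          print(logs_tb_retained * 1_000 * 0.023)
          ```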

        2. 2

          > PRINTK_PERSIST patch to make Linux reuse the dmesg buffer across reboots.

          > Even if you don’t do any of the rest of this, everybody should use PRINTK_PERSIST on every computer, virtual or physical. Seriously. It’s so good.

          Interesting. Looks like the patch was sent 7 years ago.

          > even after a kernel panic or non-panic hard lockup, on the next boot userspace will be able to grab the kernel messages leading up to it.

          Has anyone tried the patch, or do recent kernels have any such feature?
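
          For what it’s worth, the “grab the kernel messages on the next boot” part needs nothing fancy from userspace, assuming a kernel built with the patch: the previous boot’s messages are simply still in the ring buffer. Something run early in boot could archive them (the path is a made-up example):

          ```python
          #!/usr/bin/env python3
          """Early-boot hook sketch, assuming a PRINTK_PERSIST-patched kernel.

          With the patch the printk ring buffer survives the reboot, so the first
          thing userspace runs can save the messages leading up to the previous
          crash before they scroll away. The archive path is illustrative.
          """
          import subprocess
          import time

          ARCHIVE = f"/var/log/prev-dmesg-{int(time.time())}.log"  # assumed location

          # Capture whatever the (persisted) kernel ring buffer still holds.
          dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)

          with open(ARCHIVE, "w") as f:
              f.write(dmesg.stdout)
          ```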