1. 15

Still really curious about a few things here

Unfortunately, the complexity of the state in queued builds meant we couldn’t just drop data, but had to rewrite parts of our unresponsive database.

What exactly does that mean? Why were they coupled?

we reduced capacity [in the load balancer] to throttle the hooks naturally they significantly outnumbered our customer traffic, making it impossible for our customers to reach us and effectively shutting down our site.

How exactly did they implement this? I’m not sure what LB setup they have, but I’m surprised they couldn’t deploy hostname- or request-based blocking.

Not sure I could do better, and I really don’t want to armchair quarterback - just confused about the circumstances & how that led to choosing various tactics to resolve the issue.
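To make concrete what I mean by hostname- or request-based blocking, here is a minimal sketch of the decision a load balancer rule would encode. Everything in it is an assumption for illustration: the hostname `hooks.example.com`, the `/hooks/` path prefix, and the `admit` function are hypothetical, and nothing here reflects how CircleCI's load balancer is actually configured.

```python
from dataclasses import dataclass

# Hypothetical identifiers for webhook traffic; a real setup would use
# whatever hostname or path the hook sender actually targets.
WEBHOOK_HOSTS = {"hooks.example.com"}
WEBHOOK_PATH_PREFIXES = ("/hooks/",)


@dataclass(frozen=True)
class Request:
    host: str
    path: str


def is_webhook(req: Request) -> bool:
    """Classify a request as webhook traffic by hostname or path prefix."""
    return (
        req.host.lower() in WEBHOOK_HOSTS
        or req.path.startswith(WEBHOOK_PATH_PREFIXES)
    )


def admit(req: Request, shedding_webhooks: bool) -> int:
    """Return the HTTP status to answer with: 200 to pass through, 503 to shed.

    Only webhook traffic is shed, so customer requests still get through
    while the backlog is being worked off.
    """
    if shedding_webhooks and is_webhook(req):
        return 503
    return 200
```

The point of the sketch is that throttling is applied per class of traffic rather than by shrinking total capacity, which is what let the hooks crowd out customers in the quoted description.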

  2. 6

    I found two things interesting in this post. That said, this is the only thing I’ve read about CircleCI, so it’s not enough to make firm statements about anything. A bit ranty.

    1. I found it interesting that they seem to make use of live code patching. What I find funny about this is that they are a company devoted to continuous integration, yet by their own account they bypass it frequently enough to have scripted it. Given that this is a company promoting CI, this should raise red flags. The company is not eating its own dog food. This doesn’t mean CI itself is wrong, but it’s one of those “lead by example” situations.
    2. As described here, the architecture seems to have some serious problems. The operations around the queue appear to be very expensive, which would raise serious red flags for me (having expensive queue operations defeats the purpose of having a queue). They also had no concept of back pressure. Given that every company these days is building a distributed system, why are these not basic primitives that everyone builds on top of? They say that they were working on it and just hit this before they could implement it. This is something I struggle with a lot, though: I read this and feel sad because these particular problems seem like obvious starting points for anyone building a distributed system, but they aren’t. The software industry culture is very much build-now-scale-later, and most people interpret that as build-now-figure-out-how-to-scale-later rather than build-something-scalable-now-and-scale-it-later. I think this impedes progress in the industry. But I don’t think the industry necessarily wants progress, as most people boil this down to making a profit or not. And in the end, I don’t think this downtime will even be visible on CircleCI’s EOY financials, so their approach to building the company will not hurt them. In that sense I am clearly wrong in what I value, based on what a capitalistic economy values. I don’t know where that leaves me.
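For anyone unsure what I mean by back pressure as a primitive, here is a minimal sketch, assuming a simple in-process queue. The `BoundedBuildQueue` class and its method names are made up for illustration and have nothing to do with CircleCI's actual design; the idea is only that a full queue refuses new work and tells the producer so, instead of growing without bound.

```python
import queue


class BoundedBuildQueue:
    """A queue with a hard cap on pending items (hypothetical example)."""

    def __init__(self, max_pending: int) -> None:
        # queue.Queue enforces the cap; put_nowait raises queue.Full at the limit.
        self._q: queue.Queue = queue.Queue(maxsize=max_pending)

    def offer(self, build: str) -> bool:
        """Try to enqueue a build; return False when the queue is full.

        A False result is the back-pressure signal: the caller should
        slow down, retry later, or reject the request upstream.
        """
        try:
            self._q.put_nowait(build)
            return True
        except queue.Full:
            return False

    def take(self) -> str:
        """Remove and return the oldest pending build (raises queue.Empty if none)."""
        return self._q.get_nowait()

    def pending(self) -> int:
        """Number of builds currently waiting."""
        return self._q.qsize()


q = BoundedBuildQueue(max_pending=2)
accepted = [q.offer(name) for name in ("build-1", "build-2", "build-3")]
```

With a cap of 2, the third `offer` is refused until a consumer calls `take`, so overload shows up at the edge as rejected work rather than as an ever-growing backlog.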
    1. 6

      MongoDB.

      1. 3

        Hey Coda, I understand it doesn’t have a great reputation, but I’m not too familiar with it. What about Mongo specifically would make a queue perform really badly, or make the damage spread beyond that one table? Do you have any idea whether Mongo was the reason they couldn’t just drop queued builds on the floor?