I'm still really curious about a few things here:
"Unfortunately, the complexity of the state in queued builds meant we couldn’t just drop data, but had to rewrite parts of our unresponsive database."
What exactly does that mean? Why were they coupled?
"we reduced capacity [in the load balancer] to throttle the hooks naturally they significantly outnumbered our customer traffic, making it impossible for our customers to reach us and effectively shutting down our site."
How exactly did they implement this? I'm not sure what load balancer setup they have, but I'm surprised they couldn’t deploy hostname- or request-based blocking.
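For what it's worth, here's a rough idea of the kind of request-based throttling I mean. It's a minimal Python sketch, not anything CircleCI says they ran: a token bucket that only applies to webhook requests, so customer traffic is never shed along with the hooks. The `/hooks/` path prefix, the limits, and the 429 response are all assumptions for illustration.

```python
import time
from typing import Callable

# Hypothetical path prefix for inbound webhook deliveries (an assumption).
WEBHOOK_PREFIX = "/hooks/"


class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admit(path: str, hook_bucket: TokenBucket) -> int:
    """Return an HTTP status: webhooks are rate-limited, everything else passes."""
    if path.startswith(WEBHOOK_PREFIX):
        return 200 if hook_bucket.allow() else 429
    return 200


class FakeClock:
    """Manually advanced clock so the demo is deterministic."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        return self.t


if __name__ == "__main__":
    clk = FakeClock()
    bucket = TokenBucket(rate=1.0, capacity=2, clock=clk)
    # Two hooks get through, the third is rejected; a customer request still passes.
    print([admit("/hooks/github", bucket) for _ in range(3)], admit("/app/builds", bucket))
```

The point is just that the limit keys off the request itself rather than off total capacity, so a flood of hooks eats into its own budget instead of everyone's.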
I'm not sure I could do better, and I really don’t want to armchair quarterback; I'm just confused about the circumstances and how they led to the tactics chosen to resolve the issue.
I found two things interesting in this post, though I understand that it’s the only thing I’ve read about CircleCI, so it’s not enough to make firm statements about anything. A bit ranty.
MongoDB.
Hey Coda, I understand it doesn’t have a great reputation, but I’m not too familiar with it. What about Mongo specifically would make a queue perform really badly, or affect more than just that one table? Do you have any idea whether Mongo was the reason they couldn’t just drop queued builds on the floor?