Still really curious about a few things here
Unfortunately, the complexity of the state in queued builds meant we couldn’t just drop data, but had to rewrite parts of our unresponsive database.
What exactly does that mean? Why were they coupled?
we reduced capacity [in the load balancer] to throttle the hooks naturally they significantly outnumbered our customer traffic, making it impossible for our customers to reach us and effectively shutting down our site.
How exactly did they implement this? Not sure what LB set up they have, but surprised they couldn’t deploy hostname/request based blocking.
Not sure I could do better, and I really don’t want to armchair quarterback - just confused about the circumstances & how that led to choosing various tactics to resolve the issue.