1. 47
    1. 29

      Why Twitter didn’t go down … yet

      I was hoping for some insights into the failure modes and timelines to expect from losing so many staff.

      This thread https://twitter.com/atax1a/status/1594880931042824192 has some interesting peeks into some of the infrastructure underneath Mesos / Aurora.

      1. 12

        I also liked this thread a lot: https://twitter.com/mosquitocapital/status/1593541177965678592

        And yesterday it was possible to post entire movies (in few-minute snippets) on Twitter, because the copyright enforcement systems were broken.

        1. 5

          That tweet got deleted. At this point it’s probably better to archive tweets and post links to the archived copies.

          1. 11

            It wasn’t deleted - there’s an ongoing problem over the last few days where the first tweet of a thread doesn’t load on the thread view page. The original text of the linked tweet is this:

            I’ve seen a lot of people asking “why does everyone think Twitter is doomed?”

            As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks.

            1. 12

              It wasn’t deleted - there’s an ongoing problem over the last few days where the first tweet of a thread doesn’t load on the thread view page.

              It’s been a problem over the last few weeks at least. Just refresh the page a few times and you should eventually see the tweet. Rather than the whole site going down at once, I expect these kinds of weird problems will start to appear and degrade Twitter slowly over time. Major props to their former infrastructure engineers/SREs for making the site resilient to the layoffs/firings though!

              1. 2

                Not only to the infra/SREs but also to the backend engineers. Much of the built-in fault-tolerance of the stack was created by them.

          2. 2


            I have this URL archived too, but it seems to still be working.

          3. 1

            hm, most likely someone would have a mastodon bridge following these accounts RT-ing :-)

        2. 2

          FWIW, I just tried to get my Twitter archive downloaded and I never received an SMS from the SMS verifier. I switched to verify by email and it went instantly. I also still haven’t received the archive itself. God knows how long that queue is…

          1. 2

            I think it took about 2 or 3 days for my archive to arrive last week.

      2. 2

        oh, so they still run mesos? thought everyone had by now switched to k8s…

        1. 13

          I used to help run a fairly decent sized Mesos cluster – I think at our pre-AWS peak we were around 90-130 physical nodes.

          It was great! It was the definition of infrastructure that “just ticked along”. So it got neglected, and people forgot about how to properly manage it. It just kept on keeping on with minimal to almost no oversight for many months while we got distracted with “business priorities”, and we all kinda forgot it was a thing.

          Then one day one of our aggregator switches flaked out and all of a sudden our nice cluster ended up partitioned … two, or three ways? It’s been years, so the details are fuzzy, but I do remember

          • some stuff that was running still ran – but if you had dependencies on the other end of the partition there were lots of systems failing health checks and trying to get replacements to spin up
          • Zookeeper couldn’t establish a quorum and refused to elect a new leader so Mesos master went unavailable, meaning you didn’t get to schedule new jobs
          • a whole bunch of business critical batch processes wouldn’t start
          • we all ran around like madmen trying to figure out who knew enough about this cluster to fix it

          It was a very painful lesson. As someone on one of these twitter threads posted, “asking ‘why hasn’t Twitter gone down yet?’ is like shooting the pilot and then saying they weren’t needed because the plane hasn’t crashed yet”.
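
          For anyone wondering why the Zookeeper bullet above is fatal for scheduling: an ensemble only elects a leader when a strict majority of its configured voters can reach each other. A minimal sketch of that rule follows; the 5-node ensemble and the partition splits are made-up numbers for illustration, not the layout of the cluster described above.

```python
def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """A ZooKeeper-style ensemble needs a strict majority of its voters."""
    return reachable > ensemble_size // 2


def surviving_partitions(ensemble_size: int, partitions: list[int]) -> list[int]:
    """Return the indexes of partitions that can still elect a leader."""
    if sum(partitions) > ensemble_size:
        raise ValueError("partitions contain more voters than the ensemble")
    return [i for i, n in enumerate(partitions) if has_quorum(ensemble_size, n)]


if __name__ == "__main__":
    # Hypothetical 5-node ensemble split three ways: no side holds 3 voters.
    print(surviving_partitions(5, [2, 2, 1]))  # []
    # Split two ways: the side with 3 voters keeps working.
    print(surviving_partitions(5, [3, 2]))     # [0]
```

          With a three-way split of a small ensemble, no side can reach a majority, which matches the "no leader, no new jobs" symptom.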

        2. 8

          Twitter is well beyond the scale where k8s is a plausible option.

          1. 2

            I wonder what is the largest company that primarily runs on k8s. The biggest I can think of is Target.

            1. 3

              There’s no limit to the size of company that can run on kube if you can run things across multiple clusters. The problem comes if you routinely have clusters get big rather than staying small.

            2. 1

              Alibaba, probably.

              1. 1

                Oh, I didn’t realize that was their main platform.

            3. 1
              1. 2

                I was thinking about that too, but I’m guessing that CFA has a fraction of the traffic of Target (especially this time of year). Love those sandwiches though…

        3. 2

          Had they done so, I bet they’d already be down :D

        4. 1

          I work at a shop with about 1k containers being managed by mesos and it is a breath of fresh air after having been forced to use k8s. There is so much less cognitive overhead to diagnosing operational issues. That said, I think any mesos ecosystem will be only as good as the tooling written around it. Setting up load balancing, for instance . . . just as easy to get wrong as right.

    2. 14

      do all substacks do the annoying “subscribe” modal while i’m trying to read it?

      1. 10

        I think so. When medium started doing that to readers that was when I stopped posting on medium.

        1. 6

          As a reader, thank you. That stuff drives me crazy, even if it’s just a click to dismiss it.

      2. 1

        Not all, but they “encourage” subscription. It’s not as bad as Medium, though. A Reader mode provided by most browsers can deal with it. Or if you need, there’s a rule for adBlock somewhere.

    3. 5

      The work that went into automating things is really nice, and I’m sure it was a very interesting set of challenges to solve.

      So now, the team at Twitter has a bunch of tools telling them what servers are failing, maybe why… but you still need humans to go and fix things. It always boils down to humans going there and replacing faulty hardware, or humans investigating why a server did not (re)start properly, etc.

      1. 1

        And bring the angle grinder.

    4. 2

      Unless Twitter requires manual interventions to run (imagine some guys turning cranks all day long :)), why exactly would it go down?

      1. 15

        Eventually, they will have an incident and no one remaining on staff will know how to remediate it, so it will last for a long time until they figure it out. Hopefully it won’t last as long as Atlassian’s outage!

        1. 15

          Or everyone remaining on staff will know how to fix it but they will simply get behind the pace. 12 hour days are not sustainable and eventually people will be ill more often and make poorer decisions due to fatigue. This post described the automation as clearing the way to spend most of their time on improvements, cost-savings, etc. If you only spent 26% of your time putting out fires and then lost 75% of your staff, well, now you’re 1% underwater indefinitely (which completely ignores the mismatch between when people work best and when incidents occur).
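
          The arithmetic behind that "1% underwater" figure, as a rough sketch. It assumes the firefighting workload stays constant after the cuts and that everyone left does nothing but firefighting; the function name is just for illustration.

```python
from fractions import Fraction


def firefighting_shortfall(firefighting_share: Fraction, staff_lost: Fraction) -> Fraction:
    """Unmet firefighting work as a fraction of the ORIGINAL team's capacity.

    Assumes the workload is unchanged by the cuts and that every remaining
    person spends all of their time on firefighting.
    """
    remaining_capacity = 1 - staff_lost
    return max(Fraction(0), firefighting_share - remaining_capacity)


if __name__ == "__main__":
    lost = Fraction(75, 100)
    shortfall = firefighting_shortfall(Fraction(26, 100), lost)
    print(shortfall)               # 1/100 of the original capacity
    print(shortfall / (1 - lost))  # 1/25 of the remaining capacity
```

          So the 1% is measured against the original headcount; relative to the people who are left, it is 4% more work than they can absorb, before any improvement work at all.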

        2. 6

          Even worse - things that would raise warnings and get addressed before they’re problems may not get addressed in time if the staffing cuts were too deep.

      2. 8

        That’s how all distributed systems work – you need people turning cranks all day long :) It gets automated over time, as the blog post describes, but it’s still there.

        That was my experience at Google. I haven’t read this book but I think it describes a lot of that: https://sre.google/sre-book/table-of-contents/

        That is, if such work didn’t exist, then Google wouldn’t have invented the job title “SRE” some time around 2003. Obviously people were doing similar work before Google existed, but that’s the term that Twitter and other companies now use (in the title of this blog post).

        (Fun fact: while I was there, SREs started to be compensated as much as or more than Software Engineers. That makes sense to me given the expertise/skills involved, but it was a cultural change. Although I think it shifted again once they split SRE into 2 kinds of roles – SRE-SWE and SRE-SysAdmin.)

        It would be great if we had strong abstractions that reduce the amount of manual work, but we don’t. We have ad hoc automation (which isn’t all bad).

        Actually Twitter/Google are better than most web sites. For example, my bank’s web site seems to go down on Saturday nights now and then. I think they are doing database work then, or maybe hardware upgrades.

        If there was nobody to do that maintenance, then eventually the site would go down permanently. User growth, hardware failures (common at scale), newly discovered security issues, and auth for external services (SSL certs) are some reasons for “entropy”. (Code changes are the biggest one, but let’s assume here that they froze the code, which isn’t quite true.)
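
        As one small example of that kind of entropy, here is a sketch of an expiry check for a TLS certificate. It only uses the standard library's `ssl.cert_time_to_seconds`; the timestamp, the clock, and the 14-day threshold are made up for illustration and say nothing about how Twitter or Google actually monitor certs.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


if __name__ == "__main__":
    # Hypothetical certificate and clock, chosen only for the illustration.
    now = datetime(2018, 1, 1, 9, 34, 43, tzinfo=timezone.utc)
    left = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)
    print(f"{left:.1f} days left")
    if left < 14:  # assumed alert threshold
        print("renew soon")
```

        The check is trivial; the point is that someone still has to own the alert and do the renewal.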

        That’s not to say Twitter/Google couldn’t run with a small fraction of the employees they have. There is for sure a lot of bloat in code and processes.

        However, I will also note that SREs/operations became the most numerous type of employee at Google. I think there were something like 20K-40K employees under Hölzle/Treynor when I left 6+ years ago, could easily be double that now. They outnumbered software engineers. I think that points to a big problem with the way we build distributed systems, but that’s a different discussion.

        1. 7

          Yeah, ngl but the blog post rubbed me the wrong way. That tasks are running is step 1 of the operational ladder. Tasks running and spreading is step 2. But after that, there is so much work for SRE to do. Trivial example: there’s a zero day that your security team says is being actively exploited right now. Who is the person who knows how to get that patched? How many repos does it affect? Who knows how to override all deployment checks for all the production services that are being hit and push immediately? This isn’t hypothetical, there are plenty of state-sponsored actors who would love to do this.

          I rather hope the author is a junior SRE.

          1. 3

            I thought it was a fine blog post – I don’t recall that he claimed any particular expertise, just saying what he did on the cache team

            Obviously there are other facets to keeping Twitter up

        2. 4

          For example, my bank’s web site seems to go down on Saturday nights now and then. I think they are doing database work then, or maybe hardware upgrades.

          IIUC, banks do periodic batch jobs to synchronize their ledgers with other banks. See https://en.wikipedia.org/wiki/Automated_clearing_house.

        3. 3

          I think it’s an engineering decision. Do you have people to throw at the gears? Then you can use the system with better outcomes that needs humans to occasionally jump in. Do you lack people? Then you’re going to need simpler systems that rarely need a human, and you won’t always get the best possible outcomes that way.

          1. 2

            This is sort of a tangent, but part of my complaint is actually around personal enjoyment … I just want to build things and have them be up reliably. I don’t want to beg people to maintain them for me

            As mentioned, SREs were always in demand (and I’m sure still are), and it was political to get those resources

            There are A LOT of things that can be simplified by not having production gatekeepers, especially for smaller services

            Basically I’d want something like App Engine / Heroku, but more flexible, but that didn’t exist at Google. (It’s a hard problem, beyond the state of the art at the time.)

            At Twitter/Google scale you’re always going to need SREs, but I’d claim that you don’t need 20K or 40K of them!

            1. 1

              My personal infrastructure and approach around software is exactly this. I want, and have, some nice things. The ones I need to maintain the least are immutable – if they break I reboot or relaunch (and sometimes that’s automated) and we’re back in business.

              I need to know basically what my infrastructure looks like. Most companies, if they don’t have engineers available, COULD have infrastructure that doesn’t require you to cast humans upon the gears of progress.

              But in e.g. Google’s case, their engineering constraints include “We’ll always have as many bright people to throw on the gears as we want.”

            2. 1

              Basically I’d want something like App Engine / Heroku, but more flexible, but that didn’t exist at Google.

              I think about this a lot. We run on EC2 at $work, but I often daydream about running on Heroku. Yes it’s far more constrained but that has benefits too - if we ran on Heroku we’d get autoscaling (our current project), a great deploy pipeline with fast reversion capabilities (also a recentish project), and all sorts of other stuff “for free”. Plus Heroku would help us with application-level stuff, like where we get our Python interpreter from and managing its security updates. On EC2, and really any AWS service, we have to build all this ourselves. Yes AWS gives us the managed services to do it with but fundamentally we’re still the ones wiring it up. I suspect there’s an inherent tradeoff between this level of optimization and the flexibility you seek.

              Heroku is Ruby on Rails for infrastructure. Highly opinionated; convention over configuration over code.

              At Twitter/Google scale you’re always going to need SREs, but I’d claim that you don’t need 20K or 40K of them!

              Part of what I’m describing above is basically about economies of scale working better because more stuff is the same. I thought things like Borg and gRPC load balancing were supposed to help with this at Google though?

      3. 2
        1. Random failures that aren’t addressed
        2. Code and config changes (which are still happening, to some extent)

        It can coast for a long time! But eventually it will run into a rock because no one is there to course-correct. Or bills stop getting paid…

      4. 1

        I don’t have a citation for this but the vast majority of outages I’ve personally had to deal with fit into two bins as far as root causes go:

        • resource exhaustion (full disks, memory leak slowly eating all the RAM, etc)
        • human-caused (eg a buggy deployment)
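
        For the first bin, the usual early warning is a crude forecast of when the resource runs out. A minimal sketch, assuming linear growth between two free-space samples; the sample values and the `/` path are made up, and real monitoring would use more than two points.

```python
import shutil
from typing import Optional


def hours_until_full(free_then: int, free_now: int, hours_between: float) -> Optional[float]:
    """Linear estimate of hours until free space hits zero.

    Returns None when free space is not shrinking.
    """
    consumed = free_then - free_now
    if consumed <= 0 or hours_between <= 0:
        return None
    rate_per_hour = consumed / hours_between
    return free_now / rate_per_hour


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # real stdlib call; the path is an assumption
    print(f"{usage.free / usage.total:.0%} of / is free")
    # Made-up samples: 50 GB free a day ago, 40 GB free now.
    print(hours_until_full(50 * 10**9, 40 * 10**9, 24.0))
```

        A forecast like this only helps if someone is around to act on it before the disk actually fills.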

        Because of the mass firing and exodus, as well as the alleged code freeze, the second category of downtime has likely been mostly eliminated in the short term and the system is probably mostly more stable than usual. Temporarily, of course, because of all of the implicit knowledge that walked out the doors recently. Once new code is being deployed by a small subset of people who know the quirks, I’d expect things to get rough for a while.

        1. 2

          You’re assuming that fewer people means fewer mistakes.

          In my experience “bad” deployments are less about someone constantly pumping out code with the same number of bugs per deployment and more about the deployment breaking how other systems interact with the changed system.

          In addition fewer people under more stress, with fewer colleagues to put their heads together with, is likely to lead to more bugs per deployment.

          1. 1

            Not at all! More that… bad deployments are generally followed up with a fix shortly afterwards. Once you’ve got the system up and running in a good state, not touching it at all is generally going to be more stable than doing another deployment with new features that have potential for their own bugs. You might have deployed “point bugs” where some feature doesn’t work quite right, but they’re unlikely to be showstoppers (because the showstoppers would have been fixed immediately and redeployed)