

    There are a few Twitter or ex-Twitter folks on Lobsters; happy to answer questions about this.


      Were there any choices made in the mentioned 2012 architecture that were false starts, or have proven to be pain points, in the time since?


        Not really. The “Ugly” slide covers most of them, although some of them have gotten slightly better.

        One issue that isn’t covered is that because Finagle was designed for RPC, it can be slightly awkward for streaming, even though many of our clients want to use Finagle for its other nice abstractions. Finagle is in large part about connection management, so it seems like we should be able to reuse a lot of that logic, but we’re still trying to work out the kinks in either:

        1. figuring out how to use the RPC-specific APIs for streaming
        2. figuring out how to provide an alternate API for streaming
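        As a loose sketch of why those two options differ (Finagle itself is Scala; this is not its real API, and all the names here are made up for illustration), the tension comes down to the return shape of a service:

        ```python
        from typing import Callable, Iterator

        # An RPC-style service is a function from one request to one response.
        RpcService = Callable[[str], str]

        def echo(req: str) -> str:
            return f"echo: {req}"

        # Option 1: force streaming through the RPC shape. The whole stream
        # must be buffered into a single response, losing incremental delivery.
        def stream_as_rpc(req: str) -> list[str]:
            return [f"{req}-{i}" for i in range(3)]

        # Option 2: an alternate streaming API changes the return shape,
        # yielding chunks as they become available.
        def stream_service(req: str) -> Iterator[str]:
            for i in range(3):
                yield f"{req}-{i}"

        print(echo("hi"))                    # echo: hi
        print(list(stream_service("tweet"))) # ['tweet-0', 'tweet-1', 'tweet-2']
        ```

        The connection-management logic below either shape could in principle be shared; the hard part is that everything built on top assumes the one-request/one-response contract.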

        There are a few technologies which haven’t really scaled with us, and which we’ve had to modify, replace, or write sophisticated wrappers around. For example, very few services speak directly to Redis; almost all caches that need data structures instead go through a cache service that sits in front of Redis (which I don’t know a ton about). We run a forked version of memcached. And we found that Cassandra was different enough from our use case that it would be faster to implement something similar ourselves than to understand Cassandra well enough to change it to fit our needs.

        There’s still a tension between building things as services or as libraries, with a preference for services, and I think that tension will exist for a long time.

        Most of the rest of the choices have worked out well for us. The infrastructure is continuing to mature, but we’re beginning to be able to do very sophisticated things.

        One consequence of having a consistent system is that queries of death can propagate through it: because all services share the same implementation, they all suffer from the same bugs. If a service ships with a bug that can put it in a bad state, and that lets it put the services it makes requests to into the same bad state, you can easily take down large swathes of the website. This is nice in that the entire system can adjust and react almost instantaneously, but it is also very dangerous. RFC 3439 advises against allowing local changes to cause global effects:

        > An important method for reducing amplification is to ensure that local changes have only local effect (this is as opposed to systems in which local changes have global effect).

        This has caused a few missteps, but we’ve seen enough benefit that we think it’s OK. Also, since we’re centralized, unlike the internet, fixing problems is much easier for us.
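        A toy illustration of the propagation described above (the service names and the `SharedClient` class are invented for this sketch, not anything in Twitter’s stack): when every service runs the same implementation, a poison request trips the same bug in each copy it reaches.

        ```python
        class SharedClient:
            """Stands in for services built on one shared implementation."""

            def __init__(self, name, downstream=None):
                self.name = name
                self.downstream = downstream
                self.healthy = True

            def handle(self, request):
                if request == "query-of-death":
                    self.healthy = False  # the shared bug trips in every copy
                if self.downstream is not None:
                    self.downstream.handle(request)  # and the request propagates

        # A small chain of services: frontend -> timeline -> storage.
        storage = SharedClient("storage")
        timeline = SharedClient("timeline", downstream=storage)
        frontend = SharedClient("frontend", downstream=timeline)

        frontend.handle("query-of-death")
        print([s.name for s in (frontend, timeline, storage) if not s.healthy])
        # all three services end up in the same bad state
        ```

        The flip side, as noted, is that a fix deployed to the shared implementation heals every service at once.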

        1. 5

          Great response. Thanks!