I think something this gets at a bit is that most popular distributed systems research these days is around consistency, which is only a very slender piece of the distributed systems pie. I think part of it is that the most sophisticated distributed systems many people run into* are distributed databases, and it can be impossible to debug many problems in distributed databases without understanding how to reason about consistency and availability.
However, there are many other areas of distributed systems that haven’t been getting as much love, even though they come up quite a lot in practice. For example, picking efficient load balancing strategies can be a hard problem, but many people just spin up an nginx and call it a day.
On the other hand, I haven’t been able to find practically any recent literature on service discovery, except RFC 6763. The same goes for detecting liveness, failure accrual, backpressure implementations, and isolation strategies. The AWS architecture blog actually has pretty good material, and we’ll occasionally see a really broadly useful paper from an industry luminary like Jay Kreps or Jeff Dean, but in general, most of the hip papers these days seem to be of pretty narrow interest.
To be perfectly clear: I think it’s important that distributed systems engineers understand consistency and availability tradeoffs if they want to build systems that can be reasoned about meaningfully, but those aren’t the only important ideas. I don’t think it’s a mistake that the Plan 9 docs don’t go on forever about the consistency guarantees of their filesystem.
In defense of this list in particular, it does make a reasonable attempt to address other areas. @jmhodges' youngbloods post and the fallacies of distributed computing are good examples of articles that have come out of industry and address a much fuller range of problems in distributed systems engineering.
Have I just been following the wrong people on twitter? Are people discussing other kinds of papers? I feel like I need to go back to the 80s to find original research about topics like RPC, naming, or distributed file systems.
*: that they have to think about–the internet is probably actually the most sophisticated, but it’s abstracted away well enough at this point that most people don’t even need to think about it.
I think the problem you are identifying is one of scale. Consistency and availability matter independent of scale. But effective load balancing is something most people can solve by putting an nginx up and calling it a day, until they reach a scale where that stops working.
Consistency also forms the underlying conceptual basis for things like service discovery. Service discovery becomes much closer to straightforward when you define it as a consistency problem, which means it’s not necessarily worth discussing independently.
That’s a really good insight, I hadn’t thought of that. I guess I assumed that everyone who was working on distributed systems had big clusters, but obviously you need to be reasoning about distributed systems even just when it just becomes more expensive to buy bigger machines.
Could you elaborate on what you mean by service discovery becoming more straightforward when you think of it as a consistency problem? It seems like you could use a very loose form of eventual consistency: for example, you could probably just use timestamped last-write-wins, set up NTP properly, and have machines that announced themselves but aren’t getting traffic reannounce themselves until they get traffic (unless you need a leader election). But many service discovery systems are built on top of ZooKeeper or Raft. Serf seems like it makes the tradeoffs that I’d like it to, but I haven’t heard anything, good or bad, about people using it in production.
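To make the loose scheme above concrete, here is a minimal sketch of a timestamped last-write-wins registry with reannouncement and TTL-based liveness. Everything here is hypothetical for illustration (the `Registry` class, method names, and the 10-second TTL are made up), and it assumes announce timestamps come from NTP-synchronized clocks:

```python
import time

class Registry:
    """Last-write-wins registry: instances reannounce periodically,
    and anything that hasn't reannounced within the TTL is presumed dead."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self.entries = {}  # (service, instance) -> latest announce timestamp

    def announce(self, service, instance, timestamp):
        # Last write wins: keep only the newest timestamp per instance.
        key = (service, instance)
        if timestamp >= self.entries.get(key, float("-inf")):
            self.entries[key] = timestamp

    def live_instances(self, service, now=None):
        # Liveness is purely timestamp-based; no coordination required.
        now = time.time() if now is None else now
        return sorted(
            inst
            for (svc, inst), ts in self.entries.items()
            if svc == service and now - ts < self.ttl
        )

reg = Registry(ttl_seconds=10.0)
reg.announce("web", "10.0.0.1:8080", timestamp=100.0)
reg.announce("web", "10.0.0.2:8080", timestamp=104.0)
print(reg.live_instances("web", now=105.0))  # both within TTL
print(reg.live_instances("web", now=111.0))  # 10.0.0.1 has aged out
```

The weak guarantee is visible here: a reader can see a stale view for up to one TTL plus clock skew, which is exactly the eventual-consistency tradeoff being described.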
Some people use DNS, but it seems really slow. I guess you could set DNS TTLs to 0 seconds, but that would be unnecessarily painful for large clusters.
Could you elaborate on what you mean by service discovery becoming more straightforward when you think of it as a consistency problem?
All service discovery effectively boils down to a key-value store, and the differences are the consistency model around that store. Service discovery isn’t really a thing of its own; it’s a use case.
I don’t know if that’s all it is. How do you handle failover, inserting new machines, and removing old machines? You can have polling systems and just accept some latency, but you often want service discovery to be fast and to support a push mechanism. It would also be nice if it were durable, so that even if it fell over, it wouldn’t take forever to reconstitute a server set.
Almost all of those issues come up in consistency papers.
But usually the assumption is that someone else handles it for you, I think. Usually they end up doing a half-assed job of it (cf. Cassandra + gossip), and it’s only addressed because it’s a general distributed systems concern that any implementation paper of a distributed system must mention. Maybe you’re thinking of different papers? Paxos, for example, doesn’t discuss service discovery at all.
I don’t think I’m seeing what you believe is being missed. As I said above, service discovery can be modeled as a key-value store, in which case the question is what kind of consistency model you have. So, no, the consistency models do not mention it explicitly, but if what I said is true they don’t need to.
Are you suggesting that service nodes are members of a distributed KV-store, forming the service discovery KV-store? Rather than, say, modeling service discovery on top of a KV-store?
ZooKeeper, for example, is just a KV store with a consistency guarantee.
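The "service discovery is just a KV use case" claim can be sketched in a few lines. This is an assumption-laden illustration, not anyone's real API: the `KVStore` stand-in and the `/services/<name>/<address>` key layout are made up, and in practice you'd swap in a store like ZooKeeper or etcd, whose consistency model then becomes your discovery semantics:

```python
class KVStore:
    """Stand-in for any key-value store; the consistency guarantees of
    whatever real store you use here are what service discovery inherits."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def scan(self, prefix):
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

# Registration, deregistration, and lookup are plain KV operations
# under a per-service key prefix.
def register(kv, service, address, metadata=""):
    kv.put(f"/services/{service}/{address}", metadata)

def deregister(kv, service, address):
    kv.delete(f"/services/{service}/{address}")

def discover(kv, service):
    prefix = f"/services/{service}/"
    return sorted(k[len(prefix):] for k in kv.scan(prefix))

kv = KVStore()
register(kv, "web", "10.0.0.1:8080")
register(kv, "web", "10.0.0.2:8080")
deregister(kv, "web", "10.0.0.1:8080")
print(discover(kv, "web"))  # ['10.0.0.2:8080']
```

Nothing discovery-specific remains once the store is abstracted away; failover, membership churn, and staleness are all questions about the store's consistency model.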
Some good links to papers here! Great summary