As someone who does distributed systems, I’m a little surprised at the claim that grpc makes distributed systems “as easy as making local function calls”. While I believe that in Google’s own datacenters this may be true enough to rely on, it seems pretty unlikely to be true in AWS, GCE, or even most ‘enterprise’ local deployments, given that grpc appears to rely on HTTP/2 over TCP, and TCP has, ah, nontrivial behavior in some environments.
TCP has, ah, nontrivial behavior in some environments
Can you elaborate on that?
TCP is a serialized protocol that delivers a single, strictly ordered byte stream, with both sides acknowledging data back and forth as it flows. So if one side or the other experiences latency or packet loss (whether due to slow or overloaded applications, full or contended packet queues in the kernel, cheap or bad or failing NICs, bad wiring, cheap or slow or bad or overloaded or distant intermediate routers, or some combination of all of those, i.e. the public internet), then every communication on the socket will hang until the first late or lost packet is retransmitted or otherwise gets through.
This is head-of-line blocking: a highly multiplexed protocol over TCP passes the lag along to every channel receiver, obviously.
Applications generally expect function calls to return so quickly as to be virtually instant, and not to have to retry them on failure. So when communication breaks, packets are lost, the network fails temporarily or permanently, or even just latency exceeds your maximum expectation and trips an app-level timeout, you get unexpected (and possibly fatal) backpressure, unexpected (and possibly fatal) application-level retries or crashes, or other hilarity, like exposure of completely unreasonable race conditions that would never otherwise happen, etc., etc. If you naively replace a native function with a distributed system, you will have a variety of bad times that look like nothing you’ve ever seen before.
In situations like Google’s, where they have obscenely expensive, gorgeous, exquisitely maintained switches in a big matrix sitting right on top of their racks (note: this is hearsay; I don’t yet work at Google), and there aren’t any routes going through an EC2 ELB or Joe’s VPS and Bait Shop in Dubai over a satphone, there are probably very few of these events, so maintaining the fiction that it’s just like a regular procedure call is worth it and a super-powerful abstraction that they can gift to their developers.
For those of us using Joe’s VPS and Bait Shop in Dubai, the window of risk is radically larger. Our scale problems are also a lot smaller, though. But it can be very difficult for us to know where our (collective) service providers’ failure points are, and so where believing in the abstraction becomes dangerous and unviable. It’s unquantifiable because we don’t have enough visibility into the points of possible danger; we can only guess, based on our failure surface, at how bad the problem will be.
If pressed for recommendations, my gut tells me that the average dev on the average infrastructure should stick with the idea that network communication should be heavily decoupled (e.g. JSON over HTTP, or protobuf over HTTP, or whatever), rather than try to use grpc as a function-call replacement mechanism directly. You don’t get Internet Scale with JSON over HTTP, but you do have a nice, reliably bad protocol with lots of debug handles, well-understood logs, and Stack Overflow. People with serious distributed-systems problems should continue to use erlang ( :D ). People with gorgeous deluxe infrastructure running locally should totally jump all over grpc, especially if they integrate QUIC, but that’s for another day and another five paragraphs.
It seems to be “reactive streams” from people who don’t understand the need for backpressure, think that communication across machines can or should look like local method invocation, haven’t checked the existing state of the art in the last 15 years, and believe that the primitive onX API is all people need.
Accusing the folks at Google of not understanding backpressure, or of not understanding the existing 15-year state of the art, is a little far-fetched. It’s just that grpc comes from a radically different environment than the ones that you and I are used to, and has different affordances and abstractions than solutions that grew out of our more common environments. As the two environments come into contact with each other, there will be friction and weirdness, but I suspect we have as much to learn from Google as Google does from us.
Or maybe they only open-source the bad stuff and keep the good stuff secret as a competitive advantage? The things they have released (e.g. GWT, Android, Go) aren’t nearly as good as they want everyone to think they are, in my opinion. It’s usually quite underwhelming to see the hype vs. the actual code.