At AWS, we’ve fairly broadly deployed a protocol called SRD. I don’t think anybody really disagrees with the idea that deploying other protocols inside the datacenter (and even across the WAN) has a lot of potential benefits. This is already happening all over the place. It’s a nice win for abstraction (to snan’s point) that the internet makes this possible.
I don’t think any of these objections to TCP are wrong (in fact, a lot of them are right). But in practice, most systems already work around these problems in various ways, and so the competitor to something new isn’t naive use of TCP, it’s many years of optimized TCP deployments. That’s a much higher bar!
I’ll admit that Ousterhout’s framing of things often rubs me the wrong way, so maybe I’m a bit biased here.
See “A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC”: https://ieeexplore.ieee.org/document/9167399, https://assets.amazon.science/a6/34/41496f64421faafa1cbe301c007c/a-cloud-optimized-transport-protocol-for-elastic-and-scalable-hpc.pdf
Workarounds can get you only so far. As more and more systems increase their bandwidth, I do think that we’ll hit a wall. Ousterhout talks about 100 Gbps links - AWS has links of that size only on a few of the very largest instance types, and I’d guess those are some of the least used ones. When we start getting 200 Gbps links in “common” cloud servers, I expect these problems to start hurting really badly. It’s better to start looking into solutions now, rather than wait until everyone needs to deal with those problems.
When optimizing the old algorithms is no longer enough, you start looking into new algorithms that could offer better complexity just by themselves. They might need a certain scale to become equally efficient, but past that point, it’s pointless to even try with the worse algorithm.
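To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch of how much time a host has per packet at line rate. The 1500-byte packet size is my assumption (a typical Ethernet MTU, ignoring framing overhead), and `per_packet_budget_ns` is just an illustrative helper, not anything from the Homa or SRD papers.

```python
def per_packet_budget_ns(link_gbps: float, packet_bytes: int = 1500) -> float:
    """Nanoseconds available per packet when the link is fully saturated.

    Assumes every packet is `packet_bytes` long (1500 is an assumed MTU)
    and ignores Ethernet framing overhead.
    """
    bits_per_packet = packet_bytes * 8
    # bits divided by Gbit/s gives nanoseconds directly
    return bits_per_packet / link_gbps


for gbps in (100, 200):
    print(f"{gbps} Gbps: {per_packet_budget_ns(gbps):.0f} ns per packet")
```

Under those assumptions, going from 100 to 200 Gbps halves the per-packet budget from about 120 ns to about 60 ns, so any fixed per-packet cost in the stack gets twice as expensive relative to the link.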
The fact that this might even be possible is a giant win for the stack architecture of the internet 🥰
A tangential question on the Homa transport (or SRD, as @mjb noted below) – do these protocols solve the latency issues encountered by interactive video streaming applications?
E.g., guitar lessons via Skype or other such systems typically suffer from a delay between the visual and audio portions.
I know that folks are trying to solve it via specialized, latency-detecting algorithms, but I am wondering if these low-level transport protocols would sort of ‘carry the water’ and free video streaming developers from dealing with the low-level protocol complexities/limitations that we have with TCP.
Video streams are one of the things where TCP does actually make sense, since TCP is made for transporting streams of data. The problems where Homa or similar protocols help are primarily various kinds of RPC communication, and with the proliferation of microservices, that’s the biggest use of the network in datacenters.
But for interactive applications TCP is usually overkill - humans can cope if video drops for a second or two, so you don’t need reliability guarantees (as much) for this application. It’s usually done over UDP, just hoping that all packets reach the end user. And there aren’t that many packets in each video stream - I think ~1 Mbps is about standard per participant, so that’s roughly 100 packets per second. If you drop a couple, you can just drop frames until the next keyframe (I-frame) comes up. What often causes large delays is bufferbloat, or the inability to create P2P connections between participants, so they need to route through some server, which adds latency, especially if participants are geographically close to each other. So the reason for latency is usually somewhere in the network, not the transport layer.
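For anyone who wants to check the packet-rate estimate, here is a quick sketch of the arithmetic. The 1250-byte payload is my assumption (it is the size that makes 1 Mbps come out to a round number), and `packets_per_second` is only an illustrative helper; real streams vary their packet sizes.

```python
def packets_per_second(bitrate_bps: float, payload_bytes: int) -> float:
    """Packets per second needed to carry `bitrate_bps` of payload.

    Assumes every packet carries exactly `payload_bytes` of media data
    and ignores header overhead.
    """
    return bitrate_bps / (payload_bytes * 8)


# ~1 Mbps per participant, assumed 1250-byte payloads
print(packets_per_second(1_000_000, 1250))  # 100.0
```

With smaller payloads the rate goes up accordingly, but it stays in the low hundreds of packets per second per participant, which is tiny compared to datacenter RPC traffic.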