tl;dr : The author mistakes cache coherency for a thing that scales.
The concept of multiple-readers-single-writer has been known for ages with many solutions and generally you need a lock because there are race conditions where a write is not atomic. The point of the single-writer is the idea that readers can simultaneously read and share their lock as long as nothing is writing (and obviously, only one writer can ever exist at a time)
Trivially, only if the write is atomic can you avoid the lock. It has been already known for a long time that message passing systems avoid contention by copying contended data. A write is always atomic if it is only seen once copied after it the update is done. The article states cpu cache coherency is used to copy the data. Somehow. But the cache will certainly see a partial update, right? If the cpu cannot update memory atomically, then it cannot update cache either. This cannot work as described.
Also, this article assumes that the last level cache contains a value consistent with the writer. You can’t rely on the cache coherency! It doesn’t scale! Coherency == global consistency == non-scale-town. On a highly scaled system (implied distributed) the nodes may exist on other machines, and even locally the cache may be split or non-uniform. Why? Because those algorithms aren’t magical. It scales so poorly, cpus with more than a certain number of cores don’t even include it… it costs too much.
So you’ll have to send messages, basically implementing your own multicasted coherency model. Otherwise, you are limiting yourself greatly to having your writers and readers all share the same cache, and thus limit the number of processes. Not to mention the rather frightening prospect of this running on a hypervisor.
So, this system as described to “work at all level of scale” is really just a single writer with all readers having a local copy with the readers distributed in some form of tree for a multicast (lots of interesting, fascinating papers for organizing such a tree). Which means it is roughly to my naive estimate O(log n + k) for distribution (to message pass the new value) in a balanced tree where k is the number of times a server drops a packet and postpones the subset from propagating. Now you also have to put provisions in to ensure ordering, handle failure (what if your writer dies!?). A nightmare. But it’s a distributed system, so your entire life is a nightmare. :P
It’s decentralized reading, centralized writing 101. And it. Does. Not. Scale. (unless you don’t write. just stop writing things)