The added file/daemon system looks like an eventually consistent replication wrapper around ZooKeeper, which they’ve added as a way to get higher availability. The composite system (ZK + wrapper) is only eventually consistent, unlike ZooKeeper itself, because when ZK is down the cache files stop being updated and may be out of sync with each other.
If this is right, why not just replace ZooKeeper with a system that was designed to be eventually consistent and highly available (Riak, maybe)? Then you wouldn’t have to worry about bugs introduced by the new replication system, meaning the daemons and the client libs for reading the files. Does ZK still have an advantage if you are accessing it through this layer?
The corruption safeguards are another matter. Those are probably worthwhile regardless of the choice of distributed data store. Human error being what it is…
The only advantage I can think of is ZK’s support for watching znodes. Their daemon could update the conf file immediately upon changes whenever ZK is up. Other systems would generally require polling to attempt the same.
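To make the watch idea concrete, here’s a rough sketch of the daemon side in Python. It is not Pinterest’s code: `write_atomically` and `make_watch_callback` are names I made up, and the callback is just a plain function you would hand to whatever watch mechanism your client offers (with kazoo, something like a `DataWatch` on the znode could play that role, but that wiring is left out so the sketch runs without a ZK cluster). The only real point is that each change notification rewrites the local file via a temp file and rename, so readers never see a half-written file.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers never observe a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic swap of old file for new
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file if anything failed
        raise


def make_watch_callback(path: str):
    """Return a callback intended to be registered as a znode data watch (hypothetical wiring)."""
    def on_change(data, stat=None):
        if data is None:
            # Znode missing or deleted: keep serving the last known good file.
            return
        write_atomically(path, data)
    return on_change
```

When ZK is unreachable the callback simply never fires, so clients keep reading the last file that was written, which is the eventual-consistency tradeoff discussed above.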
That said, Serf is an eventually consistent system that supports “run this thing when key X changes” (iirc), so that may be a good fit, like you said.
Good call about Serf. Pinterest have engineered their system to tolerate temporary and intermittent inconsistency in things like lists of nodes and services, so they could choose AP rather than CP. Plus (I’m guessing) the gossip protocol might be more efficient than updating those files and daemons from the ZK cluster. Plus, “5 to 10 MB of resident memory” per process (Serf, written in Go) vs. a JVM.
I imagine the daemon will need careful tuning, not to mention its own additional monitoring check. It will be confusing to push an update to ZK which is ignored by the daemon because the new value is “too different”.
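For illustration, a “too different” check might look something like the sketch below. Everything here is an assumption on my part: I don’t know what rule their daemon actually uses, and `check_update`, `max_change_ratio`, and the 0.5 default are invented. The sketch compares the proposed member set with the cached one and rejects the update if too large a fraction of entries were added or removed. It returns a reason string, since a silent rejection is exactly the confusing case.

```python
def check_update(current: set, proposed: set, max_change_ratio: float = 0.5):
    """Return (accepted, reason) for a proposed replacement of the cached member set.

    The threshold rule is a hypothetical example, not a known production policy.
    """
    if not current:
        # Nothing cached yet, so there is no baseline to compare against.
        return True, "no previous value"
    # Entries added or removed, measured against the size of the cached set.
    changed = len(current ^ proposed)
    ratio = changed / len(current)
    if ratio > max_change_ratio:
        return False, f"rejected: {changed} of {len(current)} entries changed"
    return True, f"accepted: {changed} of {len(current)} entries changed"


if __name__ == "__main__":
    cached = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
    # Losing one node out of four is within the threshold.
    print(check_update(cached, cached - {"10.0.0.4"}))
    # An empty list (e.g. a bad push) is flagged instead of being applied.
    print(check_update(cached, set()))
```

Picking the threshold is the tuning problem: too strict and legitimate large deploys get ignored, too loose and a bad push slips through. Either way, the rejection reason needs to end up in a log or monitoring check where someone will see it.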
Anyone know what protocol bugs they are talking about?