So, I have a serious question: I understand different databases have different trade-offs; that’s fine. But since MongoDB seems to reliably fail Jepsen tests in non-intuitive ways, I’m having trouble figuring out two things:
Are services running on MongoDB just losing data constantly and no one notices? If not, has the frequency of failures, or the set of failure states, decreased compared to five years ago?
Does this imply that there should be some sort of “Jepsen-for-the-99%” test? What would it take for MongoDB to legitimately pass? What else that currently fails Jepsen would then pass?
Sort of, yes. If your network experiences a hiccup, your MongoDB cluster can go AWOL or FUBAR, depending on how the dice roll. That is on top of the usual problems with organically growing document stores…
To legit pass, a MongoDB server should handle network failure to the cluster by becoming either unavailable or, if a quorum is present, continuing operation. Continuing operation in the absence of a quorum or any other mechanism to ensure data consistency is an immediate fail IMO.
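To make that rule concrete, here is a rough sketch of the decision I mean. This is not MongoDB's actual election or replication logic; the function names and the "accept"/"unavailable" return values are made up for illustration, and it assumes quorum means a strict majority of the configured cluster members.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """Return True when the reachable members form a strict majority.

    `reachable` counts this node plus every peer it can currently talk to.
    Both names are hypothetical, not part of any MongoDB API.
    """
    return reachable > cluster_size // 2


def handle_write(reachable: int, cluster_size: int) -> str:
    """Decide what a node should do with a write during a network fault."""
    if has_quorum(reachable, cluster_size):
        # Majority side of the partition: safe to keep operating.
        return "accept"
    # Minority side: refuse rather than risk divergent data.
    return "unavailable"
```

In a five-node cluster split 3/2, the side with three nodes keeps accepting writes and the side with two refuses them. An even split (2/2 in a four-node cluster) leaves neither side with a majority, so both become unavailable.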
Yes, services have just been losing data. Take Parse, for instance:
https://medium.baqend.com/parse-is-gone-a-few-secrets-about-their-infrastructure-91b3ab2fcf71
Network partitions & failovers are both relatively uncommon events in day-to-day operations.
You’re only moderately likely to lose a few minutes of updates once every few years.
This is something that has proven to be untrue many times over and has been refuted by @aphyr himself:
https://queue.acm.org/detail.cfm?id=2655736
I said “relatively uncommon”; that is, not frequently enough to cause enough data loss to kill a business built on it.