Author here. To be clear, all the stories in this series are jokes, not endorsements. Engineering interviews are a complex problem and I won’t begin to discuss them here, except to say that there are people whose entire lives involve studying and improving teams of humans, they’re called organizational psychologists, and it might be worth hiring some.
Since folks have expressed incredulity that these techniques are at all involved in modern programming jobs, I should note that I have had to implement trees, sorting algorithms, and parsers in my professional code. For instance, I wrote the clj-antlr parser bindings to support Riemann’s query parser, which is chock full of tree transforms, on account of it being a compiler for search predicates. Knossos includes a hand-rolled binary search over… what are essentially packed and bitmasked structs, masquerading as a JVM int array. There’s also some graph neighborhood stuff in there that allows us to memoize arbitrary function calls over objects into pointer-chasing through densely packed int arrays. Gretchen implements a number of simplification algorithms over trees of boolean expressions, as well as Tseitin-expansion to reduce predicates to conjunctive normal form, which are required to turn transaction dependency graphs into constraint- and SAT-solver-friendly expressions. This is, like, the third or fourth time I’ve tackled this problem.
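For anyone wondering what a binary search over "packed and bitmasked structs, masquerading as an int array" might look like, here's a minimal sketch. This is not Knossos's actual code; the field layout (key in the high bits, value in the low bits) is invented for illustration:

```python
# Hypothetical layout: each "struct" is one int, with a key field
# in the high bits and a value field bitmasked into the low bits.
KEY_SHIFT = 16
VALUE_MASK = (1 << KEY_SHIFT) - 1

def pack(key, value):
    """Pack a (key, value) record into a single int."""
    return (key << KEY_SHIFT) | (value & VALUE_MASK)

def binary_search_key(packed, key):
    """Binary search a sorted flat array of packed records by key.
    Returns the value field, or None if the key is absent."""
    lo, hi = 0, len(packed) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        mid_key = packed[mid] >> KEY_SHIFT
        if mid_key < key:
            lo = mid + 1
        elif mid_key > key:
            hi = mid - 1
        else:
            return packed[mid] & VALUE_MASK
    return None

# Records stay sorted by key because the key occupies the high bits.
records = sorted(pack(k, v) for k, v in [(7, 42), (3, 9), (12, 5)])
```

The win over an array of objects is cache locality: the whole structure is one contiguous block of ints, with no pointer-chasing per element.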
I don’t… think I’ve done a regex algorithm yet, but my former coworker Zach Tellman has done quite a few DFA and NFAs in his time; Automat comes to mind. I’m sure it’s only a matter of time before I wind up doing it too.
My experience is, of course, not representative. :)
Why is it not representative? Seems like problems any experienced person may encounter in their travels. Enjoyed the story BTW.
That’s the hope, but I suspect not everyone gets the chance to venture out into the interesting waters of comp sci.
I’d guess because the median programmer is writing business code and never writes library code in their entire career.
Come on, that’s rock-star stuff.
It’s true, it’s obviously great work. My point is that aphyr IS representative of experienced people. Experienced people … experience many variety of problems.
This is something that the rest of the team and I have been working on for more than a year. I think open source sustainability is going to be one of the big issues the tech community needs to face in the next ten years, in all communities, but particularly in niche ones like Clojure. Ruby Together has a really good model, so we copied it and applied it to the Clojure community (with some tweaks). Happy to answer any questions people have.
Thank you for putting this together–all of you. I’m signing up Jepsen as a corporate sponsor right now.
For testing byzantine faults, it’s important to keep results actionable. https://github.com/madthanu/alice does a nice job of introducing realistic filesystem semantics for crash testing. You may need to run it on the ubuntu version they recommend, as their patched strace is a little out of date. It’s a pretty simple tool to use other than that!
Jepsen is a distributed systems test first and foremost, but yes, for single-node faults, tools like Alice are nice. I actually spent a week or so on a research project involving filesystem-level faults, but it didn’t produce useful results within the time I had available.
Recovery correctness has a ton of implications for distributed systems. It’s vital for leader election in broadcast-replicated systems with progress-related invariants, like what raft needs to enforce. I’ve also come across several popular (purportedly linearizable) distributed databases that will do things like start serving before their recovery process completes, returning stale reads just behind the in-progress recovery scan of the WAL. You’ll find gold if you look.
Wow thank you, that sounds like a really interesting research opportunity!
If you’d like to see an example of how these numbers play out in practice, take a look at Zach Tellman’s benchmarks for Bifurcan (https://github.com/lacuna/bifurcan/blob/master/doc/benchmarks.md), which compare Java’s ArrayList to Clojure’s vectors (which, like Scala’s vectors, are 32-way hash array mapped tries), and also Java’s HashMap and HashSet to Clojure’s hashmaps. You can also see how some of the decisions in his linear and persistent collections (List vs Linearlist, Map vs LinearMap, etc) affect performance.
Out of curiosity, when you evaluated Knossos performance, were you testing using histories with crashed clients, or happy-path histories where client operations succeed relatively quickly? Knossos makes some choices which optimize for the former, and I think the P-compositionality paper focuses on the latter, but it’s been a few years since I’ve been down in those papers’ guts. May need to revisit those assumptions if they were slower for your workload.
To make the comparison more fair to Knossos, I tested histories where you can’t take advantage of P-compositionality. (In short, P-compositionality is when a history is linearizable iff all subhistories in a partitioned history are linearizable; e.g. with a map, you can partition by keys and check the subhistories independently, and that’s a lot faster.)
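The decomposition step itself is simple; here's a minimal sketch (the operation format below is hypothetical, not Porcupine's or Knossos's actual history representation):

```python
from collections import defaultdict

def partition_by_key(history):
    """Split a history of operations on a map into independent
    per-key subhistories. Under P-compositionality, the full history
    is linearizable iff every per-key subhistory is linearizable,
    so each one can be checked independently (and in parallel)."""
    subhistories = defaultdict(list)
    for op in history:
        subhistories[op["key"]].append(op)  # preserves per-key order
    return dict(subhistories)

# A tiny invoke/ok history over two independent keys:
history = [
    {"process": 0, "type": "invoke", "f": "write", "key": "x", "value": 1},
    {"process": 1, "type": "invoke", "f": "read",  "key": "y", "value": None},
    {"process": 0, "type": "ok",     "f": "write", "key": "x", "value": 1},
    {"process": 1, "type": "ok",     "f": "read",  "key": "y", "value": 0},
]
```

The speedup comes from search-space size: checking k subhistories of length n/k is far cheaper than one history of length n, since the checker's worst case is exponential in history length.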
I used test data from Jepsen etcd tests: https://github.com/anishathalye/porcupine/blob/master/test_data/jepsen
Here’s the quick-and-dirty benchmarking code I used: https://gist.github.com/anishathalye/a315a31d57cad6013f57d2eb262443f5 (basically, just timing knossos.core/linearizable-prefix).
Even on histories where Knossos is slow, e.g. etcd_002.log and etcd_099.log (which timed out after > 2 days), Porcupine seems to do fine, taking hundreds of milliseconds on a single core to check the histories.
Out of the ~100 tests, filtering for ones that Knossos finished in < 1 hour, we have a speedup of 1000x on Knossos’s fastest test (etcd_016.log) and a speedup of 20,000x on Knossos’s slowest test (etcd_040.log). And for the ones that timed out (because I didn’t want to run the tests for way too long), e.g. (etcd_099.log), Porcupine had a speedup of > 10^6.
I haven’t had time to look into Knossos’s implementation in detail and figure out exactly where Porcupine’s speedups are coming from, but that would be cool to do at some point. Jepsen/Knossos are obviously a way more complete solution for testing distributed systems, and it would be cool to speed up the linearizability checking aspect.
Ohhhhhh! Yeah, you’re using the original algorithm–that’s definitely slower. Try (knossos.linear/analysis model history) instead–that’s based on the JIT-linearization algorithm from Lowe’s paper, plus some additional optimizations–instead of performing union-find for compaction, we pre-compile the state space into a graph of pointer arrays, which turns the search into immutable pointer-chasing instead of running model code. There are… certain cases where the knossos.core algorithm is preferable (it offers better parallelism) but linear should be significantly faster. Still not good though; I’d like to sit down and figure out some alternative strategies.
And yeah, as you note, we don’t do P-compositionality in Knossos–that’s handled by Jepsen, which performs the decomposition into independent keys for maps, sets, etc, then hands individual histories to Knossos. Would be nice to fold into Knossos later though!
Last thing, if you wind up packaging this for distribution, I’d like to offer a hook in Jepsen so we can pass histories to it. If you’d like to define some sort of serialization format (JSON, tsv, EDN, protobufs, etc) for passing histories in and getting analysis results back, I can wrap that up in a Jepsen checker as an alternative strategy. :)
Oops, I didn’t know that. I redid the benchmarking with (knossos.linear/analysis model history), running Knossos on 6 cores and Porcupine on 1 core.
The benchmark results did change: Knossos completed every check significantly faster. On some tests, the new algorithm performed significantly better: e.g. on etcd_040.log, Porcupine has a speedup of 278x, as opposed to a speedup of 20,000x when comparing against the original algorithm (knossos.core/linearizable-prefix).
Porcupine still ran faster on all tests, though; the following is a summary of the results (over the ~100 Jepsen etcd tests):
Min speedup: 8.1x on etcd_002.log
Max speedup: 21,219x on etcd_067.log
Median speedup: 1000x
Ooh, that sounds cool! I’ll probably end up packaging this for distribution in a couple weeks, and I’ll definitely reach out to you once I have an API for communicating with Porcupine.
The motherboard is wonky and refuses to find half the disks on boot. You can crash the box by using certain USB ports. We have a complicated relationship.
I’m curious about this. You can buy a stable, reliable computer with 48 cores and 128GB of ECC RAM completely off the shelf — Dell/HP/etc sell rackmount or tower servers with these configurations.
Is using a really powerful but unreliable desktop computer a net productivity advantage relative to using a small reliable desktop + SSHing into a big reliable server? I appreciate that remote debugging is often not as nice as local debugging, but on the other hand remote debugging has some nice side benefits like the fact that the UI that you’re using doesn’t go unresponsive when the machine gets loaded.
I can totally sympathise if it turns out that the root cause of this was just that Kyle really, really wanted an overpowered computer out of sheer nerdlust.
I know it’s a time-honored tradition to armchair-architect strangers’ technical decisions without regard for the privileges of context or experience, but fuck it, I’ll bite.
I used to rebuild and rack servers for a living, and considered that here, but ultimately decided I wanted a workstation.
It’s quieter; I didn’t feel like having a screaming banshee 2u sitting in my tiny SF bedroom. It means owning one computer instead of two, which is cheaper, takes up less space, and cuts down on my time spent doing stupid sysadmin stuff. It’s also way less of a pain in the ass to work with than the janky-ass combination of remote filesystems, SSH tunnels, rsync hacks, X forwarding, and Yourkit injection that I have to use with remote Jepsen clusters.
Thank you for replying! I’m sorry if I came off as insulting. Edit: I apologise for insulting you. That was not my intention.
I appreciate the noise issue. I’m used to 1U servers being awful for it because the fans are small so the noise is high-pitched, haven’t gotten my hands on 2U or up to see if they’re much quieter. I thought tower servers were supposed to be no worse than desktops in this regard? Since they’re not that different and can use similarly huge fans?
janky-ass combination of remote filesystems, SSH tunnels, rsync hacks, X forwarding, and Yourkit injection
Ouch. Good point, avoiding that mess is worth a lot of effort.
My guess is that it’s some kind of whitebox. I’ve never had good luck with them; there’s always some kind of jank. I replaced my whitebox server with an SFF business desktop and it’s been far more stable.
VoltDB doesn’t have a whole lot to sell to me, really. That said, if I understand the chain of events properly, VoltDB sponsored this Jepsen post (and all the research that went into it), took those findings and started fixing the problems unearthed. That’s an admirable commitment to both data safety and openness, and it means I’ll consider VoltDB preferably over competitors should I ever need that feature set.
Yep, you’re understanding correctly. Like RethinkDB, VoltDB approached me for help in testing their systems, and funded the research. I found initial cases pretty quickly, deeper problems over the next month, and worked with their team for the next month or so to create more stringent tests and evaluate proposed fixes–VoltDB ported some of these test scenarios to their internal test suite, and is integrating Jepsen into their testing cycle now. That work culminated in the release of 6.4 last week. You can read more about how I handle sponsored research, and see the full set of bugs we uncovered on VoltDB’s issue tracker.
Started a comment here but it got too long: https://aphyr.com/posts/325-comments-on-you-do-it-too
There are some points where I just don’t understand you (GCs are not partitions in the CAP proof, and that’s not new: see the CAP FAQ, point 16: http://henryr.github.io/cap-faq/). You can push them into it (for example by putting some deadline constraints in the definition of availability), but then it’s not really CAP anymore.
It’s easier to choose an algorithm which is safe in the general case of asynchronous networks,
Yeah, this one is interesting. I agree on the algorithm choice (there is no other choice anyway), but many errors (typically not flushing writes to disk) are visible only with kill -9 or node failures. You have to test those independently. Partition tolerance does not mean fault tolerance.
Partitions in the G&L proof are any pattern of message loss. Any garbage collection period which exceeds the duration an application will wait for a message before considering it lost (e.g. tcp socket timeouts, application-level timeouts) will get you the same behavior.
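A toy illustration of the point (the deadline value and the synchronous call are simplifications; a real system would use async I/O, but the caller's-eye view is the same):

```python
import time

RESPONSE_TIMEOUT = 0.05  # application-level deadline, in seconds

def call_with_timeout(remote_fn):
    """From the caller's side, a peer paused in a stop-the-world GC is
    indistinguishable from one cut off by the network: both simply miss
    the deadline, so both count as 'partitioned' for CAP purposes."""
    start = time.monotonic()
    result = remote_fn()  # stands in for a request to a remote node
    if time.monotonic() - start > RESPONSE_TIMEOUT:
        raise TimeoutError("peer considered partitioned")
    return result

def healthy_peer():
    return "ack"

def gc_paused_peer():
    time.sleep(0.1)  # simulate a GC pause longer than the deadline
    return "ack"
```

The message from the paused node may eventually arrive, but once the application has given up waiting, the effect on correctness is identical to the message having been lost.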
Yep, that’s what I call pushing it into it (application cannot wait forever => there are deadlines constraints). CAP applies in this case (i.e. you really have to choose between consistency & availability).
GC are still a little bit simpler than a true network partition, because the process stops working for everybody. Agreed, there will be some nasty race conditions at the end. But you don’t lose a full rack when there is a GC. It’s a much nicer kind of fault, one node at a time. In a big data system, if you lose a rack you have to replicate the data again for safety. With a GC you do not need that: the node will come back to life before you start to replicate the 2TB of data it was storing (not to mention the case w/ a rack!).
I do agree with you on the asynchronous part: the algorithm you need to choose when the network is asynchronous will help a lot with partitions and with GC. But you need to test both.
It’s a much nicer kind of fault, one node at a time
GC is a notorious cause of distributed systems data-loss, typically where it allows two nodes to believe they’re the primary at the same time. Take a look at the Elasticsearch mailing lists sometime for all kinds of horror stories around high load, memory pressure, or GC causing inconsistency.
I’m not sure if the size of the fault is necessarily relevant for this discussion, either.
Agreed again, but node failures and network partitions will add a few other horror stories.
I mean, I would expect a software vendor to say
We have tested dirty node crashes, no data loss
We have tested GC. No data loss, no performance issue if the GC is contained within 10s.
We have not tested network partitions. Per our design it should be fine (we’re aiming at AP: availability during partition), but it’s still an edge case.
Rather than: “we’re partition tolerant of course.”
And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield). Not that easy to do (I don’t know ES).
Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition. Even if LAN fabric were totally reliable, I’m trying to remind people that partition tolerance is about message delivery, not just what we traditionally consider “the network”.
And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield).
Gilbert & Lynch, footnote 4: “Brewer originally only required almost all requests to receive a response. As allowing probabilistic availability does not change the result when arbitrary failures occur, for simplicity we are requiring 100% availability.”
Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition.
I doubt I said something like this :-)
But yeah, for sure the whole post is only about network partitions. I will update the post to make this clear.
“CA exists, and is described as acceptable for systems running on LAN”
“Stonebraker was considering small systems of high range servers on a LAN”
“it is also fine to assume CA on a LAN”
None of those are mine (lynch/stonebraker/brewer)
Both Stonebraker and Brewer consider (quoting Brewer but Stonebraker said exactly the same thing) “CA should mean that the probability of a partition is far less than that of other systemic failures”, so even if they think that CA is acceptable in some specific cases on a LAN that does not mean they think that LANs won’t partition.
GC are still a little bit simpler than a true network partition… It’s a much nicer kind of fault, one node at a time
This is usually the case. However, I’ve also seen the back pressure resulting from a GC cause other nodes to become loaded. The other nodes then started a GC. Now there was a feedback loop and the entire system ended up falling over.
The system could have been configured better… but that’s kind of the same point about experiencing partitions on a LAN. It’s not cost-effective and you’re still going to miss something.
People seem not to realize (or not to want to realize) that CAP is a mathematical theorem which is rigidly true in every circumstance and cannot be talked around or ignored. It’s simply a fact. In the presence of network partitions, your system will sacrifice consistency or availability (or, in a no-doubt worryingly large number of cases, both); the most you can do is pick. (It is a safe bet, by the way, that availability is the wrong choice [edit: to keep; that was entirely unclear, sorry].)
(As an amusing-to-me aside, CA is what every system should be in the absence of partitions. If your system cannot deliver both consistency and availability even when everything is going right, it is worse than useless.)
As an amusing-to-me aside, CA is what every system should be in the absence of partitions.
Sometimes, for latency reasons, you might want lower consistency even when the network is fully connected. Figuring out when you want that, though… maybe an open research problem, haha.
As aphyr said, you may want to do this for latency reasons. For example, theres PA/EL in Abadi’s PACELC taxonomy. Many traditional architectures with replications trees of relational databases offer these tradeoffs, as do many “quorum”-based database systems.
Along with latency, there’s also scale. Again, with relational databases it’s fairly common to have async replicated read replicas take read-only traffic, or to have multi-master configurations with async replication. These systems choose A over C entirely for performance reasons, and may not actually intend to be available under partitions. In fact, many choose a “bounded staleness” model, where they stop returning data after some known staleness, which is not possible to achieve with full availability under partitions. These kinds of systems - very sensible systems - are neither C (linearizable) or A (fully available) under partitions.
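A sketch of the bounded-staleness idea (this is not any particular database's implementation; the bound and the refusal behavior are the invented parts):

```python
import time

STALENESS_BOUND = 5.0  # seconds: refuse reads older than this

class BoundedStalenessReplica:
    """An async read replica that serves possibly-stale reads, but
    refuses to answer once its copy is older than the bound. Under a
    partition longer than the bound it gives up availability rather
    than serve arbitrarily stale data -- so it is neither C nor A
    in the CAP sense."""

    def __init__(self):
        self.value = None
        self.last_sync = time.monotonic()

    def replicate(self, value):
        """Apply an async replication message from the primary."""
        self.value = value
        self.last_sync = time.monotonic()

    def read(self):
        if time.monotonic() - self.last_sync > STALENESS_BOUND:
            raise RuntimeError("replica too stale; refusing read")
        return self.value
```

While replication messages keep flowing, reads succeed but may lag the primary (not C); once they stop for longer than the bound, reads fail (not A). That's exactly the "sensible but neither C nor A" tradeoff described above.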
This is true. Actually, the extreme strength of the notion of consistency Brewer used (that is, linearizability) is a point that can be used to argue against the conclusions of the CAP theorem, because depending on the data model, many systems can be meaningfully consistent without full linearizability.
I’m not aware of any work to prove (or disprove) the CAP theorem for different notions of consistency, though I would conjecture that the lower bound on consistency possible while maintaining availability is uselessly low.
I’m not aware of any work to prove (or disprove) the CAP theorem for different notions of consistency
I suggest http://www.vldb.org/pvldb/vol7/p181-bailis.pdf, which includes a handy summary of impossibility results for various consistency models.
I can’t remember the name well enough to find it on Google, but MSR had an interesting paper trying to figure this out to some degree. Pileus or something.
(looks like your memory is totally CA.)
Yep, that’s it! That only does these adaptive things on reads.
This article is making me really nervous but I know I don’t have the distributed chops to prove it wrong.
I’ll say this: when an author starts talking about probabilistic interpretations of the theorem and going on about cost-benefit analysis (seriously, why are we worked up about poor “administration interfaces” here?!) my BS needle starts twitching. And when they do that when an impossibility proof exists that shows element availability and atomic consistency are not both possible, it starts swinging around madly.
The article reads like an awful lot of language lawyering around fairly well understood concepts, but I’m not sure what the motivations of the author are.
Heh… Sigh. It reads like an attempt to illuminate, but a bad one. That seems worthwhile if it were shorter and clearer; I don’t think the concepts are actually all that well understood, unfortunately. At a previous job, after two months of arguing that Riak was the wrong choice for the company, I finally got through:
Me: “What exactly is the benefit you’re hoping for from using a distributed technology? Uptime? Preventing data loss?”
Them: “Yes, both of those.”
Me: “Those are mutually-exclusive in our situation.”
Them: “Oh… Maybe something else would be okay.”
(And no, they aren’t inherently mutually exclusive, but the data was peculiar and merging later, after resolving a partition, wasn’t an option. I can’t go into it.)
I definitely don’t want that to be read as an insult to the intelligence of the person involved; they were quite competent. It’s just that databases are a subject not all engineers actually know very much about, and distributed ones are a rather new technology in the scheme of things.
It’s worth noting that not all distributed systems are databases, too, of course!
That’s not what the impossibility proof says–he references that paper.
“In 2002, Seth Gilbert and Nancy Lynch publish the CAP proof. CA exists, and is described as acceptable for systems running on LAN.”
“If there are no partitions, it is clearly possible to provide atomic, available data. In fact, the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these types of algorithms.”
I don’t think CAP is very well understood. I think folks end up very confused about what consistent means, and what partition-tolerant means.
I think this is pretty well researched. I’m not sure why cost-benefit analysis makes you nervous.
James Hamilton of AWS says it best, I think:
Mike also notes that network partitions are fairly rare. I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts, remain a popular discussion point in post mortems industry-wide.
Gilbert & Lynch’s implicit assertion is that LANs are reliable and partition free; I can buy this in theory but does this happen in practice? When Microsoft performed a large analysis of failures in their data centers, they found frequent loss occurring that was only partially mitigated by network redundancy.
But either way you make a fair point: CA models aren’t strictly precluded by that proof. I’m just not certain I’ve seen a network that is trustworthy enough to preclude partitions.
Network partitions are not even remotely rare, honestly. LANs are actually worse culprits than the Internet, but both do happen.
You already cited one of the better sources for it, but mostly I believe this because I’ve been told it by network engineers who I respect a lot.
Even if network partitions were rare, I’ll tell you what aren’t (for most people): garbage collections. What I did not like about this post is it, over and over again, just talks about network partitions and the actual networking hardware. But weird application-specific things happen as well that appear to be unresponsive for longer than some timeout value and these are part of the ‘P’ as well.
In reality, I think CAP is too cute to go away but not actually adequate in talking about these things in detail. PACELC makes the trade-offs much clearer.
LANs are actually worse culprits than the Internet
Funny you mention that: over the past few days I’ve been fighting an issue with our internal network that has resulted in massive packet loss internally (>50% loss in some spikes), and ~0.5% to the Internet. That’s probably why this article raised my eyebrows - it’s my personal bugbear for the week.
The culprit seems to have been a software update to a Palo Alto device that stopped playing nice with certain Cisco switches… plug the two of them together and mumble mumble spanning tree mumble loops mumble. The network guys start talking and my eyes glaze over. But all I know is that I’ve learned the hard way to not trust the network - and when a proof exists that the network must be reliable in order to have CA systems, well…
Heh - my sympathies.
I think some of the confusion comes from describing all node failures as network partitions. In reality “true” network partitions are rare enough (lasting in durations long enough to matter to humans), but nodes failing due to hardware failure, operational mistakes, non-uniform utilization across the system, and faulty software deploys are sometimes overlooked in this context.
I like the comment above (“It’s worth noting that not all distributed systems are databases, too, of course!”), but I think this is also a matter of perspective. Most useful systems contain state; isn’t twitter.com, as a network service, a distributed database? Kind of neat to think about.
It’s not clear to me that the distinction the author makes between a CA and a CP system exists. He uses ZooKeeper as an example of a CP system, but the minority side of a network partition in ZooKeeper cannot make progress, just like his CA example. In reality, CP seems to me a matter of degree, not a boolean. Why does a CP system that handles 0 failures have to be different from one that handles 2f-1?
When the system availability is zero (not available at all) after a partition, you can claim both CP and CA (that’s the overlap between CP/CA).
There are two corner cases when the system is not available at all:
the system does not even restart after the partition. You can claim CP theoretically. The proof’s definitions don’t prevent this formally. But it makes little sense in practice.
the system restarts after the partition and remains consistent. Both CP and CA are ok.
But ZooKeeper is not concerned by these corner cases, because it is partly available during the partition.
No, you can’t: a system which is not available during a partition does not satisfy A, and cannot be called CA. If you could claim both CA and CP you would have disproved CAP.
CA means: I have a magical network without partitions. If my network turns out not to be that magical, I will end up CP or AP, and more likely in a very bad state: not fully available and not fully consistent.
I’m responding to “When the system availability is zero (not available at all) after a partition, you can claim both CP and CA”. Please re-read Gilbert & Lynch’s definition of A: you cannot claim CA if you refuse to satisfy requests during a partition.
But those magic networks do not exist, so how can a CA system exist?
:-) It exists until there is a partition. Then the most probable exit is to manually restore the system state, 2PC with heuristic resolution being an example.
Or, if you build a system for machine learning: 20 nodes with GPU, 2 days of calculation per run. If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.
That system is still CP. You are tolerating the notion of partitions, and in the case of a partition you sacrifice A (fail to fulfill a request–a job in this case) and restart the entire system for the sake of C.
It exists until there is a partition.
If a system reacts to a partition by sacrificing availability - as it must, and you haven’t demonstrated differently - how can you claim it is CA?
If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.
I feel like I’m in bizarro world.
If the system sacrifices availability (it could also be consistency, or both), then there is an overlap between CA and CP. That’s what Daniel Abadi said 5 years ago: “What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”
The key point is that forfeiting partitions does not mean they won’t happen. To quote Brewer (in 2012) “CA should mean that the probability of a partition is far less than that of other systemic failures”
That’s why there is an overlap. I can choose CA when the probability of a partition is far less than that of other systemic failures, but I could still have a partition. And if I have a partition I will be either inconsistent, or unavailable, or both, and I may also have broken some of my system invariants.
I’m sure it does not help you as I’m just repeating my post, and this part is only a repetition of something that was said previously by others :-(
Trying differently, maybe the issue to understand this is that you have:
CAP as a theorem: you have to choose between consistency and availability during a partition. There are 3 options here:
full consistency (the CP category)
full availability (the AP category)
not consistent but only partial availability (not one of the CAP categories, but possible in practice, typically 2PC with heuristic resolutions: all cross-partition operations will fail).
CAP as a classification tool with 3 options: AP/CP/CA. These are descriptions of the system. CA means you forfeited partition tolerance, i.e. a partition is a major issue for the system you build.
And, in case there is any doubt: most systems should not forfeit partition tolerance. I always mention 2PC with heuristic resolutions because it is a production-proven exception.
Could you rephrase your statement? I am having trouble parsing what you have said.
the cr went away. let me edit.
If we take your second case - as it’s the only real case worth discussing, as you note :-) - how can you claim the system is available?
The system is CA under a clean network until time n, when the network partitions; the partition clears at time m. So from [1, n) and (m, inf) the system is CA, but during [n, m] it is unavailable. Can we really say the system maintains availability? That feels odd to me.
Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior; in the presence of a partition it’s better to die hard than give a potentially inconsistent answer.
Having said all of this, my distributed systems skills are far below those of the commentators here, so please point out any obvious missteps.
CA is forfeiting partition tolerance (that’s how it was described by Eric Brewer in 2000). So if a partition occurs it’s out of the operating range, you can forfeit consistency and/or availability. It’s an easy way out of the partition tolerance debate ;-). But an honest one: it clearly says that the network is critical for the system.
Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior;
Yep it works, Daniel Abadi solved the overlap by merging CA and CP (“What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”) It’s not totally true (a CA system can lose its consistency if there is a partition, like 2PC w/ heuristic resolutions), but it’s a totally valid choice. If you make the same choice as Daniel in CAP, you choose CP for system 2 above. CA says “take care of your network and read the documentation before it is too late”.
seriously, why are we worked up about poor “administration interfaces” here
:-) Because I’ve seen a lot of systems where the downtime/data corruptions were caused mainly by: 1) software bugs 2) human errors.
I also think that a lot of people take partition tolerance for granted (i.e. “this system is widely deployed in production, so it is partition tolerant, and I’m sure everybody has network issues all the time, so I can deploy it safely myself w/o thinking too much about the network”). Many systems are not partition tolerant (whatever they say). That’s why Aphyr’s tests crash them (dataloss, lost counters, …), even though they are deployed in production.
It does not mean they have no value. It’s a matter of priority. See Aphyr’s post on ES: imho they should plan for partition tolerance but immediately implement crash tolerance, for example, instead of trying to do both at the same time.
I prefer a true “secure your network” rather than a false “of course we’re partition tolerant, CAP says anything else is impossible” statement (with extra points for “we’re not consistent so we’re available”).
CAP tells you that you can’t have both C and A when a partition happens. Most people take that to mean you must choose one or the other and have a CP or AP system. But it’s worth remembering that you do have the option of making sure that partitions never happen - either by making the system non-distributed or by making the communications reliable enough. And for some use cases that might be the correct approach.
 In a probabilistic sense - you can’t ensure that a network partition never happens, but nor can you ensure that you won’t lose all the nodes of your distributed system simultaneously. Any system will have an acceptable level of risk of total failure; it’s possible to lower the probability of a network partition to the point where “any network partition is a total system failure” is an acceptable risk.
I think it’s important to modify your statement a bit. What you have to do is ensure that in the face of a partition you remain consistent then try your darnedest to reduce the frequency of partitions. The distinction being you have control over what happens during a partition but not control over a partition happening.
you have control over what happens during a partition but not control over a partition happening.
I don’t think that this sharp distinction exists. You don’t have absolute control over what happens during a partition - to take an extreme example, the CPU you’re running on might have a microcode bug that means it executes different instructions from the one you intended. And you do have control - to the extent that you have control of anything - over the things that cause network partitions; you can construct your network (or pay people to construct your network) so as to mitigate the risks. It is absolutely possible to construct a network which won’t suffer partitions (or rather, in which partitions are less likely than simultaneous hardware failures on your nodes) if you’re willing to spend enough money to do so (this is rarely a smart choice, but it could be).
I do not think Byzantine faults really matter for this discussion; they are a whole different issue from partitions. But I do not think your response invalidates my point at all. Partitions are something that happens to you; how your program handles them is something you do.
Most languages know better than to allow nilness to pass silently, just as they know better than to have asynchronism happen silently. The whole point of monads is to allow you to do this kind of thing in a low-overhead, but still explicit way.
The behavior described in the article didn’t seem very “monadic” to me either. Swallowing errors appears to be the opposite of what Monads are about - treating side effects explicitly. That said, I can definitely see the benefit of this behavior in certain cases, for example when it matters more to make something work than to make it correct.
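That “explicit but low-overhead” propagation can be sketched in a few lines of Python (a hypothetical `Maybe` class, not any real library):

```python
class Maybe:
    """A minimal Maybe/Option: wraps a value that may be absent."""
    def __init__(self, value):
        self.value = value

    def bind(self, f):
        # Propagate absence explicitly: only apply f when a value exists.
        return self if self.value is None else Maybe(f(self.value))

# Chaining stops at the first None instead of raising, but the
# propagation is visible at every call site via .bind().
result = Maybe("  hello ").bind(str.strip).bind(str.upper)
empty = Maybe(None).bind(str.strip).bind(str.upper)
```

Nothing is swallowed silently: every step that might short-circuit is marked by a `.bind()`, which is the explicitness the comment above is pointing at.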
Objective-C’s bottom propagation seems to work really well; perhaps it’s a cultural and not a technical distinction?
You’re absolutely right that implicit handling of nil can cause errors to be detected further from their causes–but the same is true of currying, which allows type errors to arise from widely separated parts of a program. Different languages have different habits around these kinds of convenience/safety tradeoffs.
FWIW, I find Clojure’s nil handling generally more convenient than a hindrance–functions which define nil often serve as their own base case in recursive algorithms, for example, which reduces the need for explicit branching. This is especially helpful when building up maps! And while I agree that Clojure’s default nil behaviors can make it easier to make mistakes, core.typed is quite effective at preventing NPEs when you want that degree of safety.
You’re absolutely right that implicit handling of nil can cause errors to be detected further from their causes–but the same is true of currying, which allows type errors to arise from widely separated parts of a program.
Are you equating runtime errors and type errors? I’ve not seen currying create confusing type errors. Confusing runtime errors, yes definitely, but if you have types - no.
Source-sink distances in GHC type errors are generally quite good, whereas I’ve seen things like Clojure vectors being functions create mind-bending source-sink distances in errors.
I think the way Swift handles this is a pretty happy medium. Values aren’t implicitly nullable, and nil propagation is explicit rather than implicit. This means that:
When you’re okay with nil propagating, you can very easily just chain optional values by doing something like:
let myValue : SomeType? = myObject?.doSomething()?.value as? SomeType
And myValue would be nil if myObject is nil. Useful in the cases when something like that is acceptable.
But if you absolutely need the value to not be nil, then you throw in an explicit optional unwrapping and you get crashes if it is nil:
let myValue : SomeType = myObject!.doSomething().value as! SomeType
And in this case myValue would either be non-nil or you’d get an exception.
In my own code, if I ever have methods return self, it is because I am trying to implement some sort of chainable API. This pattern is popular when constructing objects in Rust, and is often called the “Builder” pattern. That is, instead of implementing a constructor that takes 5 different parameters like this:
let mut my_obj = MyObj::new("some string", 1, SOME_ENUM, true, false);
You use the “Builder” pattern to make it much more clear (also more verbose, but hey, “tradeoffs”):
let my_obj = MyObj::new("some string")
The nice thing about this in Rust is that you can keep the mutability confined to object construction. After the object gets assigned to my_obj, it is considered immutable (you would have to write let mut my_obj to change this behavior).
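The same builder shape carries over to other languages; here’s a hedged sketch in Python (MyObj and its options are invented for illustration):

```python
class MyObjBuilder:
    """Mutability confined to construction, mirroring the Rust pattern."""
    def __init__(self, name):
        self.name = name
        self.count = 1          # sensible defaults for the optional fields
        self.verbose = False

    def with_count(self, count):
        self.count = count
        return self             # returning self makes the API chainable

    def with_verbose(self, verbose):
        self.verbose = verbose
        return self

    def build(self):
        # Freeze the result: after build(), there is nothing left to mutate.
        return (self.name, self.count, self.verbose)

my_obj = MyObjBuilder("some string").with_count(5).with_verbose(True).build()
```

Each setter names its argument at the call site, which is where the “more verbose, but clearer” tradeoff comes from.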
let mut my_obj
I like builders and have written APIs that provide builder patterns, but I really prefer option maps where the language makes it possible. For instance:
let my_obj = MyObj::New("some string",
Why not use option maps everywhere? I suspect it has to do with type systems. Most languages only have unityped maps where any key is allowed, but options usually have fixed names and specific but heterogeneous types. The option map above has booleans, integers, and enums, for example.
In languages like Java, it’s impossible to specify type constraints like “This map has a :foo? key which must be a boolean, and has a :mode key that can only be one of these three values”. Using a builder with explicit type signatures for each function lets you statically verify that the caller is using the correct keys and providing values of the appropriate type.^
Of course, all this goes out the window when folks start reading config files at runtime, because you can’t statically verify the config file, so type errors will appear at runtime anyway, but you can certainly get some static benefit wherever the configuration is directly embedded in the code.
^Know what a heterogeneous map is in Java? It’s an Object! From this perspective, builders are just really verbose option maps with static types.
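For contrast, in a language with keyword arguments the option map comes nearly for free; a minimal Python sketch (make_obj and its options are hypothetical):

```python
def make_obj(name, count=1, mode="default", verbose=False):
    """Options are named, defaulted, and composable as plain dicts."""
    return {"name": name, "count": count, "mode": mode, "verbose": verbose}

# Option maps compose: shared defaults can be merged and overridden,
# which is exactly the composability builders tend to lose.
common = {"mode": "fast", "verbose": True}
obj = make_obj("some string", **{**common, "count": 5})
```

The `**{**common, "count": 5}` merge is the option-map composability in one expression: later keys win, so call sites can layer defaults.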
I agree that it’s a shame to sacrifice the composability of option maps. I would prefer a builder API which is sugar on top of merging option maps, with an easy way of getting to the option maps when I want that composability.
You can also have builder APIs which are pure and return new objects instead of self. In Rubyland, ActiveRecord’s Relation API is like this, which is great because intermediary results can be shared:
posts = Post.where(user_id: 1).order("published_at DESC")
latest_posts = posts.limit(10)
favourites = posts.where(favourite: true)
This provides one part of the composability you get from option maps, but not all of it. Unfortunately I don’t think the ActiveRecord::Relation API is built in a way that lets you build up relations with option maps when you want.
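A pure, chainable builder of the Relation flavor is easy to sketch in Python with frozen dataclasses (Query and its methods are stand-ins, not ActiveRecord):

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Query:
    """Each method returns a new Query, so intermediate results are shareable."""
    filters: tuple = ()
    max_rows: Optional[int] = None

    def where(self, clause):
        return replace(self, filters=self.filters + (clause,))

    def limit(self, n):
        return replace(self, max_rows=n)

posts = Query().where("user_id = 1")
latest_posts = posts.limit(10)           # posts itself is unchanged
favourites = posts.where("favourite")    # branches off the same intermediate
```

Because every step returns a fresh immutable value, sharing `posts` between `latest_posts` and `favourites` is safe by construction.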
I’d argue that named parameters solve the opaque-list-of-arguments problem with much less complexity.
As with all things, it depends on what kind of complexity. It increases complexity in the language for a decrease in complexity in your code. This may or may not be worth it, depending on how much you increase complexity in the language.
Questionable; have you seen Python’s positional/named parameter assignment rules? Granted, they’d be much simplified by killing the *args and **kwargs constructs, but at a language level, named parameters are definitely more complicated. On the other hand, they do make life somewhat simpler for language users. It’s a tradeoff.
Regardless, I think either is a perfectly acceptable solution to the problem.
The article is very western-centric. Most of the world doesn’t do Daylight Savings Time anymore. Japan doesn’t, China doesn’t, Kazakhstan doesn’t, Russia doesn’t, even US state of Arizona doesn’t.
Gosh, just look at the wikipedia, they do have a nice visual map up there – pretty much the whole world doesn’t do DST anymore! I’d say, might as well keep your clock at Beijing Time, no need to bother with UTC.
The article does not recommend that one use DST; it simply advises that if one finds oneself operating servers in those locales (and I assure you there are a nontrivial number of servers in the US), it’s a good idea to avoid the hassle of DST.
Picking an arbitrary offset (e.g. Beijing time) works, but it does tend to complicate your time arithmetic somewhat. You must:
I’m sure I speak for everyone here when I assure you that I have never made any of these mistakes personally and they have never cost me weeks of my life. ;-)
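For instance, even a simple comparison forces a conversion step; a small Python sketch of the arithmetic a fixed offset buys you:

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # a fixed offset, no DST

# A timestamp recorded in "Beijing time" must be normalized before it
# can be compared with timestamps from servers using other offsets.
local = datetime(2015, 3, 8, 10, 0, tzinfo=BEIJING)
utc = local.astimezone(timezone.utc)
```

Storing and comparing in UTC everywhere makes this conversion a one-time edge concern instead of something every comparison has to get right.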
I think extrapolating from “my program got slower with threads (and there was an easy fix)” to “most programs get slower with threads” is quite the leap.
I think the point is more: “it is easy to get multi-threading wrong and hurt performance in ways you may not expect if you’re unfamiliar with how multi-threading works.”
Multi-threaded programs can, and very often do, run much more slowly than the equivalent single-threaded program.
The point that I was trying to make is that Amdahl’s law gives us the wrong intuition about the performance of multi-threaded programs. The worst case of Amdahl’s law is a wash: the multi-threaded code runs in the same time as the equivalent single-threaded code. Unfortunately, that doesn’t match reality. In the real world, poorly written (or poorly optimized) multi-threaded code runs slower than the equivalent single-threaded program.
That doesn’t mean that threads are bad, just that they aren’t a magic ointment that makes all programs faster. If there is contention, especially if it’s contention that requires a context switch, they can make code go slower. Sometimes shockingly so.
The second thing I was trying to talk about is how modern Linux has some very cool tools for tracking down these kinds of performance problems. perf (or perf-events) is extremely powerful, and combines a lot of what you get from profilers and strace into one package with much less runtime overhead. In addition, its ability to do system-wide profiling is very handy for cross-process interactions. Many other operating systems have equivalent, and some better, tools.
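The contention point above can be sketched in Python (an extreme case, since CPython adds a global lock of its own, but the shape is general):

```python
import threading

N = 100_000
counter = 0
lock = threading.Lock()

def work(iterations):
    global counter
    for _ in range(iterations):
        with lock:              # every increment contends on one shared lock
            counter += 1

work(N)                         # single-threaded baseline: no contention
single_total = counter

counter = 0
threads = [threading.Thread(target=work, args=(N // 4,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Same total work, but four threads now fight over the lock; wall-clock
# time is typically worse, never better, for a workload shaped like this.
```

Both versions compute the same answer; only the threaded one pays for lock handoffs and context switches, which is exactly the “slower with threads” trap.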
In the real world, poorly written (or poorly optimized) multi-threaded code runs slower than the equivalent single-threaded program.
I’ve done a lot of concurrent programming over the last four years, and this has almost never been my experience working with Erlang, Clojure, and java.util.concurrent, but YMMV I guess. I tend to see sublinear scaling owing to unexpected locks in the stdlib (hi Integer.parseInt), or known synchronization points like interning and queue contention, but I don’t think I’ve ever hit a real-world computational problem where I haven’t been able to get a ~4-10x speedup out of a 16-core box by slapping a threadpool on it. Usually takes a profiler to get to that near-linear domain though.
It was harder to get useful behavior out of multithreading in the bad old C/C++ days where there was heavy reliance on out-of-process locks. People know how to do things better than lock-spaghetti now.
Isn’t this just a fancy ircd?
Excuse me but how is this better than Hadoop?
best comment on lobsters to date
I suppose you could look at it that way, but I think there is more value than just that…
The main point is accessibility to non-technical team members, given the better UX of tools like this. Even being a developer myself, I don’t think I’d like IRC much without IRCCloud these days.
With Slack and similar services, there are also lots of integrations into other services, like GitHub, etc. You could set up bots on IRC for these things, but these services make it more accessible.
There are many other competitors in this space:
I don’t use Slack myself, but have used FlowDock at a previous company.
With a decent mobile client and ‘session synchronization’ by default.
not to mention a built-in bouncer
I was of the same opinion initially, but just started using this on my team for the last week and the integrations/sync/device support are pretty good – combining a lot of what we were using IRC and Google Hangouts separately for.
Whether exactly-once delivery is possible depends a lot on what you mean by “message”, “exactly-once”, and “delivery”.
First, ‘atomic broadcast’ IS possible. It is practically possible (and done frequently) to build a distributed system where multiple nodes process the same messages in the same order, all exactly once. This is behind the classic “distributed state machine” approach to building fault-tolerant systems. See http://dl.acm.org/citation.cfm?id=98167 for a classic paper in the area and http://dl.acm.org/citation.cfm?doid=1041680.1041682 for more information on the general topic. In short: building a totally ordered stream of messages and delivering them in the same order to multiple nodes is not only possible, but done frequently.
So far so good, but there are two big caveats here. One is that getting this right requires significant coordination, which comes with both availability and latency costs. The second is that it’s not really what people mean when they say “message delivery”. Most people mean that each message gets delivered once to one consumer, which does something with that message that has side effects. That becomes trickier, because we need to start talking about failure tolerance.
Consider the system where the queue hands the message off to the consumer, and does a handshake that makes both agree that the message has been handed off. Now, the consumer goes off and does something with that packet that has side effects: it changes the world in some way. Finally, the consumer once again runs a protocol which makes both it and the queue agree that the message has been processed. What happens when the consumer fails?
apy’s post gives two of the possibilities for handling that case. There are others, but they aren’t any better.
The core problem here is that exactly once delivery is fundamentally at odds with fault tolerance. Exactly-once delivery and processing fundamentally requires that the act of processing, and hence knowledge about the fact the processing happened, is kept at just one place in the system. If that one place fails, the system needs to reconstruct that fact, but has no way to do so. It then needs to decide between re-delivering the message (and possibly having it processed multiple times) or dropping the message (and possibly having it never processed).
Ok, so it’s impossible. Where does that leave us? It should be pretty obvious to you that many real-world systems rely on exactly-once processing of tasks and messages. How can they do that if it’s impossible?
Think about Bob, who runs a pizza shop with online ordering. When people order from Bob, their orders go into Bob’s persistent queue. Bob’s workers take a pizza order off the queue, bake it, deliver it, and go back to the queue. Occasionally one of Bob’s workers gets bored and leaves early, in which case Bob gives the order to a different worker. Sometimes, this means that multiple pizzas arrive at the customer’s house (and never fewer than one). On arriving, the pizza delivery guy asks the home owner if they had received a pizza with that order ID before. If the home owner says yes, the pizza guy takes the duplicate pie with him. If not, he leaves the pie. Each home owner gets exactly one pie, and everybody is happy.
Short version without pizza: exactly-once delivery is impossible. Exactly-once processing of messages is possible if the processing can be made idempotent.
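Bob’s protocol is just idempotent processing keyed on an order ID; a toy Python sketch (durable state replaced by in-memory collections):

```python
processed = set()   # would be durable in a real system
deliveries = []

def deliver(order_id, pie):
    """At-least-once delivery plus an idempotency check at the door."""
    if order_id in processed:
        return          # duplicate: the pizza guy takes the pie back
    processed.add(order_id)
    deliveries.append(pie)

# The queue may redeliver after a worker wanders off...
deliver(42, "margherita")
deliver(42, "margherita")   # redelivery: no additional effect
```

The queue still only promises at-least-once; the exactly-once *effect* comes from the consumer checking the order ID before acting.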
I think mjb is exactly right. To expand on this a bit:
Implementations of distributed replicated state machines in the literature generally assume that operations, once received by a node, are atomically and durably logged to disk. Real disks are not so reliable, which often entails some degree of log replaying, where operations are journaled before being applied to some state machine and applied again to recover from a checkpoint in the event of failure. Moreover, running an operation on multiple replicas is assumed to be safe: if the operation does something like “Lower Gertrude one meter deeper into the volcano”, executing it on one versus three replicas could mean the difference between a successful sampling expedition and a very unhappy geologist.
Both of these constraints lead us to a hand-wavy notion that operations must be in some sense idempotent in the real world, and on each replica, they have to transform a state deterministically. These properties are key to crash recovery, but not all functions satisfy these properties.
What people generally mean by “exactly once delivery” of a message is something like “This function will be invoked atomically and exactly once.” But we know this property is not, in general, satisfiable. Consider:
run_winch :counterclockwise, 1
Now imagine trying to call this function once on a single node, and knowing if we crash, whether the lowering has or has not occurred:
If we crash between logging and lowering, Gertrude remains a meter too high to grab her sample. What if, instead, we try
Now if a crash occurs between lowering and logging, the computer thinks that Gertrude still needs to go a meter deeper, even though she’s now at the correct altitude. The geologist and winch engineer wind up having a tense conversation punctuated by the odor of Gertrude’s burned boots. They decide instead to augment the lowering process with some extra information:
current_height = gertrude.rangefinder.height
Together, they’ve modified the task itself so that it is idempotent. Notice that they had to couple the idempotency of this function to the state it manipulates–e.g. Gertrude’s sensor package–so that it is safe to call multiple times.
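In Python terms, the sensor-coupled version might look like this (Winch and its rangefinder reading are of course stand-ins for Gertrude’s hardware):

```python
class Winch:
    """Toy model: height stands in for the rangefinder's measurement."""
    def __init__(self, height):
        self.height = height

    def lower_to(self, target):
        # Idempotent: move by the *measured* remaining distance, so
        # re-running the task after a crash does no extra lowering.
        delta = self.height - target
        if delta > 0:
            self.height -= delta

winch = Winch(height=10.0)
winch.lower_to(9.0)
winch.lower_to(9.0)   # crash recovery replays the task: still at 9.0
```

The task now names an absolute target rather than a relative motion, which is what makes replaying it safe.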
tl;dr: In any message queue, or any transactional system in general, we cannot provide atomicity guarantees for arbitrary side-effecting functions. Those functions must be carefully designed to allow for repeated invocation, because repeated delivery is implicit in crash recovery where some state may be lost.
I think this covers the case of the subscriber client pretty well. Does this also cover the case of the broker also (meaning that we assume that a queue is a side-effecting data structure)?
It’s not clear to me that you can think about a broker without clients as a meaningful message queue.
So exactly-once is possible if there isn’t an in-order requirement? (I assume that’s what you mean by requiring idempotence)
No, perhaps I explained poorly.
Talking about idempotence was trying to explain how systems typically get around the problem of exactly-once being impossible. Basically, you embrace the fact that you can’t apply every operation exactly once, so you design for at-least-once. If you make your operations idempotent, then their effects end up being applied exactly once.
The goal of systems designed like this is delivery at-least-once (for completeness) and approximately-once (for efficiency). Idempotence then gives exactly-once effects at the cost of a little bit of efficiency. Obviously the challenge is designing operations that are idempotent (and commutative and associative if conditions require it).
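A max-register is a classic example of an operation that is idempotent, commutative, and associative all at once; sketched in Python:

```python
import functools

def merge(state, update):
    # max is idempotent (merge(x, x) == x), commutative, and associative,
    # so duplicated or reordered deliveries all converge on the same value.
    return max(state, update)

updates = [3, 7, 7, 5, 7]   # duplicates from at-least-once delivery
forward = functools.reduce(merge, updates, 0)
backward = functools.reduce(merge, list(reversed(updates)), 0)
```

However the queue duplicates or reorders the updates, every replica folding them with `merge` lands on the same state.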
Exactly-once delivery and processing fundamentally requires that the act of processing, and hence knowledge about the fact the processing happened, is kept at just one place in the system
Let me expand this, tell me if I’m wrong.
My first instinct is to say “No it doesn’t because the processor (client) sends an ACK back to the broker”. But the problem with my statement would be that the ACK may not arrive.
For instance, the server that I send the ACK to dies right after sending me the message so it doesn’t receive my ACK. The failover server picks up where the other server left off and resends the message, in which case I get the message delivered a second time even though I’ve already successfully processed it.
My rebuttal is “what if the load balancer has knowledge of which servers have a replica of the session and can smartly choose one of them to route the traffic to?”. This took me a while to figure out, but I eventually realized that there’s still a gap between when the server becomes unavailable and when the cluster (and LB) realize that it’s unavailable. So there’s still plenty of time where my ACK will get dropped without the failover server recognizing it. Again, the problem here is that a network failure appears as an unresponsive server, but so does a long garbage collection cycle (or many other normal, naturally occurring tasks).
In order to solve exactly-once delivery by adding another node…first you must solve exactly-once delivery. Just, inductively, if it’s impossible to solve with N nodes, it will be impossible to solve with N + 1 nodes.
EDIT: Also note that this problem is not just queues, it’s any communication. Exactly once HTTP requests are impossible as well.
IronMQ does not guarantee exactly-once delivery; failure or delay in either the client or the network can result in multiple delivery attempts. I don’t understand how this got published; it’s like claiming to have solved two-generals over the public internet.
As with most articles written from a single point of view, I’m having trouble discerning what really happened here.
On the one hand, we’re asked to believe that Kathy issued no DMCA takedowns due to lack of evidence. Fair enough. I believe her.
On the other hand, we’re asked to believe that Andrew doxxed Kathy without any evidence provided. (Newsflash: I’ve “admitted” to plenty of things over the years to get a guy into bed, that doesn’t mean that I actually did any of them.)
Obviously, Kathy has faced some very difficult times, and for that I certainly feel for her, but the addition of logical fallacies peppered about in this post make it a little tough to chew.
She directly addresses this in the post: even if weev wasn’t the one who doxxed her, he very clearly endorsed the doxxing. We need to speak out against people endorsing that behavior, just as much as we need to speak out against the perpetrators.
Over a candlelit dinner of tuna sashimi, Weev asked if I would attribute his comments to Memphis Two, the handle he used to troll Kathy Sierra, a blogger. Inspired by her touchy response to online commenters, Weev said he “dropped docs” on Sierra, posting a fabricated narrative of her career alongside her real Social Security number and address. This was part of a larger trolling campaign against Sierra, one that culminated in death threats. Weev says he has access to hundreds of thousands of Social Security numbers. About a month later, he sent me mine.
Now can we please get back to not doubting the victim, especially where, you know, evidence is published in a major newspaper? I’ve seen quite enough of your posts regarding women and LGBT people here already; it’s leaving an acid taste in my mouth.
Clearly: journeysquid is an obnoxious misogynist, and equally clearly, weev either doxxed her or strongly endorsed it, which are roughly morally equivalent. You’re correct on those points.
But: “not doubting the victim” can be a fallacious approach. One should maintain a healthy skepticism about any claim, in any field; whether or not the claimant may in fact be the victim of something terrible. Terms like “not doubting the victim” can be, and sometimes are, used to shut down critical discussion in much the same way as terms like “warmist” and “denier” in the field of climate change.
out of interest, what logical fallacies made it hard for you to read?
Wait, Ruby doesn’t have constant folding for literals?
I’m not a rubyist, but I think that it’s because symbols are supposed to take the place of string literals, and ruby strings are mutable by default.
Symbols are interned automatically, and you’re expected to use symbols in most places where you would use interned strings in other languages. My guess is that symbols are supposed to be thought of as keys with semantic meaning (and since each key is meaningful, may be reused many times) whereas strings are thought of as fancily encoded array bytes, and may be sliced and diced and combined with other strings any which way.
Strings are mutable by default, and you must “freeze” strings (as noted by @benolee) if you want an immutable string. If string literals were interned by default, and we also had mutable strings, they would have very odd semantics. There are two options:
We intern mutable strings. This means that when I ask for a “foo” but someone else has already used “foo” in the code, and then modified it to be uppercase, I actually see “FOO” when I inspect my “foo”.
We intern immutable strings, but COW them and adjust the reference of the var so that they turn into mutable ones later on. So when I check whether my two literal strings “bar” are references to the same object, they evaluate to true! But when I uppercase one or both of them, then they point to separate objects. Or else we create a copy when we inspect the object so that a equal? b is consistent. So this means that equal? may entail an object allocation… again, this is a very weird semantic.
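Python’s strings are immutable, so it can’t show the problem directly, but option 1’s aliasing surprise can be mimicked with mutable lists and a naive intern table:

```python
intern_table = {}

def naive_intern(chars):
    # Intern *mutable* values: equal inputs come to share one object.
    key = tuple(chars)
    return intern_table.setdefault(key, chars)

a = naive_intern(list("foo"))
b = naive_intern(list("foo"))   # the very same object as a, not a copy
a[0] = "F"                      # mutate "my" string...
# ...and b silently becomes ["F", "o", "o"] too.
```

This is exactly the “I asked for foo but see FOO” scenario: interning plus mutation means spooky action at a distance between unrelated call sites.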
So yeah, it seems a little unusual to me, since I spend most of my time on the JVM, but I guess it works for them.
Ruby 2.1 added some optimizations for frozen strings: http://tmm1.net/ruby21-fstrings
Not for String literals, anyway.
This is another fantastic article about distributed systems testing by @aphyr. I have been really glad to see that he’s been breaking down how to actually use Jepsen–perhaps we’ll start seeing more and more people applying Jepsen to their projects.
With respect to etcd in particular, this has finally pushed me over the edge to the point where I’d consider building a system with etcd, rather than only considering ZK.
I would actually hold off if safety is critical. Given how ZK, Doozer, Chubby, etc went, it’ll be another five years or so before they iron out all the kinks, haha.
Yeah I was gonna say, this seemed to me to argue that sticking with Zookeeper for now is probably the right choice.
One relevant question for @aphyr: IIRC your original ZK article was with an old version of Jepsen without the linearizability checker, have you tested ZK with knossos at all?
Not yet, no. Each post is between 50-100 hours of work, and this is all nights+weekends, so it takes a while.
Yeah I understand, just curious. Thank you for doing these, we all appreciate it :)
Is there any way I can tip you? A paypal account maybe? I have learnt a lot about distributed systems from your blog. I want to show my appreciation by sending some funds.