1. 9

Lately I’ve been talking to a lot of people about their tech stacks and people have spoken favorably about things which I have heard bad things about in the past. Two examples:

  • MongoDB was/is infamous for dropping writes and I had a horrible experience with it a few years back. However I’ve heard that this is no longer the case; write confirmation is enabled by default and the replication strategies have gotten better. I might not use it but maybe it’s not as bad as it was.

  • At Twilio in 2011 we had a rule not to use any Amazon service backed by EBS, including RDS, Route 53, ELB, SimpleDB, etc, etc. This saved us several times including the Christmas Eve outage. However time has passed and many people report positive experiences using those tools.

The line for “what’s good enough” is constantly moving, and if you are writing everything yourself these days you’re probably not spending your engineering time wisely. How do you weigh/evaluate technologies, especially as they change over time? There are probably two levels here.

  • how do you evaluate whether “X-as-a-service” is of a high enough quality now? eg “Having a third party host your database is probably good enough now (RDS, Heroku Postgres)” and

  • how do you evaluate changes in a specific tool over time? eg Mongo, ELB..

Thanks, Kevin


  2. 9

    The short, unhelpful answer is that it depends on what you’re doing.

    The somewhat longer, more useful answer is that there’s no good way to do this, so you should do it only a couple of times if you can. Concretely, in the OSS web infrastructure world (which is where I am from), you will typically pick a small set of very flexible infrastructure projects that you know really well, and deploy them everywhere, for as many things as you can. It’s easier to locate errors. It’s easier to deploy. It’s simpler to reason about the infrastructure.

    A concrete example is, if you have a really kickass Hadoop team, then it’s worth it to phrase your problems as MapReduce jobs if you can, even if it’s a slight abuse of Hadoop, because then you can just farm it out to your Hadoop cluster, and your problem is solved incidentally by your Hadoop team. Same goes for Redis, Riak, RabbitMQ, whatever.
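
    The “phrase it as MapReduce” idea can be sketched in miniature. This is a hypothetical word-count in plain Python, not a real Hadoop job — just the shape of the map and reduce steps that make a problem farmable out to a cluster:

```python
from collections import Counter
from functools import reduce

# Map step: each input record becomes a partial result.
def mapper(line: str) -> Counter:
    return Counter(line.split())

# Reduce step: partial results are merged associatively, so a
# framework could combine them in any order across many machines.
def reducer(acc: Counter, part: Counter) -> Counter:
    acc.update(part)
    return acc

lines = ["to be or not to be", "be here now"]
counts = reduce(reducer, map(mapper, lines), Counter())
# counts["be"] == 3, counts["to"] == 2
```

    If your problem can be bent into this map-then-merge shape, you inherit the cluster, the monitoring, and the Hadoop team for free.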

    Another thing to consider is that, in most cases, your team’s competence will limit you much sooner than your stack will. This is another reason to make as few big infrastructure choices as possible: it lets you deal primarily with one issue (your team’s competence) rather than two (competence AND a crazy stack that you don’t understand).

    1. 4

      Here is my general process, in some stream-of-consciousness form:

      • What are the theoretical limits of what the product claims to accomplish? Do they claim more than is possible? If so, that is a big red flag. It doesn’t mean that you cannot use what they are offering, but it does mean you have to delve much deeper. Using MongoDB as an example: it claims amazing performance, sharding, etc. But looking at the details, its data model does not work for an AP system, so it needs to be CP (or somewhere close). It uses asynchronous replication, which means it cannot be a CP system unless you are willing to lose data. Therefore, by the theoretical bounds, MongoDB has to lose your data. Redis Cluster and Sentinel have the same problem. With that information: can my problem afford to lose data? If yes, investigate the next thing; if no, toss it out.

      • If the text on the tin matches what is possible, the question is how well it is written. One option is to outsource this by waiting: other people will use it, find issues, generate a reputation, etc. Another is to depend on the reputation of the authors, if they have one (many Go reviews mention “It’s from Rob Pike” at one point or another). And finally, you can come up with tests and evaluate it yourself. That means you have to know what you’re looking for and how to test it, which is very time consuming.

      • But that might be too time consuming, so if your initial use case is unimportant enough, you can dip your toes into a new product with that. If it crashes and burns you’ve lost some time, but that might be ok.

      • You can also run in a hybrid mode where you move some portion of your system to the other one. If you’re using a database you can do writes to both and reads from one or something similar. For the EBS example, some portion of your fleet could be on EBS with redundancy on non-EBS machines. That way if EBS fails you can just pull the machines out of traffic and investigate.

      • But then there is the problem of ‘black swan’ events. What if EBS only breaks once every quarter? You need to make sure you run your tests or hybrid mode long enough to have seen this scenario.
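
      The dual-write idea above can be sketched as a thin wrapper. The Primary/Candidate stores here are hypothetical stand-ins (plain dicts) for an existing trusted database and the new one under evaluation:

```python
class DualWriteStore:
    """Writes go to both stores; reads are served from the trusted primary.

    Divergence is recorded so the candidate can be evaluated over a long
    enough window to catch rare ('black swan') failures.
    """

    def __init__(self, primary, candidate):
        self.primary = primary
        self.candidate = candidate
        self.mismatches = []

    def put(self, key, value):
        self.primary[key] = value
        try:
            self.candidate[key] = value       # a candidate failure must not
        except Exception as exc:              # affect the primary write path
            self.mismatches.append((key, exc))

    def get(self, key):
        value = self.primary[key]             # serve from the trusted store
        if self.candidate.get(key) != value:  # shadow-read for comparison
            self.mismatches.append((key, value))
        return value

# Usage: two plain dicts stand in for real databases.
store = DualWriteStore({}, {})
store.put("user:1", "kevin")
assert store.get("user:1") == "kevin" and not store.mismatches
```

      Once the mismatch log stays empty for long enough, you can promote the candidate; until then it never sits on the serving path.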

      In the end though, it comes down to understanding your problem and the possible solutions, generally biasing for simplicity. At this point, I would probably not run a database unless it has a long history of success or is based on technical papers I can read that explain it sufficiently. But I read a lot of academic work around my interest area so I am capable of making reasonable judgements based on the documentation of a product.

      1. 2

        But I read a lot of academic work around my interest area so I am capable of making reasonable judgements based on the documentation of a product.

        One should avoid confusing documentation with what’s actually implemented. You can claim to implement Raft or some other consensus protocol all you want, but that doesn’t mean the implementation is actually correct.

      2. 3

        I think a migration strategy helps. Pick two or three options from the same space, then try to draw a box around their common functionality. Use only the features in the box. If your good enough tool blows up, you move to another option. This won’t save you from some operational failures (AWS/EBS going down entirely), but can help if you notice “brownout” damage, like highly variable EBS latency. Then you move your storage to another platform. The difficulty of migrating depends on how deep the hooks are.

        Mistakes will happen, so it’s more about dealing with them than avoiding them. The solution to all problems is another layer of abstraction. :) (Except when it’s one less layer of abstraction. ha)
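
        The “box around their common functionality” can be sketched as a minimal interface. The BlobStore class and its backend are hypothetical names, just to show the shape:

```python
from abc import ABC, abstractmethod

# The "box": only the features every candidate backend supports.
# Anything outside it (fancy queries, vendor extensions) is off-limits,
# which is what keeps a later migration cheap.
class BlobStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]

# A second backend (S3, a filesystem, another database) would implement
# the same two methods. Application code only ever sees BlobStore, so
# swapping backends is a configuration change, not a rewrite.
store: BlobStore = InMemoryStore()
store.put("a", b"hello")
```

        The depth of the hooks is exactly how much of your code talks to the backend directly instead of through the box.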

        1. 1

          This has always been the “proper” use of modules: draw abstraction boundaries to isolate decisions.