1. 13

Would’ve liked to see specific code examples, and more details on the perf improvements, which seem substantial (750ms to 14ms) but are only quoted as “average latency”, which I’ve come to learn is a fairly useless measure, especially when running on the JVM, where garbage collection pauses can be substantial.

  1. 6

    I had the same thought as the poster’s comment: seeing the distribution of tail latency would have been good. Nonetheless, it seems like there is a big speedup. I’d be very interested in seeing a report from people doing a rewrite in the same language. It seems like they chose a far better architecture the second time around, so it’s hard to say how much Scala helped vs. the rearchitecting. Also, I don’t think I saw any discussion of scaling across different numbers of machines, so I assume they kept that constant?

    1. 4

      I work on Finagle, which underlies the technology that Duolingo switched to (Finatra), so I have a horse in this race, but I wanted to talk a little more about what you were saying about averages being useless.

      The tricky thing is that latency is something we get when we make a request and get a response. This means that latency is subject to all kinds of things that actually happen inside of a request. Typically requests are pretty similar, but sometimes they’re a little different: the thread that’s supposed to epoll and find your request might be busy with another request when you come in, so you wait a few extra microseconds before being helped; or you contend on a lock with someone else and that adds ten microseconds. All of those things add up to your actual latency.

      In practice, this means that your latency is subject to the whims of whatever happens inside your application, which is probably obvious to you already. What’s interesting here is what kinds of things might happen in your application. Typically in garbage-collected languages the most interesting thing is a garbage collection, but other things, like having to wait for a timer thread, can also have an effect. If 10% of your requests need to wait for a timer thread that ticks every 10ms, then those requests’ latency will be the normal request latency plus a uniform draw from [0, 10ms).

      This means that when people call normal metrics mostly useless for latency, they usually mean the aggregate latency, which mixes samples that had to wait for the timer thread, samples that had to sit through a garbage collection, and so on. It isn’t that the distributions they construct are particularly odd; it’s that those distributions are composed of many other quite typical distributions. There’s a normal distribution for the happy path, another normal distribution for the garbage collections, and a uniform distribution for when you were scheduled on the timer, and put all together they make a naive average difficult to interpret.
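
      To make that concrete, here’s a toy simulation; it’s a sketch with made-up numbers (the 10ms happy path, the 10% timer-wait rate, and the 200ms GC pause hitting 0.1% of requests are all assumptions, not anything measured from Duolingo’s service):

      ```scala
      import scala.util.Random

      object LatencyMixture extends App {
        val rng = new Random(42)

        // One request's latency: a happy path around 10ms, plus a uniform
        // [0, 10ms) timer wait for 10% of requests, plus a rare long GC pause.
        def sampleMs(): Double = {
          val happyPath = math.max(0.0, 10.0 + rng.nextGaussian() * 2.0)
          val timerWait = if (rng.nextDouble() < 0.10) rng.nextDouble() * 10.0 else 0.0
          val gcPause   = if (rng.nextDouble() < 0.001) 200.0 else 0.0
          happyPath + timerWait + gcPause
        }

        val samples = Array.fill(100000)(sampleMs()).sorted
        def pct(p: Double): Double = samples((p * (samples.length - 1)).toInt)

        println(f"mean=${samples.sum / samples.length}%.2fms  p50=${pct(0.50)}%.2fms  " +
          f"p99=${pct(0.99)}%.2fms  p9999=${pct(0.9999)}%.2fms")
      }
      ```

      With a mixture like that, the mean lands near the happy path and says nothing about the 200ms mode that one in a thousand requests hits, which is exactly why a single number is so hard to interpret.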

      But we can make an informed guess here, which is that the happy path is probably around 10ms now and was probably around 750ms before, which is quite a nice improvement. As for the unhappy path, my suspicion is that JVM GC pauses are better than Python GC pauses, but it’s quite difficult to tell for sure. My guess would be that their GC pauses are on the order of hundreds of milliseconds, and were previously also on the order of hundreds of milliseconds, so the p9999 is probably still better than the p50 they saw previously.

      Anyway, this is all to say that averages are useless, but also that just knowing the p50 or p99 is sort of useless too. Really, what I want to see are the actual histograms. As a side note, Finatra exports real histograms, so if you get hold of one of the Duolingo people, I’d be pretty interested to see some of those graphs.
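
      For what it’s worth, this is roughly what feeding such a histogram looks like with Finagle’s StatsReceiver; a sketch, not Duolingo’s code, and `doHandle` is a hypothetical stand-in for the actual request handler:

      ```scala
      import com.twitter.finagle.http.{Request, Response}
      import com.twitter.finagle.stats.{LoadedStatsReceiver, Stat}
      import com.twitter.util.Future

      object LatencyRecording {
        // A Stat is histogram-backed, so percentiles come from real samples.
        val latencyMs: Stat = LoadedStatsReceiver.stat("request_latency_ms")

        // Hypothetical handler standing in for the real one.
        def doHandle(req: Request): Future[Response] = ???

        // Stat.timeFuture records the future's elapsed time into the stat.
        def handle(req: Request): Future[Response] =
          Stat.timeFuture(latencyMs)(doHandle(req))
      }
      ```

      On a TwitterServer-based app (which a Finatra server is), those stats then show up on the admin interface, if I remember right under /admin/metrics.json.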

      1. 2

        Agreed, a histogram – or any more details around performance – would have been useful. It’s unclear what they measured and what was sped up, so it’s hard to evaluate anyway.

        And this is the problem with precision without accuracy: if you’re telling me 750ms and 10ms, that sends me a different signal than 750ms and 14ms. In fact, if I weren’t going to dive deep into the perf aspects, I might have either dropped the numbers altogether (and stated “more than an order of magnitude improvement”) or said “50 times faster” (750 / 14 ≈ 54), and then I would’ve gotten the gist of the speedup (which seems awesome) without tripping over the concrete numbers (especially 14).