1. 19

  2. 3

    Looks like you may be the author. Small typo here:

    We looked into it and while we’re bsuy ramping up our platform, didn’t take the time to devote to fixing it.

    Nice write-up. We were investigating an issue like this at my workplace that turned out to be network latency, but we were considering the issue you linked as well [1]. Although our limits are very high, we may still run into this issue. Is the only fix to look for an updated kernel? From the Kubernetes issue I can't see much movement on it.

    [1] https://github.com/kubernetes/kubernetes/issues/67577

    1. 3

      Thanks! Typo is fixed. The weird thing about this is that the enforcement of any CPU limit is the problem. So, setting a high limit doesn’t help. Your only options are to not use CFS-based CPU limits (e.g. use CPU shares instead, which are otherwise nowhere near as good for normal web service workloads) or to upgrade to a patched kernel. You have to check carefully that the fix is actually included in the kernel you upgrade to, however: in older kernels it is available only as a back-ported patch, and vendors may or may not have applied it. Recent “proposed” track Ubuntu kernels seem to include it.
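
      To make the shares-versus-limits distinction concrete, here is a rough sketch of how a CPU request and a CPU limit map onto the two cgroup mechanisms. It assumes the default 100 ms CFS period and, to the best of my knowledge, mirrors the conversion Kubernetes applies; the function names are made up for illustration.

      ```python
      # Sketch only: assumes the default CFS period of 100 ms.
      CFS_PERIOD_US = 100_000


      def shares_from_millicores(request_m: int) -> int:
          """CPU shares: a relative weight that only matters under contention."""
          return max(2, request_m * 1024 // 1000)


      def quota_from_millicores(limit_m: int, period_us: int = CFS_PERIOD_US) -> int:
          """CFS quota: a hard cap on CPU time per period, enforced even on an idle host."""
          return max(1000, limit_m * period_us // 1000)


      if __name__ == "__main__":
          # A 500m request becomes a weight; a 500m limit becomes 50 ms of CPU per 100 ms.
          print(shares_from_millicores(500))
          print(quota_from_millicores(500))
      ```

      A very high limit still sets a quota, so the throttling code path stays active; only dropping the limit altogether avoids it.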

      1. 1

        The weird thing about this is that the enforcement of any CPU limit is the problem. So, setting a high limit doesn’t help.

        Thanks for pointing that out. I’ll definitely do some tests in our lab after the holidays to see if we’ll be affected. The way it was described to me previously was that it should only affect apps with tight CPU limits, but it seems like that’s not the case. A side note for anyone looking to see if they have this problem: it looks like there’s a good Go repro here
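
        Separately from the repro, a quick way to see whether a container is being throttled at all is to read the cgroup's cpu.stat counters. A minimal sketch, assuming the nr_periods and nr_throttled fields and a cgroup v1 path (the path varies by distro and cgroup version, and the function names are hypothetical):

        ```python
        from pathlib import Path


        def parse_cpu_stat(text: str) -> dict:
            """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
            stats = {}
            for line in text.splitlines():
                parts = line.split()
                if len(parts) == 2:
                    stats[parts[0]] = int(parts[1])
            return stats


        def throttle_ratio(stats: dict) -> float:
            """Fraction of enforcement periods in which the cgroup was throttled."""
            periods = stats.get("nr_periods", 0)
            return stats.get("nr_throttled", 0) / periods if periods else 0.0


        if __name__ == "__main__":
            # Assumed cgroup v1 location; adjust for your own setup.
            path = Path("/sys/fs/cgroup/cpu/cpu.stat")
            if path.exists():
                print(f"throttled in {throttle_ratio(parse_cpu_stat(path.read_text())):.1%} of periods")
        ```

        A high ratio on a workload that is nowhere near its limit would be consistent with this bug, though it is not proof on its own.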


    2. 1

      Can this affect plain Docker containers, too, outside of k8s or Mesos? From a quick Google it looks like Docker does use CFS (the Completely Fair Scheduler).

      1. 2

        Yes, it can. Docker supports both CPU-limiting mechanisms, so it depends on how you are setting up your containers.

        1. 1

          As far as I can tell we’re not setting any of the --cpu-* flags when we’re starting our Elasticsearch containers (the application I was worried about) so I think we’re probably not affected. Also, docker inspect is showing 0 for the various CPU options, which I assume means “not set”.
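
          For anyone wanting to script the same check, here is a small sketch that reads the CPU fields out of docker inspect output. It assumes the HostConfig field names CpuShares, CpuPeriod, CpuQuota, and NanoCpus, and that 0 means "not set"; the helper names are hypothetical.

          ```python
          import json


          def cpu_settings(inspect_json: str) -> dict:
              """Extract CPU-related HostConfig fields from `docker inspect <container>` output."""
              host_config = json.loads(inspect_json)[0]["HostConfig"]
              keys = ("CpuShares", "CpuPeriod", "CpuQuota", "NanoCpus")
              return {key: host_config.get(key, 0) for key in keys}


          def has_cfs_limit(settings: dict) -> bool:
              """True if a CFS quota is configured (via --cpu-quota or --cpus)."""
              return settings["CpuQuota"] > 0 or settings["NanoCpus"] > 0


          if __name__ == "__main__":
              # Stand-in for real `docker inspect` output with no CPU flags set.
              sample = json.dumps([{"HostConfig": {"CpuShares": 0, "CpuQuota": 0, "NanoCpus": 0}}])
              print(has_cfs_limit(cpu_settings(sample)))
          ```

          Note that CpuShares alone does not count here, since shares are a relative weight rather than a CFS quota.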

      2. 1

        Hi! Your <a> styling is a little hard to see on a mediocre monitor like mine. It took me a while to find the issue link in the text. Liked the article!

        1. 1

          Interesting, thanks for the feedback!

        2. 1

          I wish the article had at least included the kernel versions affected or, even better, pointed to the kernel commit that fixed the issue. Right now all we know is that something broke performance and how to fix it for Ubuntu 16.04.

          1. 1

            Did you follow the link to the kernel issue? That has the patch and you can find it in your favorite vendor’s kernel change log. https://lkml.org/lkml/2019/5/17/581