1. 55
  1.  

    1. 9

      I got involved in a funny little conversation about how Lobsters’ performance informs this post. (backstory)

      1. 6

        We should soon know if my expectations hold true for lobste.rs: https://github.com/lobsters/lobsters/pull/1442

        1. 11

          If anyone’s curious, this is now live in prod. I’ll remove it roughly tomorrow morning (depends on my social calendar). I pasted a few lines from the log to the PR so it’s obvious there’s no personal info going to byroot.

          I know I’ve said this before, but one of the reasons I find maintaining the Lobsters codebase rewarding is that it’s landed in a sweet spot where it handles real-world complexity and some scale, but is small, standard-style, and with few enough dependencies that devs can find their way around quickly. People have used it for corporate training on writing good tests, to learn or experiment with Rails, to improve Ruby itself, to design a database, to crib small examples from. We’ve benefited hugely from the generosity of the projects we’re built on and we get to share onwards in turn.

          1. 2

            I’m excited to see the results!

      2. 7

        This is a very interesting article. I was originally taken aback by the initial “Not IO bound” comment, but the point that our usual understanding of IO time is actually conflated with the OS scheduling of threads was very well taken. I hadn’t considered that before. I think my original reaction still stands, though, if in a pedantic way. Looking at:

        YJIT only speeds up Ruby by 2 or 3x

        and

        Like Discourse seeing a 15.8-19.6% speedup with JIT 3.2, Lobsters seeing a 26% speedup, Basecamp and Hey seeing a 26% speedup or Shopify’s Storefront Renderer app seeing a 17% speedup.

        I still feel that if a component sees a 2-3x perf increase and that only translates to a 1.15x-1.27x overall improvement, then it’s a significant component (and well worth optimizing), but it isn’t the dominant/limiting factor.

        Towards the end of the article Jean gets into some specific numbers regarding “truly IO bound” being 95% and “kinda” being 50%. I asked him on Mastodon about them. https://ruby.social/@byroot/113877928374636091. I guess in my head “more than 50%” would be what I would classify as “IO bound.” Though I’ve never put a number to it before.

        Someone recently tagged an old thread of mine in a private Slack where I linked to this resource, https://www.youtube.com/watch?app=desktop&v=r-TLSBdHe1A, with this comment:

        Samuel shared this in Ruby core chat, and (spoiler) that’s actually one trick for debugging performance. They want to answer the question “Is this code worth optimizing” i.e. “if we made this code 2x faster…would anyone care.” Because if you make something 100x faster that only accounts for 1% of your total time, people aren’t really going to notice.

        So they can’t arbitrarily make code faster, but they CAN make code arbitrarily slower. So the program simulates a speedup of one code section by making all the other code slower, to report whether it’s worth optimizing. An all-around interesting talk, and very approachable as well.
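
        A minimal sketch of that trick for a serial program (my own toy example, not the talk’s actual code): to simulate making one section 2x faster, run everything else 2x longer and halve the measured total.

```ruby
require "benchmark"

# Toy illustration of a "virtual speedup" (all names here are made up):
#   T          = candidate + rest
#   T_slowed   = candidate + 2*rest
#   T_slowed/2 = candidate/2 + rest   <- as if candidate were 2x faster
def candidate
  100_000.times { |i| i * i }          # section we wish we could optimize
end

def rest
  400_000.times { |i| Math.sqrt(i) }   # everything else in the program
end

baseline  = Benchmark.realtime { candidate; rest }
slowed    = Benchmark.realtime { candidate; rest; rest } # rest padded to 2x
simulated = slowed / 2.0 # predicted total if candidate were 2x faster

puts format("predicted whole-program speedup: %.2fx", baseline / simulated)
# If that prints close to 1.0x, nobody would notice the optimization.
```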

        It would be interesting to have some kind of IO backend where you could simulate a slowdown, i.e. perform the query on the database and time it, then sleep for some multiplier of that time before returning. It would (in theory) let you put a number on how much your app is affected by (database) IO. If you set a 2x multiplier and you see requests take 2x as long, then you’re approaching 100% IO bound.
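
        A rough sketch of what that could look like (a hypothetical wrapper, not an existing gem): time the real operation, then sleep for the remaining fraction of the multiplier before returning.

```ruby
require "benchmark"

# Hypothetical IO-slowdown shim: with multiplier 2.0, every wrapped call
# appears to take roughly twice as long as it really did.
class SlowedIO
  def initialize(multiplier:)
    @multiplier = multiplier
  end

  def call
    result = nil
    elapsed = Benchmark.realtime { result = yield }
    sleep(elapsed * (@multiplier - 1)) # pad total to ~multiplier * elapsed
    result
  end
end

slowed = SlowedIO.new(multiplier: 2.0)
rows = slowed.call { sleep(0.01); [:a, :b] } # stand-in for a real query
# If end-to-end request times also roughly double, you're approaching
# 100% IO bound; if they barely move, the database isn't your bottleneck.
```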

        The linked GVL timing gem is new and interesting. Overall, thanks for writing all this down. Great stuff.

        1. 4

          I guess in my head “more than 50%” would be what I would classify as “IO bound.”

          Perhaps I should have written about that in my conclusion, but ultimately “IO-bound” isn’t a perfect term for everything I want to say.

          In a way it’s a good term, because the implication of an IO-bound app is that the only way to improve its performance is to somehow parallelize the IOs it does, or to speed up the underlying system it does IOs with.

          With that strict definition, I think YJIT proved that it isn’t the case, given it was able to substantially speed up these applications.

          A more relaxed way I tend to use the IO-bound definition, in the context of Ruby applications, is whether or not you can substantially increase your application’s throughput, without degrading its latency, by using a concurrent server (typically Puma, but also Falcon, etc.).

          That’s where the 50% mark is important. 50% IO implies 50% CPU, and one Ruby process can only accommodate “100%” CPU usage. And given threads won’t perfectly line up when they need the CPU, you need substantially more than 50% IO if you wish to process concurrent requests in threads without impacting latency because of GVL contention.

          So beyond trying to say whether apps are IO-bound or not, I mostly want to explain under which conditions it makes sense to use threads or fibers, and how many.
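
          To put back-of-the-envelope numbers on the 50% argument above (my figures, not the article’s): if a request holds the CPU for some fraction of its time, one process saturates a core at roughly the reciprocal of that fraction in threads, and that assumes perfect interleaving, which real threads won’t achieve.

```ruby
# Ideal-case thread capacity for one Ruby process (the GVL allows only
# one thread on CPU at a time); real apps need more IO headroom than this.
def max_threads_before_cpu_saturation(cpu_fraction)
  (1.0 / cpu_fraction).floor
end

[0.5, 0.25, 0.1].each do |cpu|
  puts "#{(cpu * 100).round}% CPU -> ~#{max_threads_before_cpu_saturation(cpu)} threads"
end
# At 50% CPU only ~2 threads fit, so any scheduling jitter already
# shows up as GVL wait time and added latency.
```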

          1. 3

            50% IO implies 50% CPU, and one Ruby process can only accommodate “100%” CPU usage. And given threads won’t perfectly line up when they need the CPU, you need substantially more than 50% IO if you wish to process concurrent requests in threads without impacting latency because of GVL contention.

            Are you comparing multiple single threaded processes to a single multithreaded process here? Otherwise, I don’t understand your logic.

            If a request takes 500msec of cpu time and 500msec of “genuine” io time, then 1 request per second is 100% utilization for a single threaded server, and queue lengths will grow arbitrarily. With two threads, the CPU is only at 50% utilization, and queue lengths should stay low. You’re correct that there will be some loss due to requests overlapping, and competing for CPU time, but it’ll be dominated by the much lower overall utilization.

            In the above paragraph, genuine means “actually waiting on the network”, to exclude time spent on CPU handling the networking stack/deserializing data.

            P.S. I’m not expressing an opinion on “IO bound”; it’s not one of my favorite terms.
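
            Working through the 500 msec example above as plain arithmetic:

```ruby
# 500 ms CPU + 500 ms genuine IO per request, at 1 request/second.
cpu_s = 0.5
io_s  = 0.5
arrival_rate = 1.0 # requests per second

# Single-threaded: the worker is busy for the whole service time.
single_thread_utilization = arrival_rate * (cpu_s + io_s) # 100%: queue grows

# Multithreaded on one core: IO waits overlap across threads, so only
# CPU time keeps the core busy.
cpu_utilization = arrival_rate * cpu_s # 50%: queue stays bounded

puts "single-threaded worker: #{(single_thread_utilization * 100).round}% busy"
puts "CPU with two threads:   #{(cpu_utilization * 100).round}% busy"
```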

            1. 2

              Are you comparing multiple single threaded processes to a single multithreaded process here?

              Yes. Usually when you go with single-threaded servers like Unicorn (or even just Puma configured with a single thread per process), you still account for some IO wait time by spawning a few more processes than you have CPU cores. Often it’s 1.3 or 1.5 times as many.

              1. 2

                I don’t think there’s anything special about 50% CPU. The more CPU time the worse, but I don’t think anything changes significantly at that point; I think it’s going to be a relatively linear relationship between 40% and 60%.

                You’re going to experience some slowdown (relative to multiple single-threaded processes) as long as the arrival rate is high enough that there are ever overlapping requests to the multithreaded process. Even if CPU is only 1/4 of the time, your expectation is a 25% slowdown for two requests that arrive simultaneously.

                I think, but am not sure, that the expected slowdown is “percent cpu time * expected queue length (for the cpu)”. If queue length is zero, then no slowdown.

          2. 1

            You can achieve some of that with a combination of cgroups limiting the database performance and https://github.com/Shopify/toxiproxy for extra latency/slowdown on TCP connections.

          3. 5

            I agree with a lot of this. I actually think the article is being too kind; in the wild I’ve seen many Rails apps degrade into being more and more CPU-bound, mainly due to coding decisions they’re more or less expected to make (the Rails Way).

            Most Rails performance issues are database issues

            I’ve worked on some projects where rendering out the template was about an order of magnitude slower than the DB query itself.

            I’d also argue a lot of the database issues are because Rails divorces you from the DB layer too much and favors more inefficient database design decisions (no materialized views, etc).

            Most apps also use a background job runner, Sidekiq being the most popular, and background jobs often take care of lots of slow IO operations, such as sending e-mails, performing API calls, etc.

            Sidekiq ends up being the dumping ground for workloads that are too slow to do in-band on the request path. These include converting files and performing calculations, but also things that look IO-bound yet involve CPU-intensive work due to Ruby being slow, such as loading thousands of ActiveRecord objects into memory and iterating over them.

            However, over the last couple of years, many people reported how YJIT reduced their application latency by 15 to 30%. Like Discourse seeing a 15.8-19.6% speedup with JIT 3.2, Lobsters seeing a 26% speedup, Basecamp and Hey seeing a 26% speedup or Shopify’s Storefront Renderer app seeing a 17% speedup.

            Many of these companies actually take performance very seriously and were probably making fewer mistakes. I imagine the speedup would be even more significant at companies without that know-how. That is to say, the average large Rails app is in worse shape than Shopify’s.

            1. 3

              I’ve worked on some projects where rendering out the template was about an order of magnitude slower than the DB query itself.

              I’m curious how you measured that. Several tools report view rendering as slow when the real cost is database fetching that only happens during template rendering (because Active Record relations are lazy).

              1. 1

                I guess that part got confusing; I meant for that sentence to be read in conjunction with the first part about Rails leading one towards inefficiency. Yes, it was N+1 queries and lazy loading popping up in the render phase. My point was that while those are problems in the database interaction, the DB itself was not the issue, just the misuse of it, which was mostly caused by following Rails recommendations. The ideal query eventually ran ~10x faster after a different schema design.
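
                A toy, Rails-free illustration of that measurement trap: the “relation” below is lazy, so the simulated queries only fire while the “view” iterates it, and a naive timer blames rendering.

```ruby
require "benchmark"

# Fake lazy relation: each iteration pays one simulated SQL round trip,
# mimicking an N+1 pattern that surfaces during template rendering.
class FakeLazyRelation
  include Enumerable

  def initialize(count)
    @count = count
  end

  def each
    @count.times do |i|
      sleep 0.002 # stand-in for one query's round trip
      yield "comment #{i}"
    end
  end
end

comments = FakeLazyRelation.new(10)                  # "controller": no queries yet
controller_time = Benchmark.realtime { comments.itself }

render_time = Benchmark.realtime do                  # "view": all 10 queries fire here
  comments.map { |c| "<li>#{c}</li>" }.join
end

puts format("controller: %.3fs, render: %.3fs", controller_time, render_time)
# A profiler points at the render phase, but the cost is database misuse.
```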

            2. 1

              It actually means that performing the query and getting the thread scheduled again took 20 milliseconds

              Normally, there would be a distributed trace showing that the client span took 20ms while the actual query took 1ms. If such a margin shows up a lot, some service-mesh-aware tools might raise an alarm and help the devs prioritize looking into the app server’s performance.

              In some cases there might be network congestion involved (and let’s admit, you’ve heard “it’s the network!” a lot when it wasn’t), but that’s unlikely to account for a whopping 19ms at P50. This type of discrepancy would quickly get demystified on a highly loaded app, IMO.

              Also, if the telemetry is mature enough the traces should also show some system health metrics for the machine that ran the client span where we would see that the CPU consumption was near 100%. That’s a good sign that we cannot trust the time on that machine much, thus the client span is not reliable either. The next step would be running a profiler on the busy node to find out what’s going on. Likely there will be something about spending a lot of time in some JSON module. So, again not much space for mystery.

              If the overall CPU consumption is low then yeah we’re in trouble with the app’s performance. But if it fits the SLOs and does not lead to any incidents or an excessive cloud bill, then… let it be?