1. 5

    In general, this was an interesting read, but because I think it blames the technology a bit too much, I’d like to point out that:

    Before extracting Quiz Engine MySQL queries from our Rails service, we first needed to know where those queries were being made. As we discussed above this wasn’t obvious from reading the code.

    This is probably not what most people would do here as any kind of APM tool would clearly show you which query is executed where (among other things). For deeper investigations, there are things like rack-mini-profiler, etc.

    To find the MySQL queries themselves, we built some tooling: we monkey-patched ActiveRecord to warn whenever an unknown read or write was made against one of the tables containing Quiz Engine data.

    Even if you don’t use APM, there is no need for any monkey-patching: you can simply subscribe to the query feed with the instrumentation API: https://guides.rubyonrails.org/active_support_instrumentation.html#active-record
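
    A minimal sketch of that approach (the table names and the logging are illustrative, not from the post):

    # Hypothetical Quiz Engine tables to watch for.
    QUIZ_ENGINE_TABLES = /\b(quizzes|quiz_questions)\b/

    ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, _start, _finish, _id, payload|
      if payload[:sql].match?(QUIZ_ENGINE_TABLES)
        Rails.logger.warn("Quiz Engine query: #{payload[:sql]}")
      end
    end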

    Quiz Engine data from single-database MySQL to something horizontally scalable

    You don’t provide the numbers, so I can’t speak to the scale you are dealing with, but in general I wouldn’t want anyone reading this to think MySQL isn’t very scalable, because it is, horizontally and in other directions as well.

    1. 7

      For anyone else wondering, APM is an abbreviation for Application Performance Monitoring (or Management). It’s a generic term, not Ruby-specific.

      1. 6

        I think some of the “blame the tooling” comes from how starkly different it feels for us to use Haskell versus Ruby + Rails.

        With Rails, it’s particularly easy to get off the ground – batteries included! But as the codebase grows in complexity, it becomes harder and harder to feel confident that our changes all fit together correctly. Refactors are something we do very carefully, and we need information from tests, APM, etc., which means legacy code becomes increasingly hard to maintain.

        With Haskell, it’s particularly hard to get off the ground (we had to make so many decisions! Which libraries should we use? How do we fit them together? How do we deploy them? Etc.). But as our codebase has grown, it’s remained relatively straightforward and safe to do refactors, even in code where we have less familiarity. We have a high degree of confidence, before we deploy, that the code does the thing we want it to do. As the project grows in complexity, the APIs tend to be easier to massage in the direction we want, rather than us avoiding improvements out of brittleness, fear of regressions, or fighting with the test suite.

        For those who haven’t written a lot of code in statically typed ML-family languages like Elm, F#, or Haskell, the experience of “if it compiles, it works” feels unreal. My experience with compiled languages before Elm was with C++ and Java, neither of whose compilers felt helpful. It’s been a learning experience adopting & embracing Elm, then Haskell.

        1. 2

          This is probably not what most people would do here as any kind of APM tool would clearly show you which query is executed where (among other things). For deeper investigations, there are things like rack-mini-profiler, etc.

          I agree this information can also be found while monitoring, and we did rely on our APM quite a bit through the work (though this is not mentioned in the blog post), for example to see whether certain code paths were dead.

          A benefit of the monkey-patch approach, I think, was that it was maybe easier to interact with programmatically. For example: we made our test suite fail if it ran a query against a Quiz Engine table, and sent a warning to our bug tracker (Bugsnag) if such a query ran in staging or production (later we would throw in that case too).

          Didn’t know about the AR feed. That looks like it would have been a great alternative to the monkey-patch.

          Regardless, our criticism here isn’t really about the Rails tooling available for digging up this information; rather, we would have liked not to need to dig so much to know where queries were happening, i.e. for that to be clearer from reading the code.

          1. 2

            What APM tools did you use to give what info/data and what didn’t they provide that you needed to use other tools to fill the gap for?

            1. 1

              What APM tools did you use to give what info/data and what didn’t they provide that you needed to use other tools to fill the gap for?

              We primarily use NewRelic in our Rails codebase and Honeycomb in our Haskell codebase.

              NewRelic is a huge product, and I bet we could have gotten more use out of NRQL to find liveness / deadness, but we didn’t.

              We used NewRelic extensively to find dead code paths by instrumenting code paths we thought were dead and checking whether we saw any usage in production.

              For finding every last query, we wanted some clear documentation in the code of where queries were and where they weren’t. NewRelic likely could have provided the “where”, but our Ruby tooling let us track the progress of slicing out queries. The workflow looked like this:

              • Disable queries in a part of the site in dev (this would usually be at the root of an Action)
              • Ensure a test hits the part of the site with the disabled queries
              • Decorate all the allowed/known queries to get the test passing
              • Deploy, and see whether any queries ran in the disabled scope
                • If we did, write another test to ensure we hit the uncovered code path. Decorate some more.

              It looked something like this:

              SideEffects.deny do
                # Queries in this block are denied.
                data = SideEffects.allow do
                  Quiz.find(id) # This query is allowed.
                end
              end
              
        1. 9

          Good writeup! One interesting thing is that it was harder to find where database calls were happening in the “abstracted” Ruby than in the “inline” Ruby. Best practices for functionality conflicting with best practices for optimization.

          1. 11

            For anyone with some Rails experience, this is a well-known problem: ActiveRecord has magic accessors which will make a query on-the-fly when you try to access a property that hasn’t been loaded. It then caches the result, so the next access will not perform a query. Django has similar behaviour (although it is slightly simpler). This can result in extremely hard-to-fathom code. “Is this doing a query? Why/why not?” is what you’ll be asking yourself all the time, and there are often no clues in the code you’re looking at: some other method may have already cached the data for you once you hit the code you’re looking at.
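
            A minimal illustration of that behaviour (hypothetical models):

            post = Post.first     # SELECT * FROM posts ... LIMIT 1
            post.comments.to_a    # first access: SELECT * FROM comments WHERE post_id = ...
            post.comments.to_a    # second access: served from the association cache, no query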

            Sometimes, a method can be a bottleneck because it does a lot of queries, but only on a particular code path, because when coming through another code path, it will be passed a pre-cached object. Doing performance analysis on such code bases can be quite frustrating!

            1. 4

              Is there a generally-accepted way to deal with this problem? Is it just “don’t do that”? Asking as someone with little to no Rails experience.

              1. 4

                “It’s slow” is always hard to debug; “It’s slow, but only sometimes” more so.

                As an experienced rails dev, I lean heavily on runtime instrumentation in production.

                Looking for “where is the webserver spending most of its time” rapidly identifies the highest-priority issues; every serious instrumentation platform offers a straightforward way to view all call-sites which generate queries, along with their stack traces. From there it’s pretty easy to identify what the issue is; deciding where to add an explicit preload is the hardest part.
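
                For example, once the traces point at an N+1, the fix is often a one-line preload (hypothetical models):

                # N+1: one query for the posts, then one query per post for its author.
                Post.limit(10).each { |post| puts post.author.name }

                # Preloaded: two queries total, regardless of the number of posts.
                Post.includes(:author).limit(10).each { |post| puts post.author.name }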

                1. 3

                  I don’t know, but AFAIK you’re supposed to know what your access pattern is going to look like and prefetch/side load when needed. Or discover it when you find out there’s a bottleneck and fix it then.

                  1. 3

                    The general answer and the Rails-specific one are pretty different.

                    For ActiveRecord, there’s bullet, which helps avoid N+1 queries. But there’s no Rails-native way of doing it, as far as I know. Lots of teams re-invent their own wheels.
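
                    For reference, a minimal bullet setup looks something like this (a sketch; see its README for the full set of options):

                    # config/environments/test.rb
                    config.after_initialize do
                      Bullet.enable = true
                      Bullet.raise  = true # fail any test that triggers an N+1 query
                    end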

                2. 6

                  Interesting observation! It reminds me of this post. (My) TL;DR there is that extracting functions only helps for pure code; when there are mutations to global state, it’s easier to see them all in a single place. It seems that an OpenGL context and a database connection are similar cases.

                  This immediately suggests a rule-of-thumb for rails apps: do not abstract over SQL queries. When handling a single HTTP request, there should be one function, which lexically contains all the SQL queries for this request. It can call into helper (pure) functions, as long as they don’t issue queries themselves.
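
                  A sketch of that shape (all names hypothetical):

                  class QuizzesController < ApplicationController
                    def show
                      # Every query for this request lives here, lexically.
                      quiz      = Quiz.find(params[:id])
                      questions = quiz.questions.to_a # force the load now

                      # Pure helper: works on already-loaded data, issues no queries.
                      render json: QuizPresenter.present(quiz, questions)
                    end
                  end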

                  1. 3

                    It can call into helper (pure) functions, as long as they don’t issue queries themselves.

                    As I pointed out in my sibling comment to yours, this is difficult to verify. You could have one method that performs a query and side-loads related objects from the other side of a belongs_to relation, and another that does some processing on those objects. But processing the objects requires following the relation, which will trigger a query when the objects haven’t been side-loaded. So you could have another method that forgets to do the side-loading and voila, your “pure” helper function now issues a query under the hood.
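
                    A toy version of the trap (hypothetical models):

                    def titles(comments)                  # looks pure...
                      comments.map { |c| c.post.title }   # ...but c.post queries unless preloaded
                    end

                    titles(Comment.includes(:post).limit(10)) # no extra queries
                    titles(Comment.limit(10))                 # N+1 under the hood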

                  2. 4

                    Yes! Though I think even for functionality the inlining proved helpful. We encountered many cases where we were calling a function that took an array of elements but passed it only a single element. After inlining such a function we could remove a loop. In other cases we were calling a function with a hardcoded argument, and after inlining that we could remove unused conditional branches. It was super interesting to see how much logic evaporated in this process. Code that was useful at some point, but not anymore.
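
                    A toy version of that pattern (hypothetical names):

                    # Before: a general helper, only ever called with a single-element array.
                    def grade_all(quizzes)
                      quizzes.map { |quiz| grade(quiz) }
                    end
                    grade_all([quiz])

                    # After inlining, the loop (and the array) evaporates:
                    grade(quiz)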

                    I like what Sandi Metz wrote on this subject in The Wrong Abstraction.

                  1. 4

                    I would have loved to hear more about the Haskell replacement, it seems like only a few short paragraphs near the end of this excellent blog post were devoted to it.

                    1. 3

                      We’re going to write a series of posts about Haskell at NoRedInk. Coming soon!

                      1. 2

                        First in the series of blog posts: https://lobste.rs/s/frnnq4/bridging_typed_untyped_world

                        1. 1

                          Thank you!

                      1. 3

                        I wonder if the team considered writing the data store backend as a library in Rust, with Ruby bindings. That way it could run in-process, without the overhead of IPC or network round-trips. Rust provides strong static typing like Haskell, so the team would still have gotten the benefits of that.

                        1. 15

                          We didn’t consider Rust, but we did, as documented, build a sidecar app at first which brought latency down low enough. I’d be excited to adopt some Rust at NoRedInk though!

                          Our experience with Haskell has been great, and the learning curve is particularly shallow when all of our developers already write Elm.

                          The reason to pull out to a new service (as well as the reason to avoid different technologies) often has to do with decision fatigue rather than optimal architecture: We’ve built a nice template for an HTTP service in Haskell with, for instance, nice monitoring, deployments, etc.

                          Moving to a new language takes time to make all of the right decisions, and the same is true for service shape. How do we monitor, set up permissions, deploy secrets, etc. for our sidecar app? Do we want to build & maintain that deployment tooling? In our case, we decided it was worth the latency hit to be able to share tooling.

                          1. 8

                            Our fearless leader is extensively using Rust (and Zig) in the programming language he’s developing, Roc.

                            1. 4

                              Oh, that language looks very interesting. I think a “friendly” functional programming language that compiles to binaries would be a great addition.

                              If they haven’t seen it, this paper might be relevant to their work:

                              Perceus: Garbage Free Reference Counting with Reuse

                            2. 4

                              Although I use neither, Haskell and Rust look like a good fit, since they both focus on safety with support for FP and easier concurrency. If you use Rust, your main obstacle will be the borrow checker. Data-driven design helps there. If needed, you can cheat by using refcounting in Rust for what you can’t borrow-check. You still get the benefits of both its compiler and any 3rd-party code that does borrow-check.

                              So, Rust can be gradually learned and adopted as an accelerator for Haskell projects. If not accelerator, just a way to leverage the good things in its ecosystem (esp networking and data stores).

                              1. 17

                                Haskell, when written with care, can often be roughly as fast as Rust. It’s a natively compiled language that uses LLVM for optimization. With stream fusion and other high-level optimizations, you can end up with tight, vectorized assembly loops over primitive, unboxed types.

                                It’s likely more productive to add appropriate strictness hints, awkward structure, and ugly code to make the optimizer happy than to glue two languages together via an FFI, especially since the FFI is likely to act as an optimization barrier, suppressing inlining and making alias analysis hard.

                                So, while idiomatic Rust is likely to be faster than idiomatic Haskell, warty, optimized Haskell in hotspots is likely to be more productive than gluing things together via an FFI.