1. 8
  1.  

  2. 4

    From the “Does this make me a bad person?” File:

    At some point our Heroku-hosted Ruby service gained a strange memory leak that yielded no timeouts.

    It was not a severe or rapid leak: Swapping would begin about 8-12 hours after app start, whether that was Heroku’s daily restart or a deploy from us. With such a long time gap, we would only see it if someone glanced at the metrics dashboard at the beginning of the workday (no deploys having occurred yet). A bold warning would proclaim:

    “In the last 24 hours, there have been 8043 critical errors for this app.”

    Which sounds scary, but consisted of 8033 “R14 Memory Quota Exceeded” and 10 “H27 Client Request Interrupted” errors. Exact numbers fluctuated, but the general R14:H27 ratio remained. Zero timeout errors. Stranger still, the memory use would settle in at ~160% of available, or so we thought. It was never clear if this was a true steady state, pure luck, or incidental to the timing of restarts and deploys.

    The icing on the cake: No appreciable difference in response times before and after swapping began.
    Some basic profiling on core app features turned up no obvious culprit, and again, no timeouts.

    So in a case of “If it ain’t broke, don’t fix it”, we let a few thousand memory errors roll in every day. For two years.

    In that time we eventually rebuilt most of the core product, but the leak remained. Our best guess was some outdated gem version used only by our then-sprawling-and-clunky admin reporting features.

    The leak eventually went away during a wave of gem upgrades (possibly validating our theory), but I still feel a haunting lack of closure at never fully understanding its source.

    1. 2

      As a consultant, I’ve investigated a few of these on different projects; here’s my toolkit for tracking them down:

      • Suspect every gem with C code in it. Audit your usage and run a test (with real data extracted from your real database) that repeats your calls a few thousand times; see the first sketch after the list. With a slow leak like this, it’s almost certainly not on a common code path, so prioritize gems + features used the least.
      • If you have any kind of repro, binary search gem upgrades/downgrades to dial it in. This is a tedious manual process.
      • Tinker with the integration tests to run them repeatedly and watch for the leak on your local machine.
      • Collect logs and look for a correlation between a particular URL and crashes.
      • Replay production request logs against a staging server (second sketch after the list).
      • Add JS to error pages to tattle to the server/error tracking service that a request failed. If users aren’t ever reporting 500s, it’s probably some ajax polling.
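
      Here’s a rough sketch of the “repeat the suspect call and watch memory” approach in plain Ruby, with no extra gems. SuspectGem.generate_report and sample_rows.yml are placeholders for whatever gem and extracted production data you’re auditing:

        # leak_hunt.rb -- run a suspect code path a few thousand times and watch RSS.
        def rss_mb
          # Resident set size in MB, read via ps (works on Linux/macOS, no extra gems).
          `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
        end

        def hunt(iterations: 5_000, sample_every: 500)
          baseline = rss_mb
          puts format("baseline: %.1f MB", baseline)
          iterations.times do |i|
            yield                            # the suspect call goes here
            next unless (i % sample_every).zero?
            GC.start                         # full GC first, so growth isn't just deferred collection
            current = rss_mb
            puts format("iter %5d: %.1f MB (+%.1f)", i, current, current - baseline)
          end
        end

        # Usage (SuspectGem and sample_rows.yml are placeholders):
        #   require "yaml"
        #   rows = YAML.load_file("sample_rows.yml")
        #   hunt { SuspectGem.generate_report(rows) }

      And a sketch of the log-replay idea, assuming you can export one request path per line from your router logs; the staging hostname and paths.txt are placeholders, and it only replays GETs to keep things side-effect free:

        # replay.rb -- replay logged GET paths against staging while watching its memory.
        require "net/http"
        require "uri"

        staging = URI("https://staging.example.com")    # placeholder host

        Net::HTTP.start(staging.host, staging.port, use_ssl: staging.scheme == "https") do |http|
          File.foreach("paths.txt") do |line|           # one path per line, e.g. /admin/reports?week=12
            path = line.strip
            next if path.empty?
            res = http.get(path)
            warn "#{res.code} #{path}" unless res.is_a?(Net::HTTPSuccess)
          end
        end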

      Sounds like you made a very smart business decision not to pursue this one to the ends of the earth, though.

    2. 4

      Hello