1. 21

Hey Lobsters!

I’m with Scout APM, and we’re working with Nate Berkopec to do some free, in-depth performance audits of Rails open source projects. We’ve been talking with Peter (pushcx) about doing one for Lobste.rs, and we wanted to open it up to community discussion before going forward.

For us, this is a way to raise awareness of Rails performance issues and best practices, and to give back to the OSS community. We’re covering the cost of Nate’s time to work on these projects, and we’ll release the audit publicly to the community to give some direction around optimizing the codebase.

On the Lobste.rs side, we’ll submit a PR adding our monitoring gem to the codebase and will set up our config file with Peter independently. After the audit, only Lobste.rs admins will have access to the account, unless there is community interest in making that data public.

After about a week of gathering data, Nate will do the performance audit. We’re going to record a screencast of him going through the codebase and debugging performance issues, and will also write up a document with notes and suggested next steps for optimization.

Regarding privacy, here are a few things to know about Scout:

  • We don’t send form data from an application back up to our servers unless it is explicitly added to the trace by calling something like Context.add(secret: params[:secret]), which it won’t be.
  • If there’s potentially private info in query params (/foo?private_thing=secret), the ?… part can be excluded, which we are happy to enable – docs here.
  • By default, Scout collects the URI and IP of a traced request. However for the sake of user privacy, we’ll set the parameter to disable collection of remote IPs in the config – docs here.
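To make the query-param point concrete, here’s a rough sketch (illustrative Ruby, not our actual agent code) of what excluding the `?…` part of a collected URI amounts to:

```ruby
require "uri"

# Rough illustration (not Scout's actual implementation): keep the path,
# which still identifies the endpoint for performance purposes, and drop
# anything after "?" so potentially private values never leave the app.
def strip_query(raw_uri)
  uri = URI.parse(raw_uri)
  uri.query = nil     # discards "private_thing=secret"
  uri.fragment = nil
  uri.to_s
end

strip_query("https://lobste.rs/foo?private_thing=secret")
# => "https://lobste.rs/foo"
```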

We’ll keep this post open for a week for discussion, so let us know if you have any thoughts or questions!

Cheers, Casey


  2. 9

    This seemed useful for everyone involved, so I invited them to run this by folks, especially for privacy concerns. We’ve got known slow spots like story submission (5-10 sec!?) and puzzles that don’t quite hurt enough to grab dev attention, so there are probably also some unknown unknowns that we could fix. Also, I just randomly met Nate at a conference yesterday and I expect good results from his analysis.

    1. 3

      known slow spots like story submission

      I’ve pushed a fix for this.

      1. 3

        Conveniently, privacy analysis is my area of professional expertise. I’ll try to find time to take a close look at this. I encourage community members to also look at it; don’t rely on me to catch everything. Privacy is always about answering questions in the form “what’s missing from this picture?”, and more eyes do help.

        1. 1

          Just curious, whatever happened to this?

          1. 2

            In the first week I found a bad bug that had cost us a big chunk of ram and some crashes. This also fed into the plan to use ElasticSearch that we’ve changed to try Postgresql’s full-text search instead (which also gets us recursive common table expressions, a handy feature for comment threads).
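            For anyone curious why recursive CTEs help with comment threads, here’s a sketch (hypothetical schema with a comments table holding id, parent_comment_id, and story_id; not the actual Lobste.rs query):

            ```sql
            -- Walk a story's comment tree from its top-level comments downward,
            -- tracking depth for rendering nesting. One query instead of one
            -- query per level of replies.
            WITH RECURSIVE thread AS (
              SELECT id, parent_comment_id, 1 AS depth
                FROM comments
               WHERE story_id = 123 AND parent_comment_id IS NULL
              UNION ALL
              SELECT c.id, c.parent_comment_id, t.depth + 1
                FROM comments c
                JOIN thread t ON c.parent_comment_id = t.id
            )
            SELECT id, depth FROM thread;
            ```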

            The report from Berkopec is still pending. We’ve had some back-and-forth about why a few queries look the way they do and last I heard in October he wanted some extra data from Scout’s new Timeline view. I’m sending a ping on it now…

            1. 1

              Thanks for the update! Looking forward to the results.

          2. 1

            I’ve seen the story submission issues too, and think this would be neat.

            Will we be removing the hooks once the study is concluded?

            1. 2

              Yeah, I don’t expect to see a compelling reason to leave Scout integrated indefinitely.

          3. 6

            First off, thank you for this offer. Like pushcx, I think this should be up to the community as a whole, but I personally think it’s a good idea. In the interests of transparency and education, I’ll ask my privacy questions here where everyone can see them.

            1. Does your system send just one ping to your server per page load, or on what occasions does it send pings?

            2. What information is contained in each ping (other than the basic http headers, and the URI which you’ve already addressed)?

            3. Some user-agent strings are highly identifying, while others are not. Since your methodology involves pings directly from end users’ browsers to your server, you will necessarily receive raw user-agent strings. Do you store these? If so, do you do anything to anonymize them?

            4. It sounds like the answer is no, but does your tool set any cookies? If so, for what purposes? Does it make use of the existing Lobste.rs login cookie in any way?

            5. It sounds like your tool works by including JavaScript in the page. Is this correct? Is the JS code served from lobste.rs, from your own domain, or from a third-party CDN?

            6. Does your tool load any third-party resources other than its JS code and the pings?

            7. To what domain or domains are the pings sent? Who owns these domains? Who owns the servers for them, and where are the servers located? Are the domains served through a CDN or other third-party provider?

            8. Do pings always use https, or are there cases where they fall back to http?

            This is everything I can think of off the top of my head; I don’t know that I won’t have more questions later. I appreciate your engaging, I know I’m asking for a lot of details.

            1. 4

              Hi, I’m an engineer at Scout and can answer these questions.

              I think that most of your questions can be answered by a clarification of how Scout works. We instrument Ruby and Rails at the server level. We do not inject javascript or run anything else in the browser.

              1. The agent that is running on the server inside the Rails process reports once a minute to our checkin endpoint.
              2. We send aggregate data about all endpoints (Rails controller actions), plus a handful of more detailed traces. The detailed traces are timings of a single request. We sanitize any SQL, removing all parameters, and have options (mentioned in the OP) to further restrict private or semi-private data (IPs, user email as reported by Rails’s current_user, query parameters). These traces do not capture POST data, cookies, or headers.
              3. We don’t capture user agents.
              4. We don’t set any cookies.
              5. No, no javascript is added, and pages are not changed.
              6. Same as 5, no editing of the pages as they appear without Scout
              7. We host on AWS, and there are no CDNs in the checkin process. Other than using AWS’s Load Balancer and RDS offerings, the rest is on self-managed EC2 servers.
              8. The checkins between the Rails server and our checkin endpoint are always https.
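              To sketch what the SQL sanitization in point 2 looks like in spirit (an illustration, not our actual agent code):

              ```ruby
              # Rough illustration (not Scout's actual implementation) of sanitizing
              # SQL before it leaves the server: literal values are replaced with "?"
              # so the query *shape* is kept but the data itself is not.
              def sanitize_sql(sql)
                sql
                  .gsub(/'(?:[^']|'')*'/, "?")  # quoted string literals
                  .gsub(/\b\d+\b/, "?")         # bare numeric literals
              end

              sanitize_sql("SELECT * FROM users WHERE email = 'a@b.com' AND id = 42")
              # => "SELECT * FROM users WHERE email = ? AND id = ?"
              ```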

              Hopefully that clears things up a bit, I’d be happy to further clarify.

              1. 2

                That all helps a great deal, thank you very much. I realized belatedly that I should have asked whether you did the JS thing at all. :)

                Regarding your point 2, I do have a couple further questions.

                1. What is the aggregation key on this data? I’m imagining something along the lines of bucketized counts, like “537 requests spent < 20ms in this part of the code”?

                2. Also, in the more detailed traces, can you give a sample of what information each row in the trace contains? I’m guessing the columns are “timestamp, function name” or something like that?

                3. In general, do you use any absolute timestamps anywhere in what’s sent to the server, or is it all relative to the start of the query?

                1. 3
                  1. The aggregation is the timing and allocation counts per endpoint, per instrument.
                  • Endpoints are a Rails Controller + Action (“UserController#index”)
                  • Instruments are ActiveRecord, View Rendering, HTTP, etc.

                  But not any more detail than that.

                  We also store some metrics outside the specific instruments (95th percentile time, memory change over the request)
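                  As a rough picture of that rollup (a hypothetical sketch, not our agent’s code):

                  ```ruby
                  # Hypothetical sketch of per-endpoint, per-instrument aggregation:
                  # individual timings are rolled up under an (endpoint, instrument)
                  # key, so only counts and totals leave the server.
                  timings = [
                    ["UserController#index", "ActiveRecord", 12.5],
                    ["UserController#index", "ActiveRecord", 8.0],
                    ["UserController#index", "View",         30.0],
                  ]

                  agg = Hash.new { |h, k| h[k] = { count: 0, total_ms: 0.0 } }
                  timings.each do |endpoint, instrument, ms|
                    bucket = agg[[endpoint, instrument]]
                    bucket[:count] += 1
                    bucket[:total_ms] += ms
                  end

                  agg[["UserController#index", "ActiveRecord"]]
                  # => {:count=>2, :total_ms=>20.5}
                  ```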

                  1. Detailed traces keep the instrument data split up into each individual call. The clearest example is the screenshot on our marketing page.

                  A trace has some metadata, although some of this will be turned off in configuration options:

                  • The IP of the user, the email of the user. These will be turned off in this installation.
                  • The full URI requested, although any query params can optionally be excluded if there is private information present.
                  • Seconds since application server start. This is relative to the specific unicorn or puma handler.
                  • The Git Revision of the code deployed at the time of the trace
                  • Any custom context explicitly added. Useful for things like: “Number of posts rendered” and similar application specific data that help debug speed issues. This is added explicitly via API calls ScoutApm::Context.add("post_count", p)

                  Instrument data:

                  • “12 SQL calls doing User#find, with this sanitized SQL, for a total of 583ms, with timestamps [[start,stop], [start,stop]] originating at this line of code”.
                  • “Rendering users/index.html, which in turn rendered users/_user.html 100 times”

                  We do store and send absolute timestamps of many of these items. Is this a concern?

                  More generally, we aren’t being creepy about data, and only collect what’s useful to locate and debug performance issues. We have found a small amount of user-identifying info is helpful to debug (“Chris’s view of this page is super slow… oh, it’s because he has 1000 things, when a normal user has 10” ends up being a common issue).

                  The user data we collect is very similar to what’s in the Rails logs.

                  I’m not sure what the final plan is to share the resulting data with the whole community. Nate along with Scout will evaluate and write up findings, and probably a screencast. I’m not sure if the plan is to open up a more widely available way to poke around the data in a freeform manner.

                  1. 2

                    I would generally suggest from a privacy perspective that timestamps are sensitive in that they have a tendency to create unwanted joins against data from other sources, which can be used (for example) to re-identify individuals even if no user identifiers are present.

                    For an example of how this would work, somebody with access to your trace logs could find one that’s for the “post comment” endpoint, and then look through the lobste.rs comment history, which is sparse enough that it’s quite likely they’d be able to figure out which comment it was. Then, if any other data stored as part of your trace revealed sensitive information about the user who posted the comment, the attacker could know who that information applied to.

                    I don’t think this is a real attack in this scenario, because I can’t come up with a way that your trace logs could have sensitive information - mostly because lobste.rs itself does not have much sensitive information, other than IPs and passwords. The analysis would be a bit simpler if your timestamps were relative rather than absolute, but I don’t think it needs to be changed.

                    With regard to the plan to share the detailed data with the community afterward, I do think that would create a greater risk of somebody discovering an attack that I haven’t thought of. It’s probably possible with some care, or you could just share reports rather than the raw data. I think how much effort it’s worth investing to clean it up for release depends on how valuable you think that release is likely to be.

                    I appreciate your very thorough answers, and for whatever it’s worth this has my personal approval to go forward. Of course, we should wait for the rest of the community to weigh in as well.

              2. 3

                Regarding #5 — Scout APM, as I recall, is a lot like New Relic, in that instrumentation of the language VM is done (kind of like profiling hooks). This allows the vendor to get and report on function-level timings and often memory-usage introspection.

                I don’t believe there are any frontend/JS things at all. I could be wrong though… that may have become an add-on since last I checked.

              3. 5

                This is a kind offer.

                Alexa tells me that the median load time for lobste.rs is 0.687 seconds, helpfully adding that “91% of sites are slower.” It would be outstanding if that number got lower as an ultimate result of your measurements and optimization suggestions–the measurement is aggregated across all page requests that make their way in to Alexa.

                Users have reported to me that story submission is slow, sometimes on the order of 5-10 seconds. I would hope there is an easier win there–that’s an unconscionable amount of time.

                My own performance pain point is the comments page. We have some idea of what we could do there architecturally to speed it up–but I’ll be personally interested in seeing what you find there.

                I’ve also had one user report for slowness that turned out to coincide with the backups running. I wonder if you’ll be able to see that?

                On privacy issues, I need to study a little more: confidential information is stored in the database and modified by requests, including upvotes and direct messages. I may as a result come back with a list of endpoints to ignore if necessary to preserve that confidentiality. By the same token I appreciate your proactively disabling remote IP collection.

                I’m looking forward to your screencast.

                1. 4

                  Has the site been getting slower? I seem to think so, for the past few months. More load, some random gem update?

                  1. 4

                    Submissions and sometimes comments have been slow for a long time. Search can be ultra slow or just drop but it might be doing a lot of work. Just a submission or comment should be consistently fast unless the site is overloaded at that moment. So, I’d be grateful for anyone knocking out bottlenecks or inconsistencies.

                    1. 4

                      Search currently uses mariadb’s full-text indexing. Though now that I say that, I should defrag it and run optimize table on every table, so I just did. We’ll see if that helps and whether I should set a regular calendar reminder for it. (Surely we have a mariadb admin with an opinion on this?)

                      1. 1

                        I appreciate any attempts to improve it.

                    2. 3

                      I haven’t added more memory to the DomU in some time. I’ve needed to add it roughly in proportion to traffic/user growth on the site, and spot checking after your comment here suggests it would help. We did just do fairly systemic gem updates as a result of some CVEs that landed in the middle of this month, but “past few months” exceeds that time frame.

                      Let me add more memory no later than this weekend and see where that leaves us.

                      1. 1

                        I’ve added 2GiB of memory. Earlier this week @pushcx performed the database defrag mentioned in this thread. Since I was doing some housekeeping, I have re-enabled the DRBD secondary disk, so the block device is again being mirrored to another physical host in an adjacent rack. The DRBD secondary had been offline since late January 2019, when I changed the physical host lobste.rs was running on. I don’t think the state of DRBD helped or hurt performance, but the physical host change corresponds better to your timeline than the March gem update I reported on earlier–the former and current physical hosts should have identical specs, but the load may now be materially different.

                        I’m happy to investigate whether that’s the case and migrate the DomU if I find that might help. Will you let me know in the next little while if your experience of the site’s performance is different now that we’re running more memory and have run a database defrag?

                        1. 2

                          It does seem a little better. I’m also overseas right now, which adds a bunch of latency, but even so it seems faster than last week.

                          1. 1

                            I’ve spot checked memory usage throughout the day, and after traffic picked up this morning the machine has had >8GiB allocated the vast majority of that time. The weekend upgrade adding 2GiB took the machine from 8GiB to 10GiB, meaning we’ve used more memory during the day than the machine had previously. That confirms what I saw in the logs last week and explains some of the slowness over the past few months.

                            I’m kicking myself for not adding a couple more vCPUs at the same time–the very busiest times do briefly see us CPU-bound. I’ll continue spot checking this week and will decide whether to add more memory or vCPUs this weekend to give us more runway.

                            Thank you for bringing up the site performance and hanging in there with me to verify adding memory helped.

                      2. 1

                        Alexa tells me that the median load time for lobste.rs is 0.687 seconds

                        PR 674 (“we’ll submit a PR adding our monitoring gem to the codebase”) has now been submitted. Prior to that PR being merged, as a result of the discussion in this thread and the attendant increase in RAM, a material correction to the above statement: the median load time for lobste.rs, as reported by Alexa, is now 0.605 seconds.

                      3. 4

                        Hi lobste.rs. Long-time user myself as well. I am happy to answer any questions about this project.

                        Otherwise, looking forward to making the Lobsters experience better for everyone.

                        1. 2

                          Love this idea! Don’t see any downsides to this.

                          1. 2

                            I’ve deployed the PR and we should have enough data in a week or two. Found two bugs in our Ansible playbook along the way, to boot.

                            1. 3

                              After a lot of nudges and reminders, this project has stalled out as business priorities have shifted. Scout deactivated our account and I’ve removed the reporting plugin.

                              I did get some rough notes from this: in short, that we’re not really constrained on the Rails side. The two biggest areas for improvement are:

                              1. The complexity of a few of our queries limits the database’s ability to make effective use of indexes, especially on CommentsController#index (and presumably similar queries). We may be able to tweak the db design to mitigate this: the is_deleted, is_moderated, and is_expired booleans have a lot of semantic overlap, or we could distill an ‘active’ comments table that does not include those records.
                              2. Many other queries are slow because of limit/offset paging and we could use seek pagination to mitigate; we could roll this into issue 394.
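                              For anyone unfamiliar with seek pagination, a quick sketch of the difference (illustrative Ruby with made-up ids, not our actual queries):

                              ```ruby
                              # Illustrative sketch of limit/offset vs. seek (keyset) pagination.
                              # With offset paging the database walks and discards `offset` rows
                              # before returning the page; with seek paging it jumps straight to
                              # the last key the client saw.
                              ids = (1..100).to_a.reverse   # e.g. comment ids, newest first
                              page_size = 10

                              # limit/offset: page 3 scans and discards 20 rows first
                              offset_page = ids.drop(2 * page_size).take(page_size)

                              # seek: remember the last id of page 2 instead, roughly
                              # `WHERE id < :last_seen ORDER BY id DESC LIMIT 10`
                              last_seen = 81
                              seek_page = ids.select { |id| id < last_seen }.take(page_size)

                              offset_page == seek_page
                              # => true
                              ```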

                              As always, I’d appreciate help with this development work, especially if anyone is an expert at mariadb performance profiling and query tuning. Unfortunately, since we updated the server to Ubuntu 20.04 LTS, we’ve seen perf regressions that we’re fairly sure are db queries being generally slower across the board. I’m currently working through a backlog of PRs and mod issues that built up while my schedule was badly disrupted by world events the last few months, and db performance looks like the next most important thing for me to take up afterwards if we haven’t had contributions towards it.

                              1. 1