1. 35

Hey fellow Lobsters!

I’m a grad student at MIT working on distributed systems, and have been developing a research prototype of a large-scale SQL-compatible database with built-in support for cache maintenance. It is intended for read-heavy applications, and already significantly outperforms regular databases and the common memcached+database setup.

To evaluate this system, it would be extremely useful to have a good idea of what a realistic workload looks like for a relatively read-heavy but decently complex site such as Lobste.rs. To that end, I was hoping to get some insight into the Lobste.rs workload by running some analytics queries on the Lobste.rs database and web logs. @pushcx seemed open to this, but suggested that I post here first to solicit feedback on the queries I want to run and any privacy implications they may have.

I realize it wouldn’t be okay to give out the raw data, but a rough idea of the page visit distribution (e.g., 84% front-page, 10% comment view, 5% upvote, 1% comment), the popularity skew (data of #upvotes vs. article ID, or views vs. article ID for some period of time, would be super helpful!), the read-write ratio, etc., would go a long way towards building a reliable evaluation (and guiding the future design of the system!). It could also be extremely useful to other researchers who want to build solutions for “real” web applications.

First, we want to identify the distribution of upvotes across stories. Specifically, how many stories have various (binned) upvote counts. For example, this would tell us how many stories have thousands of votes, how many have 0-100 votes, etc. Note that we do not distinguish between upvotes and downvotes in this case, since they are the same as far as workload generation is concerned. The specific query we’d like to run against the Lobste.rs schema is:

SELECT ROUND(upvotes + downvotes, -2) AS bucket,
       COUNT(id),
       RPAD('', LN(COUNT(id)), '*')
FROM stories
GROUP BY bucket

Similarly, we’d like to extract the comment count distribution:

SELECT ROUND(comments_count, -2) AS bucket,
       COUNT(id),
       RPAD('', LN(COUNT(id)), '*')
FROM stories
GROUP BY bucket

These both use the technique outlined here to get an approximated histogram, which gives a good overview of the spread and distribution of popularity without giving all the raw data.
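To make the trick concrete, here is a small Python sketch of the same binning (hypothetical vote totals; `ROUND(x, -2)` in MySQL rounds half away from zero, which the helper mimics for non-negative inputs):

```python
import math
from collections import Counter

def bucket(n, width=100):
    # Mimic MySQL's ROUND(n, -2) for non-negative n:
    # round half away from zero to the nearest multiple of `width`
    return int(math.floor(n / width + 0.5)) * width

# Hypothetical per-story vote totals
votes = [3, 47, 51, 120, 180, 250, 980]
hist = Counter(bucket(v) for v in votes)
for b in sorted(hist):
    # LN(COUNT(id)) stars: log-scaled bars keep heavy-tailed counts readable
    print(f"{b:6d} {hist[b]:4d} {'*' * round(math.log(hist[b]))}")
```

The log-scaled star column matters because popularity data is heavy-tailed: a linear bar for the 0-bucket would dwarf everything else.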

We also want the distribution of upvotes across users: specifically, how many users have various (binned) vote counts. This helps us construct a workload generator that faithfully models some users being more active than others. The rounding ensures that we only learn how many users have a vote count near each multiple of 100 (each bin spans roughly 100*i - 50 to 100*i + 50):

SELECT ROUND(counted.cnt, -2) AS bucket,
       COUNT(user_id),
       RPAD('', LN(COUNT(user_id)), '*')
FROM (
    SELECT user_id, COUNT(id) AS cnt
    FROM votes
    GROUP BY user_id
) AS counted
GROUP BY bucket

From the nginx logs, we’d like to run the following for however much of the log is okay (more is better):

grep -vE "assets|fetch_url_attributes|check_url_dupe" access.log \
 | sed -e 's/.*\(GET\|POST\)/\1/' \
       -e 's/ \/\(s\|stories\)\/[^\/]*/ \/stories\/X/' \
       -e '/^GET / s/X\/.*/X\/Y/' \
       -e 's/ \/comments\/[^\/]*/ \/comments\/X/' \
 | awk '/^(GET|POST)/ {print $1" "$2}' \
 | grep -E ' /(stories|comments|login|logout|$)' \
 | sort | uniq -c | sort -rnk1,1

Phew, that’s a mouthful. Let’s first see what it produces:

 10 GET /
  9 GET /stories/X/Y
  4 POST /login
  2 POST /stories/X/upvote
  2 POST /stories
  2 POST /comments/X/upvote
  2 POST /comments/X/unvote
  2 POST /comments
  2 GET /comments
  1 POST /stories/X/unvote
  1 POST /logout
  1 GET /login
  1 GET /comments/X/reply

Basically, it looks for requests to “interesting” pages (that is, the frontpage, /stories, /comments, and /login + /logout), masks any dynamic parameters, and then shows the number of requests to each of those resources. This gives us an idea of which types of pages are accessed most frequently, and whether those accesses are generally reads or writes. If it feels too invasive to include absolute counts, percentages of the total would also be fine.
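Turning the absolute counts into percentages is trivial once the masked summary exists; a sketch with made-up rows (the numbers are from the example output above, not real data):

```python
# Hypothetical (count, request) rows from the masked log summary
rows = [(10, "GET /"), (9, "GET /stories/X/Y"), (4, "POST /login"),
        (2, "POST /stories/X/upvote")]
total = sum(count for count, _ in rows)
for count, request in rows:
    # Share of total traffic per masked request type
    print(f"{100 * count / total:5.1f}% {request}")
```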

We’d love to also learn the distribution of page popularities for both reads and writes, but we figured we’d leave that out of the initial analysis as it does leak a little more information. It’s also a little tricky to get right without just dumping all the data. We can approximate it using upvotes and # of comments, but the two aren’t necessarily perfectly correlated.
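For the workload generator, this kind of popularity skew is often modeled as a Zipfian distribution over story IDs. A minimal sketch (the exponent 1.08 is a placeholder; the real value would be fit from whatever distributional data we get):

```python
import random

def zipf_weights(n, s=1.08):
    # Zipf-like popularity: the story at rank r gets weight 1 / r^s
    return [1.0 / (r ** s) for r in range(1, n + 1)]

random.seed(42)
n_stories = 1000
weights = zipf_weights(n_stories)
# Draw story IDs for simulated read requests; rank 1 is the most popular story
requests = random.choices(range(1, n_stories + 1), weights=weights, k=10_000)
```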

I’d love to hear what you all think about these, and whether you can think of any improvements. The ultimate goal here is to construct a workload generator that can simulate a semi-realistic request load for a Lobste.rs-like website. I think the data above should go a long way towards that goal, though I may have missed something.

Cheers, Jon

  1. 18

    Hey folks,

    Jon messaged me a day or two ago. I gave him the standard answer about these sorts of inquiries: I’m happy to run queries that don’t reveal personal info (like IPs, browsing, and voting) or create “worst-of” leaderboards celebrating the most-downvoted users/comments/stories, that sort of thing. I can’t volunteer to write queries for people, but the schema is on GitHub and the logs are MySQL and nginx, so it’s straightforward to do.

    A couple years ago jcs ran some queries for me and I wanted to continue that, especially as the public stats that answered some popular questions have been gone for a while. It’s useful for transparency and because it’s just fun to investigate interesting questions. I’ve already run a few queries for folks in the chat room (the only one I can remember off the top of my head is how many total stories have been submitted; we passed 40k last month).

    I asked Jon to post publicly about this because it sounded like he had significantly more than one question he was idly curious about, to help spread awareness that I’ll run queries like this, and get the community’s thoughts on his queries and the general policy. I’ll add a note to the about page after this discussion.

    I’m going offline for a couple hours for a prior commitment before I’ll have a chance to run any of these, but it’ll leave plenty of time for a discussion to get started or folks to think up their own queries to post as comments.

    1. 4

      Wasn’t the concept of Differential Privacy developed for exactly this purpose: querying databases that contain personal data while preserving as much privacy as possible? Maybe it could be employed here?

      1. 5

        In this particular case I don’t think the counts are actually sensitive, so it’s unclear that applying DP is even necessary. But I’ll ping Frank McSherry who’s one of the primary proponents of DP in academia nowadays and see what he thinks :) Maybe with DP we could extract what is arguably more sensitive information (e.g., by reducing the bin widths).

        1. 2

          That seems doable, but I have the strong suspicion that if I wing it I’ll screw something up and leak personal info. So hopefully Frank can chime in with some good advice.

          1. 6

            I’m here! :D I’m writing up some text in more of a long-form “blog post” format, to try and explain what is what without the constraints of trying to fit everything in-line here. But, some high-level points:

            1. Operationally, queries one and two are pretty easy to pull off with differential privacy (the “how many votes per article” and “how many votes per user” queries). I’ve got some code that does that, and depending on the scale you could even just use it, in principle (or if you only have a SQL interface to the logs, we may need to bang on them).

            2. The third query is possibly not complicated, depending on my understanding of it. My sed-fu is weak, but to the extent that the query asks only for the counts of pre-enumerable strings (e.g. POST /stories/X/upvote) it should be good. If the query needs to discover what strings are important (e.g. POST /stories/X/*) then there is a bit of a problem. It is still tractable, but perhaps more of a mess than you want to casually wade into.

            3. Probably the biggest question mark is about the privacy guarantee you want to provide. I understand that you have a relatively spartan privacy policy, which is all good, but clearly you have some interest in doing right by the users with respect to their sensitive data. The most casual privacy guarantee you can give is probably “mask the presence/absence of individual votes/views”, which would be “per log-record privacy”. You might want to provide a stronger guarantee of “mask the presence/absence of entire individuals”, which could have a substantial deleterious impact on the analyses; I’m not super-clear on which guarantee you would prefer, or even the best rhetorical path to take to try and discover which one you prefer.
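            For reference, the per-log-record guarantee typically boils down to adding Laplace noise with scale 1/ε to each histogram count before release. This is a generic DP sketch (not Frank's actual code), assuming each log record can change one bucket count by at most 1:

```python
import math
import random

def laplace(scale, rng=random):
    # Inverse-CDF sample from a Laplace(0, scale) distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_histogram(counts, epsilon=0.5, rng=random):
    # Each record affects one bucket by 1 (sensitivity 1), so adding
    # Laplace(1/epsilon) noise per bucket gives epsilon-DP per record
    return {b: c + laplace(1 / epsilon, rng) for b, c in counts.items()}

random.seed(1)
# Bucketed story-vote counts from the thread, used here purely as example input
released = noisy_histogram({0: 40286, 100: 360, 200: 2, 300: 2})
```

Note how the noise is negligible relative to the big buckets but can swamp the small ones, which is exactly the masking effect wanted for rare (identifying) values.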

            Anyhow, I’m typing things up right now and should have a post with example code, data, analyses, etc. pretty soon. At that point, it should be clearer to say “ah, well let’s just do X then” or “I didn’t realize Y; that’s fatal, unfortunately”.

            EDIT: I’ve put up a preliminary version of a post under the idea that info sooner rather than later is more helpful. I’m afraid I got pretty excited about the first two questions and didn’t really do much about the third. The short version of the post is that one could probably release information that leads to pretty accurate distributional information about the multiplicities of votes, by articles and by users, without all of the binning. That could be handy as (elsewhere in the thread) it looks like binning coarsely kills much of the information. Take a read and I’ll iterate on the clarity of the post too.

            1. 2

              To follow up briefly on this: yes, it would be useful to avoid the binning so that we could feed more data to whatever regression we end up using to approximate the underlying distribution.
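              One simple regression for that purpose is a least-squares fit on log-log axes, which recovers the exponent of a power law. A sketch on synthetic data (real inputs would be the released per-article multiplicities):

```python
import math

def fit_power_law(xs, ys):
    # Least-squares fit of log y = a + b log x, i.e. y ~ exp(a) * x^b
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = my - b * mx
    return math.exp(a), b

# Synthetic data drawn exactly from y = 5000 * x^-1.5
xs = [1, 2, 4, 8, 16, 32]
ys = [5000 * x ** -1.5 for x in xs]
c, alpha = fit_power_law(xs, ys)  # recovers c = 5000, alpha = -1.5
```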

        2. 2

          For those who are curious, I’ve started implementing the workload generator here. It currently mostly issues random requests, but once I have the statistics I’ll plug them in and it should generate more representative traffic patterns. It does require a minor [patch](https://github.com/jonhoo/trawler/blob/master/lobsters.diff) to the upstream lobste.rs codebase, but that’s mostly to enable automation.

        3. 6

          Shouldn’t this metadata be public (or at least available to all lobste.rs members), or not given to any 3rd parties at all? If there’s a reason for it not being public, then there’s risk involved in giving it to 3rd parties (who may have the best data security ever, but could still lose the data for reasons beyond their control).

          Personally I’d love to see stats that could be generated by scraping the site anyway available to everyone, all the time, it’s interesting to see these kinds of distributions and things.

          Hit counts, which are only available from web server logs, seem more sensitive/private and less interesting to me.

          1. 4

            Oh, sorry, I wasn’t implying that this data needed to be sent to me specifically; I’d be happy for the metrics to be posted somewhere public. I think there is an argument against making the entire db public though:

            • Password hashes shouldn’t be public
            • User e-mail addresses shouldn’t be public
            • Who voted for/against what shouldn’t be public
            • “Hidden” lists shouldn’t be public
            • Private messages shouldn’t be public

            etc. One option would be to anonymize the database dump before posting it, but I think having some well-defined metrics instead will get us quite far, with less potential for accidentally leaking something.

            As for scraping the site, I completely agree with you. It would be super useful to get a dump of all stories + comments + vote counts for the purposes of re-constructing (or even just backing up) lobste.rs elsewhere.

            In my particular case, hit counts are important for a couple of reasons:

            • It shows which pages are accessed more frequently, which in turn tells me how often different queries are issued.
            • If story IDs aren’t masked out, it tells us the popularity distribution of stories (i.e., key skew in database terms).
            • It tells us access patterns over time – how quickly does the popularity of particular stories wax and wane.
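              The last point could be extracted with another masked aggregation, e.g. counting requests per story per day. A hypothetical sketch over parsed (timestamp, story_id) pairs (the pairs and IDs here are made up):

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical parsed (timestamp, story_id) pairs from the access log
hits = [
    ("2018-02-11T04:40:31", "abc123"),
    ("2018-02-11T09:12:02", "abc123"),
    ("2018-02-12T10:00:00", "abc123"),
    ("2018-02-12T11:30:00", "def456"),
]
# Daily hit count per story: the decay of each story's counts over
# successive days shows how quickly popularity wanes
per_day = Counter(
    (datetime.fromisoformat(ts).date(), story) for ts, story in hits
)
```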
            1. 7

              One option would be to anonymize the database dump before posting it

              What has to be factored in here is that what’s publicly on the site can be compared against what’s in the dump to deanonymize quite a few users. That’s made easier by the relatively small number of commenters and submitters, whose votes are already visible on the main site.

              1. 4

                Related to public backups of the site, a few months ago Archive Team scraped the entire site. I don’t know where to find the archive they made (maybe it all just went into the Wayback Machine?) or if they’re continuing to pick up new stories.

            2. 6

              Results here. Thanks for your patience, @jonhoo, I was traveling.

              SELECT ROUND(upvotes + downvotes, -2) AS bucket, COUNT(id), RPAD('', LN(COUNT(id)), '*') FROM stories GROUP BY bucket
              
              +--------+-----------+------------------------------+
              | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
              +--------+-----------+------------------------------+
              |      0 |     40286 | ***********                  |
              |    100 |       360 | ******                       |
              |    200 |         2 | *                            |
              |    300 |         2 | *                            |
              +--------+-----------+------------------------------+
              4 rows in set (0.30 sec)
              

              SELECT ROUND(comments_count, -2) AS bucket, COUNT(id), RPAD('', LN(COUNT(id)), '*') FROM stories GROUP BY bucket;

              +--------+-----------+------------------------------+
              | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
              +--------+-----------+------------------------------+
              |      0 |     40482 | ***********                  |
              |    100 |       165 | *****                        |
              |    200 |         3 | *                            |
              +--------+-----------+------------------------------+
              3 rows in set (0.29 sec)
              

              SELECT ROUND(counted.cnt, -2) AS bucket, COUNT(user_id), RPAD('', LN(COUNT(user_id)), '*') FROM ( SELECT user_id, COUNT(id) AS cnt FROM votes GROUP BY user_id) AS counted GROUP BY bucket;

              +--------+----------------+-----------------------------------+
              | bucket | COUNT(user_id) | RPAD('', LN(COUNT(user_id)), '*') |
              +--------+----------------+-----------------------------------+
              |      0 |           4439 | ********                          |
              |    100 |            562 | ******                            |
              |    200 |            206 | *****                             |
              |    300 |            131 | *****                             |
              |    400 |             92 | *****                             |
              |    500 |             45 | ****                              |
              |    600 |             28 | ***                               |
              |    700 |             31 | ***                               |
              |    800 |             29 | ***                               |
              |    900 |             32 | ***                               |
              |   1000 |             14 | ***                               |
              |   1100 |             23 | ***                               |
              |   1200 |             18 | ***                               |
              |   1300 |             18 | ***                               |
              |   1400 |             13 | ***                               |
              |   1500 |             10 | **                                |
              |   1600 |              3 | *                                 |
              |   1700 |              7 | **                                |
              |   1800 |              8 | **                                |
              |   1900 |              6 | **                                |
              |   2000 |              3 | *                                 |
              |   2100 |              3 | *                                 |
              |   2200 |              4 | *                                 |
              |   2300 |              6 | **                                |
              |   2400 |              4 | *                                 |
              |   2500 |              2 | *                                 |
              |   2600 |              3 | *                                 |
              |   2700 |              2 | *                                 |
              |   2800 |              2 | *                                 |
              |   2900 |              3 | *                                 |
              |   3000 |              2 | *                                 |
              |   3100 |              3 | *                                 |
              |   3200 |              2 | *                                 |
              |   3300 |              2 | *                                 |
              |   3400 |              1 |                                   |
              |   3600 |              1 |                                   |
              |   3800 |              1 |                                   |
              |   3900 |              1 |                                   |
              |   4000 |              1 |                                   |
              |   4100 |              3 | *                                 |
              |   4200 |              1 |                                   |
              |   4300 |              6 | **                                |
              |   4400 |              1 |                                   |
              |   4500 |              2 | *                                 |
              |   4600 |              1 |                                   |
              |   4800 |              1 |                                   |
              |   5000 |              1 |                                   |
              |   5100 |              1 |                                   |
              |   5400 |              1 |                                   |
              |   5500 |              1 |                                   |
              |   5800 |              2 | *                                 |
              |   5900 |              1 |                                   |
              |   6000 |              1 |                                   |
              |   6400 |              2 | *                                 |
              |   6800 |              1 |                                   |
              |   7100 |              1 |                                   |
              |   7700 |              1 |                                   |
              |   8400 |              1 |                                   |
              |   8700 |              1 |                                   |
              |   8900 |              1 |                                   |
              |  10100 |              1 |                                   |
              |  10900 |              1 |                                   |
              |  13500 |              1 |                                   |
              |  14500 |              1 |                                   |
              |  15100 |              1 |                                   |
              +--------+----------------+-----------------------------------+
              65 rows in set (0.15 sec)
              

              nginx grep:

              This log started 2018-02-11 04:40:31 UTC. I edited some tokens and login information out, replacing with [elided].

              1133494 GET /stories/X/Y
               590907 GET /
                71829 GET /comments
                19150 GET /comments.rss
                12020 POST /comments/X/upvote
                 9033 POST /stories/X/upvote
                 6126 POST /comments
                 4591 GET /login
                 3681 GET /comments/X/reply
                 1660 POST /login
                 1429 GET /comments/X/edit
                 1301 POST /comments/X/2.0"
                 1174 POST /stories/X/save
                 1019 POST /stories
                  929 POST /stories/X/hide
                  637 POST /comments/X/downvote
                  535 GET /comments/X/2
                  390 POST /stories/X/2.0"
                  224 POST /comments/X/unvote
                  202 GET /comments/X/3
                  198 GET /comments/X/2.0"
                  190 POST /stories/X/downvote
                  174 GET /login/forgot_password
                  150 POST /stories/X/suggest
                  148 POST /stories/X/unvote
                  148 GET /login/2fa
                  132 POST /login/2fa_verify
                  109 POST /login/reset_password
                   96 GET /comments/X/4
                   94 POST /stories/X/unsave
                   63 POST /comments/X/delete
                   55 POST /logout
                   54 GET /comments/X/5
                   37 POST /stories/X/unhide
                   34 GET /comments.rss?token=[elided]
                   29 GET /comments/X/6
                   27 POST /login/set_new_password
                   19 POST /comments/X/1.1"
                   19 GET /comments/X/7
                   19 GET /comments/X/1.1"
                   12 POST /
                   12 GET /comments/X/3170
                   10 POST /comments/X/undelete
                    9 GET /comments/X/8
                    8 GET /comments/X/9
                    8 GET /comments/X/10
                    7 GET /logout
                    5 GET /login/reset_password
                    5 GET /comments/X/11
                    4 GET /comments/X/
                    2 POST /stories/X/1.1"
                    2 POST /comments?RzVI%3D3496%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23
                    2 GET /comments/X/5000
                    2 GET /comments/X/12
                    1 POST /stories/X/do_we_need_move_away_from_elm
                    1 POST /login?utf8=%E2%9C%93&authenticity_token=[elided]&email=[elided]&password=[elided]&commit=Login&referer=https%3A%2F%2Flobste.rs%2Fs%2Fcgqz3p%2Fdo_we_need_move_away_from_elm
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /login/set_new_password?token=[elided]
                    1 GET /comments/X/7000
                    1 GET /comments/X/6000
                    1 GET /comments/X/5700
                    1 GET /comments/X/5600
                    1 GET /comments/X/5570
                    1 GET /comments/X/5559
                    1 GET /comments/X/5558
                    1 GET /comments/X/5555
                    1 GET /comments/X/5550
                    1 GET /comments/X/5500
                    1 GET /comments/X/20000
                    1 GET /comments/X/200
                    1 GET /comments/X/20
                    1 GET /comments/X/14
                    1 GET /comments/X/13
                    1 GET /comments/X/10000
                    1 GET /comments.rss?token=[elided]
              
              1. 1

                This is fantastic, thank you!

                Looks like my log grepping wasn’t perfect, but we can clean up the remainder pretty easily. Feel free to edit out any lines from the nginx log with a count < 10. I posted a slightly updated command in a comment a few days ago that includes /recent and /u and makes some other fixes – if you could run that one (mostly for the /recent numbers), that’d be awesome.

                It looks like the comment and vote counts are a little smaller than I expected, which makes the distribution hard to infer from the few resulting data points. Any chance you could re-run the first two queries with ROUND(..., -1) instead (rounding to the nearest 10 instead of the nearest 100)?

                1. 2

                  The updated nginx grep didn’t work. My sed (which identifies itself as sed (GNU sed) 4.4) errored on the three latter instances of //. I took a stab at correcting it but couldn’t immediately spot what it should’ve been.

                  And, sure, here’s the first two with the smaller buckets:

                  +--------+-----------+------------------------------+
                  | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                  +--------+-----------+------------------------------+
                  |      0 |     16724 | **********                   |
                  |     10 |     16393 | **********                   |
                  |     20 |      4601 | ********                     |
                  |     30 |      1707 | *******                      |
                  |     40 |       680 | *******                      |
                  |     50 |       281 | ******                       |
                  |     60 |       128 | *****                        |
                  |     70 |        60 | ****                         |
                  |     80 |        35 | ****                         |
                  |     90 |        16 | ***                          |
                  |    100 |         4 | *                            |
                  |    110 |         4 | *                            |
                  |    120 |        10 | **                           |
                  |    130 |         1 |                              |
                  |    140 |         2 | *                            |
                  |    160 |         1 |                              |
                  |    210 |         1 |                              |
                  |    250 |         1 |                              |
                  |    290 |         1 |                              |
                  +--------+-----------+------------------------------+
                  19 rows in set (0.32 sec)
                  
                  
                  +--------+-----------+------------------------------+
                  | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                  +--------+-----------+------------------------------+
                  |      0 |     33974 | **********                   |
                  |     10 |      4831 | ********                     |
                  |     20 |      1029 | *******                      |
                  |     30 |       401 | ******                       |
                  |     40 |       193 | *****                        |
                  |     50 |       103 | *****                        |
                  |     60 |        50 | ****                         |
                  |     70 |        24 | ***                          |
                  |     80 |        18 | ***                          |
                  |     90 |        10 | **                           |
                  |    100 |         5 | **                           |
                  |    110 |         1 |                              |
                  |    120 |         5 | **                           |
                  |    130 |         2 | *                            |
                  |    140 |         1 |                              |
                  |    150 |         2 | *                            |
                  |    170 |         1 |                              |
                  +--------+-----------+------------------------------+
                  17 rows in set (0.31 sec)
                  
                  1. 1

                    Huh, seems like something weird happened when I copied it. Here, try this one?

                    grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password" lobsters.log \
                     | sed -e 's/.*\(GET\|POST\)/\1/' \
                           -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                           -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                           -e '/^GET / s/X\/.\+/X/' \
                           -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                     | awk '/^(GET|POST)/ {print $1" "$2}' \
                     | grep -E ' /(stories|comments|login|logout|u/|$)' \
                     | sort | uniq -c | sort -rnk1,1
                    
                    1. 2
                      1139580 GET /stories/X
                       594323 GET /
                       159959 GET /u/X
                        72320 GET /comments
                        19287 GET /comments.rss
                        12088 POST /comments/X/upvote
                         9081 POST /stories/X/upvote
                         6160 POST /comments
                         4597 GET /login
                         3704 GET /comments/X/reply
                         1663 POST /login
                         1433 GET /comments/X/edit
                         1322 POST /comments/X
                         1188 POST /stories/X/save
                         1022 POST /stories
                          932 POST /stories/X/hide
                          641 POST /comments/X/downvote
                          542 GET /comments/X/2
                          395 POST /stories/X
                          224 POST /comments/X/unvote
                          219 GET /comments/X
                          207 GET /comments/X/3
                          192 POST /stories/X/downvote
                          153 GET /login/2fa
                          150 POST /stories/X/suggest
                          149 POST /stories/X/unvote
                          137 POST /login/2fa_verify
                          101 GET /comments/X/4
                           94 POST /stories/X/unsave
                           63 POST /comments/X/delete
                           55 POST /logout
                           55 GET /comments/X/5
                           37 POST /stories/X/unhide
                           34 GET /comments.rss?token=[elided]
                           29 GET /comments/X/6
                           19 GET /comments/X/7
                           12 POST /
                           12 GET /comments/X/3170
                           10 POST /comments/X/undelete
                      
                      1. 1

                        Oops, looks like I forgot /recent again in that query :( Could you run:

                        grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password|token" lobsters.log \
                         | sed -e 's/.*\(GET\|POST\)/\1/' \
                               -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                               -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                               -e '/^GET / s/X\/.\+/X/' \
                               -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                         | awk '/^(GET|POST)/ {print $1" "$2}' \
                         | grep -E ' /(stories|comments|login|logout|u/|recent$)' \
                         | sort | uniq -c | sort -rnk1,1
                        

                        And also let me know the start time of the log and the time you run the query (so I have an idea of the interval)?

                        The workload generator is also shaping up over here :)

                        1. 2

                          Timestamp at the start is 11/Feb/2018:04:40:31 +0000, end is 27/Mar/2018:01:26:49 +0000. I’m glad to see these stats going into your work, and I look forward to seeing your finished work submitted as a story.

                          1615623 GET /stories/X
                           193915 GET /u/X
                           105680 GET /comments
                            27731 GET /comments.rss
                            23155 GET /recent
                            18235 POST /comments/X/upvote
                            13730 POST /stories/X/upvote
                             9141 POST /comments
                             7137 GET /login
                             5572 GET /comments/X/reply
                             2512 POST /login
                             2233 GET /comments/X/edit
                             2055 POST /comments/X
                             1735 POST /stories/X/save
                             1522 POST /stories
                             1422 POST /stories/X/hide
                             1159 POST /comments/X/downvote
                              801 GET /comments/X/2
                              600 POST /stories/X
                              391 POST /comments/X/unvote
                              347 POST /stories/X/downvote
                              340 GET /comments/X
                              314 GET /comments/X/3
                              261 POST /stories/X/unvote
                              212 GET /login/2fa
                              203 POST /stories/X/suggest
                              191 POST /login/2fa_verify
                              156 GET /comments/X/4
                              125 POST /stories/X/unsave
                               99 POST /comments/X/delete
                               94 POST /logout
                               79 GET /comments/X/5
                               60 POST /stories/X/unhide
                               45 GET /comments/X/6
                               30 GET /comments/X/7
                               15 POST /comments/X/undelete
                               14 GET /comments/X/8
                               13 GET /comments/X/9
                               12 GET /comments/X/3170
                               12 GET /comments/X/10
                                9 GET /comments/X/
                          
                          1. 3

                            Yeah, the workload generator is shaping up pretty nicely! FWIW, it looks like the next scaling bottleneck lobste.rs is likely to experience is Ruby at ~100x current load on an 8-core server, followed by MySQL scaling of the transactional update to the traffic stats around ~2000x current load. So still quite a bit of headroom!

                            Also, I made (yet another) stupid mistake in the query above: due to a missing |, it doesn't include the frontpage, which is arguably quite important. Try again? O:)

                            grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password|token" lobsters.log \
                             | sed -e 's/.*\(GET\|POST\)/\1/' \
                                   -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                                   -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                                   -e '/^GET / s/X\/.\+/X/' \
                                   -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                             | awk '/^(GET|POST)/ {print $1" "$2}' \
                             | grep -E ' /(stories|comments|login|logout|u/|recent|$)' \
                             | sort | uniq -c | sort -rnk1,1
                            
                            1. 2

                              Nice to hear we should have smooth sailing for some time.

                               870987 GET /
                               193915 GET /u/X
                               105680 GET /comments
                                27731 GET /comments.rss
                                23155 GET /recent
                                18235 POST /comments/X/upvote
                                13730 POST /stories/X/upvote
                                 9141 POST /comments
                                 7137 GET /login
                                 5572 GET /comments/X/reply
                                 2512 POST /login
                                 2233 GET /comments/X/edit
                                 2055 POST /comments/X
                                 1735 POST /stories/X/save
                                 1574 GET /recent/page/2
                                 1522 POST /stories
                                 1422 POST /stories/X/hide
                                 1159 POST /comments/X/downvote
                                  801 GET /comments/X/2
                                  606 GET /recent/page/3
                                  600 POST /stories/X
                                  391 POST /comments/X/unvote
                                  347 POST /stories/X/downvote
                                  340 GET /comments/X
                                  316 GET /recent/page/4
                                  314 GET /comments/X/3
                                  261 POST /stories/X/unvote
                                  212 GET /login/2fa
                                  203 POST /stories/X/suggest
                                  191 POST /login/2fa_verify
                                  187 GET /recent/page/5
                                  156 GET /comments/X/4
                                  125 POST /stories/X/unsave
                                  125 GET /recent/page/6
                                  110 GET /recent/
                                   99 POST /comments/X/delete
                                   98 GET /recent/page/7
                                   94 POST /logout
                                   83 GET /recent/page/8
                                   79 GET /comments/X/5
                                   73 GET /recent/page/9
                                   60 POST /stories/X/unhide
                                   59 GET /recent/page/10
                                   47 GET /recent/page/11
                                   45 GET /comments/X/6
                                   31 GET /recent/page/13
                                   30 GET /recent/page/22
                                   30 GET /comments/X/7
                                   29 GET /recent/page/14
                                   28 GET /recent/page/33
                                   28 GET /recent/page/21
                                   27 GET /recent/page/28
                                   26 GET /recent/page/12
                                   25 POST /
                                   25 GET /recent/page/24
                                   25 GET /recent/page/23
                                   25 GET /recent/page/18
                                   25 GET /recent/page/17
                                   25 GET /recent/page/115
                                   24 GET /recent/page/40
                                   24 GET /recent/page/27
                                   24 GET /recent/page/19
                                   24 GET /recent/page/15
                                   22 GET /recent/page/80
                                   22 GET /recent/page/60
                                   22 GET /recent/page/16
                                   21 GET /recent/page/37
                                   21 GET /recent/page/25
                                   20 GET /recent/page/95
                                   20 GET /recent/page/31
                                   19 GET /recent/page/96
                                   19 GET /recent/page/39
                                   19 GET /recent/page/30
                                   19 GET /recent/page/112
                                   19 GET /recent/page/104
                                   18 GET /recent/page/35
                                   18 GET /recent/page/20
                                   18 GET /recent/page/134
                                   17 GET /recent/page/42
                                   16 GET /recent/page/67
                                   16 GET /recent/page/124
                                   15 POST /comments/X/undelete
                                   15 GET /recent/page/92
                                   15 GET /recent/page/84
                                   15 GET /recent/page/77
                                   15 GET /recent/page/68
                                   15 GET /recent/page/66
                                   15 GET /recent/page/57
                                   15 GET /recent/page/36
                                   15 GET /recent/page/131
                                   15 GET /recent/page/125
                                   15 GET /recent/page/122
                                   15 GET /recent/page/121
                                   15 GET /recent/page/116
                                   15 GET /recent/page/114
                                   15 GET /recent/page/101
                                   15 GET /recent/page/100
                                   14 GET /recent/page/87
                                   14 GET /recent/page/59
                                   14 GET /recent/page/45
                                   14 GET /comments/X/8
                                   13 GET /recent/page/98
                                   13 GET /recent/page/82
                                   13 GET /recent/page/51
                                   13 GET /recent/page/135
                                   13 GET /recent/page/133
                                   13 GET /recent/page/128
                                   13 GET /recent/page/120
                                   13 GET /recent/page/108
                                   13 GET /comments/X/9
                                   12 GET /recent/page/94
                                   12 GET /recent/page/93
                                   12 GET /recent/page/89
                                   12 GET /recent/page/69
                                   12 GET /recent/page/65
                                   12 GET /recent/page/53
                                   12 GET /recent/page/52
                                   12 GET /recent/page/49
                                   12 GET /recent/page/46
                                   12 GET /recent/page/38
                                   12 GET /recent/page/34
                                   12 GET /recent/page/136
                                   12 GET /recent/page/132
                                   12 GET /recent/page/130
                                   12 GET /recent/page/117
                                   12 GET /recent/page/102
                                   12 GET /recent/page/
                                   12 GET /comments/X/3170
                                   12 GET /comments/X/10
                                   11 GET /recent/page/90
                                   11 GET /recent/page/88
                                   11 GET /recent/page/79
                                   11 GET /recent/page/71
                                   11 GET /recent/page/64
                                   11 GET /recent/page/129
                                   11 GET /recent/page/123
                                   11 GET /recent/page/118
                                   11 GET /recent/page/109
                                   11 GET /recent/page/105
                                   10 GET /recent/page/85
                                   10 GET /recent/page/78
                                   10 GET /recent/page/74
                                   10 GET /recent/page/70
                                   10 GET /recent/page/63
                                   10 GET /recent/page/61
                                   10 GET /recent/page/55
                                   10 GET /recent/page/43
                                   10 GET /recent/page/41
                                   10 GET /recent/page/26
                                   10 GET /recent/page/126
                                   10 GET /recent/page/119
                              
                              1. 1

                                Hehe, yeah, I think you should be good ;)

                                Hmm, now there’s no entry there for GET /stories/X? Did you perhaps miss the first line when copying? Looks like the data is all from the same logging period though, so I can just manually combine them :) Thanks!

                        2. 1

                          I also just realized that we’re missing the vote distribution for comments, which’d be super handy!

                          SELECT ROUND(counted.cnt, -1) AS bucket,
                                 COUNT(comment_id),
                                 RPAD('', LN(COUNT(comment_id)), '*')
                          FROM (
                              SELECT comment_id, COUNT(id) AS cnt
                              FROM votes WHERE comment_id IS NOT NULL
                              GROUP BY comment_id
                          ) AS counted
                          GROUP BY bucket
                          
                          1. 2
                            +--------+-------------------+--------------------------------------+
                            | bucket | COUNT(comment_id) | RPAD('', LN(COUNT(comment_id)), '*') |
                            +--------+-------------------+--------------------------------------+
                            |      0 |             89907 | ***********                          |
                            |     10 |             27614 | **********                           |
                            |     20 |              2825 | ********                             |
                            |     30 |               637 | ******                               |
                            |     40 |               177 | *****                                |
                            |     50 |                61 | ****                                 |
                            |     60 |                29 | ***                                  |
                            |     70 |                 6 | **                                   |
                            |     80 |                 5 | **                                   |
                            |     90 |                 5 | **                                   |
                            |    100 |                 3 | *                                    |
                            |    150 |                 1 |                                      |
                            +--------+-------------------+--------------------------------------+
                            
                2. 5

                  To everyone who’s curious about the results of this, and to @pushcx in particular, the paper is now published! https://jon.tsp.io/papers/osdi18-noria.pdf

                  1. 1

                    Thanks for publishing it. This is a really old thread, though. I suggest you submit it as its own story, marked as authored by you, with a reference to this thread in the text field. Then it will get more attention.

                    1. 2

                      Yeah, @pushcx made the same suggestion privately. That was the plan all along, but I wanted to wait until we’d held the conference talk (which we just did!) and polished up the codebase + docs a little. Submitted it now as https://lobste.rs/s/uxgjha/noria_dynamic_partially_stateful_data !

                  2. 5

                    It’s 3 AM here so I haven’t thought super hard about the data you’re proposing to retrieve, but offhand it seems okay. I do have one clarifying question; this:

                    It could also be extremely useful to other researchers who may want to build solutions for “real” web applications.

                    seems to imply that the data might be shared. How? Will it be fully public? From what I see here I’m pretty sure I’d be fine with that, I’m just wondering :)

                    Also, as a side note, I assume you’ll submit your paper to Lobste.rs when you’re done ;)

                    Those performance results sound super interesting, checking out the links you posted is on my todo list now.

                    1. 5

                      I was imagining that the results of running these queries would be posted publicly by @pushcx, or whoever ends up running them. If you or others are concerned about that, I could just use them directly myself, but I think it’d be advantageous to just make them public :)

                      Hehe, yes, the paper will appear here when it is (hopefully) published.

                    2. 5

                      One thing I have been curious about is daily usage patterns for tags. If I submit something on a topic, I prefer to do it when most folks interested in it are reading. Otherwise, they miss out. It seems like some tags and days/times get different results than others, but I’m not sure. I’d like to see patterns on:

                      1. Views by day and time for all threads.

                      2. Same for each tag in a way where we can also look at combinations.

                      Might be helpful.

                      1. 4

                        From a sysadmin’s point of view, those kinds of metrics are very interesting for monitoring purposes (I’d just add latency alongside them).

                        1. You can quickly see when there is an issue after a deployment (or something else), for example if the number of upvotes per hour has been at 0 since your last deployment.
                        2. You can see which routes are the slowest/most used, so you can improve them.
                        3. See when a new deployment hurts performance.
                        4. Correlate with other metrics …

                        In my previous and current jobs we’ve been using Prometheus for that and it’s SUPER useful, although I reckon it’s probably overkill for just a Lobsters-sized app.

                        1. 1

                          It’s a little bit trickier to extract latency measurements from the nginx logs + the db without additional instrumentation. I agree with you though, those kinds of numbers would be very useful for system monitoring. For my purposes, they’re not particularly interesting, as I’d use the resulting workload generator to generate load and then measure system performance.

                        2. 1

                          @pushcx, seems I missed /recent and /u in that list, which would be a little unfortunate. Here’s a fixed command:

                          grep -vE "assets|fetch_url_attributes|check_url_dupe" lobsters.log
                           | sed -e 's/.*\(GET\|POST\)/\1/'
                                 -e 's//\(s\|stories\)\/[^\/ ]*//stories\/X/'
                                 -e 's//u\/[^\/ ]*//u\/X/'
                                 -e '/^GET / s/X\/.\+/X/'
                             -e 's//comments\/[^\/ ]*//comments\/X/'
                           | awk '/^(GET|POST)/ {print $1" "$2}'
                           | grep -E ' /(stories|comments|login|logout|u/|$)'
                           | sort | uniq -c | sort -rnk1,1