Hey fellow Lobsters!
I’m a grad student at MIT working on distributed systems, and have been developing a research prototype of a large-scale SQL-compatible database with built-in support for cache maintenance. It is intended for read-heavy applications, and already significantly outperforms regular databases and the common memcached+database setup.
For the purposes of evaluation of this system, it would be extremely useful to have a good idea of what a realistic workload looks like for a relatively read-heavy, but decently complex site such as Lobste.rs. To that end, I was hoping to get some insight into the Lobste.rs workload by running some analytics queries on the Lobste.rs database and web logs. @pushcx seemed open to this, but suggested that I post here first to solicit some feedback on the queries I want to run and any potential privacy implications they may have.
I realize it wouldn’t be okay to give out the raw data, but a rough idea of things like the page-visit distribution (e.g., 84% front-page, 10% comment view, 5% upvote, 1% comment), popularity skew (data of #upvotes vs. article ID, or views vs. article ID, for some period of time would be super helpful!), read-write ratio, etc., would go a long way towards building a reliable evaluation (and guiding the future design of the system!). It could also be extremely useful to other researchers who may want to build solutions for “real” web applications.
First, we want to identify the distribution of upvotes across stories. Specifically, how many stories have various (binned) upvote counts. For example, this would tell us how many stories have thousands of votes, how many have 0-100 votes, etc. Note that we do not distinguish between upvotes and downvotes in this case, since they are the same as far as workload generation is concerned. The specific query we’d like to run against the Lobste.rs schema is:
SELECT ROUND(upvotes + downvotes, -2) AS bucket,
COUNT(id),
RPAD('', LN(COUNT(id)), '*')
FROM stories GROUP BY bucket
Similarly, we’d like to extract the comment count distribution:
SELECT ROUND(comments_count, -2) AS bucket,
COUNT(id),
RPAD('', LN(COUNT(id)), '*')
FROM stories GROUP BY bucket
These both use the technique outlined here to get an approximated histogram, which gives a good overview of the spread and distribution of popularity without giving all the raw data.
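To illustrate what that binning produces (with made-up numbers, not Lobste.rs data), here is roughly what the `ROUND(x, -2)` bucketing plus the log-scaled star column does, sketched in Python:

```python
import math
from collections import Counter

def binned_histogram(values, width=100):
    # Bin each value to the nearest multiple of `width`, like SQL's ROUND(x, -2).
    # (Python's round() uses banker's rounding, while MySQL rounds half away
    # from zero; the difference only shows up exactly at bin midpoints.)
    bins = Counter(width * round(v / width) for v in values)
    # Log-scale the bar like RPAD('', LN(COUNT(id)), '*'), so a bin with
    # thousands of stories doesn't dwarf the rest of the display.
    return {b: (n, '*' * max(1, round(math.log(n)))) for b, n in sorted(bins.items())}

# Made-up vote counts: most stories get few votes, a handful get many.
for bucket, (count, bar) in binned_histogram([3, 7, 12, 48, 96, 104, 130, 980]).items():
    print(f'{bucket:>5} {count:>3} {bar}')
```

The point is that only the bucketed counts ever leave the database, not any per-story values.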
We also want the distribution of upvotes across users. Specifically, how many users have various (binned) upvote counts. This aids us in constructing a workload generator that faithfully models some users being more active than others. The rounding ensures that we only learn how many users have a number of votes in each bin `100*i` to `100*(i+1)`:
SELECT ROUND(counted.cnt, -2) AS bucket,
COUNT(user_id),
RPAD('', LN(COUNT(user_id)), '*')
FROM (
SELECT user_id, COUNT(id) AS cnt
FROM votes GROUP BY user_id
) AS counted
GROUP BY bucket
From the nginx logs, we’d like to run the following for however much of the log is okay (more is better):
grep -vE "assets|fetch_url_attributes|check_url_dupe" access.log \
  | sed -e 's/.*\(GET\|POST\)/\1/' \
        -e 's/ \/\(s\|stories\)\/[^\/]*/ \/stories\/X/' \
        -e '/^GET / s/X\/.*/X\/Y/' \
        -e 's/ \/comments\/[^\/]*/ \/comments\/X/' \
  | awk '/^(GET|POST)/ {print $1" "$2}' \
  | grep -E ' /(stories|comments|login|logout|$)' \
  | sort | uniq -c | sort -rnk1,1
Phew, that’s a mouthful. Let’s first see what it produces:
10 GET /
9 GET /stories/X/Y
4 POST /login
2 POST /stories/X/upvote
2 POST /stories
2 POST /comments/X/upvote
2 POST /comments/X/unvote
2 POST /comments
2 GET /comments
1 POST /stories/X/unvote
1 POST /logout
1 GET /login
1 GET /comments/X/reply
Basically, it looks for requests to “interesting” pages (that is, the frontpage, /stories, /comments, and /login + /logout), masks any dynamic parameters, and then shows the number of requests to each of those resources. This gives us an idea of which types of pages are accessed more frequently, including whether they are generally reads or writes. If it feels too invasive to include absolute counts, percentages of the total would also be fine.
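For anyone who would rather not decode the sed, here is my reading of the masking step sketched in Python (illustrative only; the regexes are approximations of the sed expressions above):

```python
import re

def mask(request):
    # Replace a story's short id with X (both the /s/ and /stories/ forms).
    request = re.sub(r' /(s|stories)/[^/ ]+', ' /stories/X', request)
    # For GETs, also mask the story title slug that follows the id.
    if request.startswith('GET '):
        request = re.sub(r'X/.*', 'X/Y', request)
    # Replace a comment's short id with X.
    request = re.sub(r' /comments/[^/ ]+', ' /comments/X', request)
    return request

print(mask('GET /s/abc123/some_story_title'))  # GET /stories/X/Y
print(mask('POST /stories/abc123/upvote'))     # POST /stories/X/upvote
print(mask('POST /comments/def456/upvote'))    # POST /comments/X/upvote
```

So only the shape of each request survives; story and comment ids never appear in the output.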
We’d love to also learn the distribution of page popularities for both reads and writes, but we figured we’d leave that out of the initial analysis as it does leak a little more information. It’s also a little tricky to get right without just dumping all the data. We can approximate it using upvotes and # of comments, but the two aren’t necessarily perfectly correlated.
I’d love to hear what you all think about these, and whether you can think of any improvements. The ultimate goal here is to construct a workload generator that can simulate a semi-realistic request load for a Lobste.rs-like website. I think the data above should go a long way towards that goal, though I may have missed something.
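To give a flavor of what the generator would do with these statistics, here’s a toy Python sketch. The page weights are the made-up 84/10/5/1 split from earlier in this post, and the Zipf skew of 1.07 is just a guess; the whole point of the queries is to replace these numbers with real ones:

```python
import random

# Hypothetical page-visit mix; the real percentages are what the queries
# above are meant to reveal.
PAGES = ['GET /', 'GET /stories/X/Y', 'POST /stories/X/upvote', 'POST /comments']
WEIGHTS = [84, 10, 5, 1]

def make_generator(num_stories=1000, skew=1.07, seed=None):
    rng = random.Random(seed)
    # Zipf-like popularity: story i is chosen with weight 1/(i+1)^skew,
    # so a few "hot" stories soak up most of the traffic.
    story_weights = [1 / (i + 1) ** skew for i in range(num_stories)]
    story_ids = range(num_stories)
    def next_request():
        page = rng.choices(PAGES, weights=WEIGHTS)[0]
        story = rng.choices(story_ids, weights=story_weights)[0]
        return page.replace('X', str(story))
    return next_request

gen = make_generator(seed=0)
for _ in range(5):
    print(gen())
```

The actual generator is more involved (sessions, login state, etc.), but this is the core sampling loop the measured distributions would feed into.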
Cheers, Jon
Hey folks,
Jon messaged me a day or two ago. I gave him the standard answer about these sorts of inquiries: I’m happy to run queries that don’t reveal personal info like IPs, browsing, and voting, or create “worst-of” leaderboards celebrating the most-downvoted users/comments/stories, that sort of thing. I can’t volunteer to write queries for people, but the schema is on GitHub and the data is in MySQL and nginx logs, so it’s straightforward to do.
A couple years ago jcs ran some queries for me, and I wanted to continue that, especially as the public stats that answered some popular questions have been gone for a while. It’s useful for transparency, and because it’s just fun to investigate interesting questions. I’ve already run a few queries for folks in the chat room (the only one I can remember off the top of my head is how many total stories have been submitted; we passed 40k last month).
I asked Jon to post publicly about this because it sounded like he had significantly more than one question he was idly curious about, to help spread awareness that I’ll run queries like this, and get the community’s thoughts on his queries and the general policy. I’ll add a note to the about page after this discussion.
I’m going offline for a couple hours for a prior commitment before I’ll have a chance to run any of these, but it’ll leave plenty of time for a discussion to get started or folks to think up their own queries to post as comments.
Wasn’t the concept of Differential Privacy developed for exactly this purpose: querying databases containing personal data while maintaining as much privacy as possible? Maybe it could be employed here?
In this particular case I don’t think the counts are actually sensitive, so it’s unclear that applying DP is even necessary. But I’ll ping Frank McSherry who’s one of the primary proponents of DP in academia nowadays and see what he thinks :) Maybe with DP we could extract what is arguably more sensitive information (e.g., by reducing the bin widths).
That seems doable, but I have the strong suspicion that if I wing it I’ll screw something up and leak personal info. So hopefully Frank can chime in with some good advice.
I’m here! :D I’m writing up some text in more of a long-form “blog post” format, to try and explain what is what without the constraints of trying to fit everything in-line here. But, some high-level points:
Operationally queries one and two are pretty easy to pull off with differential privacy (the “how many votes per article” and “how many votes per user” queries). I’ve got some code that does that, and depending on the scale you could even just use it, in principle (or if you only have a SQL interface to the logs, we may need to bang on them).
The third query is possibly not complicated, depending on my understanding of it. My sed-fu is weak, but to the extent that the query asks only for the counts of pre-enumerable strings (e.g. POST /stories/X/upvote) it should be good. If the query needs to discover which strings are important (e.g. POST /stories/X/*) then there is a bit of a problem. It is still tractable, but perhaps more of a mess than you want to casually wade into.

Probably the biggest question mark is about the privacy guarantee you want to provide. I understand that you have a relatively spartan privacy policy, which is all good, but clearly you have some interest in doing right by the users with respect to their sensitive data. The most casual privacy guarantee you can give is probably “mask the presence / absence of individual votes/views”, which would be “per log-record privacy”. You might want to provide a stronger guarantee of “mask the presence / absence of entire individuals”, which could have a substantial deleterious impact on the analyses; I’m not super-clear on which guarantee you would prefer, or even the best rhetorical path to take to try and discover which one you prefer.
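To make the per-log-record option concrete, here is a minimal Python sketch of the Laplace mechanism applied to binned counts (purely illustrative, and not the code I mentioned above):

```python
import math
import random

def dp_counts(counts, epsilon=0.1, rng=random):
    # Adding or removing one log record changes exactly one count by one,
    # so the sensitivity is 1 and Laplace noise of scale 1/epsilon gives
    # epsilon-DP at the per-record level.
    def laplace(scale):
        # Inverse-transform sample from Laplace(0, scale).
        u = rng.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return {k: v + laplace(1.0 / epsilon) for k, v in counts.items()}

# Made-up binned counts; the noisy outputs can be negative or fractional,
# which is fine -- they can be clamped/rounded afterwards at no extra
# privacy cost.
print(dp_counts({0: 412, 100: 37, 200: 5}))
```

Per-individual privacy is harder because one user can contribute many records, which inflates the sensitivity and hence the required noise.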
Anyhow, I’m typing things up right now and should have a post with example code, data, analyses, etc. pretty soon. At that point, it should be clearer to say “ah, well let’s just do X then” or “I didn’t realize Y; that’s fatal, unfortunately”.
EDIT: I’ve put up a preliminary version of a post under the idea that info sooner rather than later is more helpful. I’m afraid I got pretty excited about the first two questions and didn’t really do much about the third. The short version of the post is that one could probably release information that leads to pretty accurate distributional information about the multiplicities of votes, by articles and by users, without all of the binning. That could be handy as (elsewhere in the thread) it looks like binning coarsely kills much of the information. Take a read and I’ll iterate on the clarity of the post too.
To follow up briefly on this: yes, it would be useful to avoid the binning so that we could feed more data to whatever regression we end up using to approximate the underlying distribution.
For those who are curious, I’ve started implementing the workload generator here. It currently mostly issues random requests, but once I have the statistics I’ll plug them in and it should generate more representative traffic patterns. It does require a minor [patch](https://github.com/jonhoo/trawler/blob/master/lobsters.diff) to the upstream lobste.rs codebase, but that’s mostly to enable automation.
Shouldn’t this metadata be public (or, at least available to all lobste.rs members), or not given to any 3rd parties at all? If there’s a reason for it not being public, then there’s risk involved with giving it to 3rd parties. (Who may have best data security ever, but lose data for reasons beyond their control)
Personally I’d love to see stats that could be generated by scraping the site anyway available to everyone, all the time, it’s interesting to see these kinds of distributions and things.
Hit counts which are only available from web server logs seems more sensitive / private and less interesting to me.
Oh, sorry, I wasn’t implying that this data needed to be sent to me specifically; I’d be happy for the metrics to be posted somewhere public. I think there is an argument against making the entire db public though:
etc. One option would be to anonymize the database dump before posting it, but I think having some well-defined metrics instead will get us quite far, with less potential for accidentally leaking something.
As for scraping the site, I completely agree with you. It would be super useful to get a dump of all stories + comments + vote counts for the purposes of re-constructing (or even just backing up) lobste.rs elsewhere.
In my particular case, hit counts are important for a couple of reasons:
What has to be factored in here is that what’s publicly on the site can be compared to what’s in the dump to deanonymize quite a few users. That’s made a bit easier by the relatively small number of commenters and submitters, whose activity is already visible on the main site.
Related to public backups of the site, a few months ago Archive Team scraped the entire site. I don’t know where to find the archive they made (maybe it all just went into the Wayback Machine?) or if they’re continuing to pick up new stories.
Results here. Thanks for your patience, @jonhoo, I was traveling.
nginx grep:
This log started 2018-02-11 04:40:31 UTC. I edited some tokens and login information out, replacing them with [elided].

This is fantastic, thank you!
Looks like my log grepping wasn’t perfect, but we can clean up the remainder pretty easily. Feel free to edit out any lines from the nginx log with a count `< 10`. I posted a slightly updated command in a comment a few days ago that includes `/recent` and `/u` and makes some other fixes – if you could run that one (mostly for the `/recent` numbers), that’d be awesome. It looks like the comment and count numbers are a little smaller than I expected, which makes the distribution hard to infer from the few resulting data points. Any chance you could re-run the first two commands with `ROUND(..., -1)` instead (rounds to the nearest 10 instead of the nearest 100)?

The updated nginx grep didn’t work. My sed (identifies itself as `sed (GNU sed) 4.4`) errored on the three latter instances of `//`. I took a stab at correcting it but couldn’t immediately spot what it should’ve been. And, sure, here’s the first two with the smaller buckets:
Huh, seems like something weird happened when I copied it. Here, try this one?
Oops, looks like I forgot `/recent` again in that query :( Could you run:

And also let me know the start time of the log and the time you run the query (so I have an idea of the interval)?
The workload generator is also shaping up over here :)
Timestamp at the start is `11/Feb/2018:04:40:31 +0000`, end is `27/Mar/2018:01:26:49 +0000`. I’m glad to see these stats going into your work, and I look forward to seeing your finished work submitted as a story.

Yeah, the workload generator is shaping up pretty nicely! FWIW, it looks like the next scaling bottleneck lobste.rs is likely to experience is Ruby at ~100x the current load on an 8-core server, followed by MySQL scaling of the transactional update to the traffic stats at around ~2000x the current load. So still quite a bit of headroom!
Also, I made (yet another) stupid mistake with the query above: it now doesn’t include the frontpage, which is arguably quite important, due to a missed `|`. Try again? O:)

Nice to hear we should have smooth sailing for some time.
Hehe, yeah, I think you should be good ;)
Hmm, now there’s no entry there for `GET /stories/X`? Did you perhaps miss the first line when copying? Looks like the data is all from the same logging period though, so I can just manually combine them :) Thanks!

I also just realized that we’re missing the vote distribution for comments, which’d be super handy!
To everyone who’s curious about the results of this, and to @pushcx in particular, the paper is now published! https://jon.tsp.io/papers/osdi18-noria.pdf
Thanks for publishing it. This is a really old thread, though. I suggest you submit it as its own story, marked as authored by you, with a reference to this thread in the text field. Then it will get more attention.
Yeah, @pushcx made the same suggestion privately. That was the plan all along, but I wanted to wait until we’d given the conference talk (which we just did!) and polished up the codebase + docs a little. Submitted it now as https://lobste.rs/s/uxgjha/noria_dynamic_partially_stateful_data !
It’s 3 AM here so I haven’t thought super hard about the data you’re suggesting retrieving, but offhand it seems okay. I do have one clarifying question; this:
seems to imply that the data might be shared. How? Will it be fully public? From what I see here I’m pretty sure I’d be fine with that, I’m just wondering :)
Also, as a side note, I assume you’ll submit your paper to Lobste.rs when you’re done ;)
Those performance results sound super interesting, checking out the links you posted is on my todo list now.
I was imagining that the results of running these queries would be posted publicly by @pushcx, or whomever ends up running them. If you or others are concerned about that, I could just use them directly myself, but I think it’d be advantageous to just make them public :)
Hehe, yes, the paper will appear here when it is (hopefully) published.
One thing I have been curious about is daily usage patterns for tags. If I submit something on a topic, I prefer to do it when most folks interested in it are reading; otherwise, they miss out. It seems like some tags and days/times get different results than others, but I’m not sure. I’d like to see patterns on:

- Views by day and time for all threads.
- The same for each tag, in a way where we can also look at combinations.

Might be helpful.
From a sysadmin PoV, those kinds of metrics are very interesting for monitoring purposes (I’d just add latency alongside them).

In my previous and current jobs we’ve been using Prometheus for that, and it’s SUPER useful. Although I reckon it’s probably too huge for just a Lobsters app.
It’s a little bit trickier to extract latency measurements from the nginx logs + the db without additional instrumentation. I agree with you though, those kinds of numbers would be very useful for system monitoring. For my purposes, they’re not particularly interesting, as I’d use the resulting workload generator to generate load and then measure system performance.
@pushcx seems I missed `/recent` and `/u` in that list, which would be a little unfortunate. Here’s a fixed command: