Jon messaged me a day or two ago. I gave him my standard answer to these sorts of inquiries: I’m happy to run queries as long as they don’t reveal personal info like IPs, browsing, or voting, and don’t create “worst-of” leaderboards celebrating the most-downvoted users/comments/stories, that sort of thing. I can’t volunteer to write queries for people, but the schema is on GitHub and the logs are MySQL and nginx, so it’s straightforward to do.
A couple years ago jcs ran some queries for me, and I wanted to continue that, especially as the public stats that answered some popular questions have been gone for a while. It’s useful for transparency, and because it’s just fun to investigate interesting questions. I’ve already run a few queries for folks in the chat room (the only one I can remember off the top of my head is how many total stories have been submitted; we passed 40k last month).
I asked Jon to post publicly about this because it sounded like he had significantly more than one question he was idly curious about, to help spread awareness that I’ll run queries like this, and to get the community’s thoughts on his queries and on the general policy. I’ll add a note to the about page after this discussion.
I’m going offline for a couple hours for a prior commitment before I’ll have a chance to run any of these, but it’ll leave plenty of time for a discussion to get started or folks to think up their own queries to post as comments.
Wasn’t the concept of Differential Privacy developed for exactly this purpose: querying databases containing personal data while preserving as much privacy as possible? Maybe it could be employed here?
In this particular case I don’t think the counts are actually sensitive, so it’s unclear that applying DP is even necessary. But I’ll ping Frank McSherry who’s one of the primary proponents of DP in academia nowadays and see what he thinks :) Maybe with DP we could extract what is arguably more sensitive information (e.g., by reducing the bin widths).
That seems doable, but I have the strong suspicion that if I wing it I’ll screw something up and leak personal info. So hopefully Frank can chime in with some good advice.
I’m here! :D I’m writing up some text in more of a long-form “blog post” format, to try and explain what is what without the constraints of trying to fit everything in-line here. But, some high-level points:
Operationally, queries one and two (the “how many votes per article” and “how many votes per user” queries) are pretty easy to pull off with differential privacy. I’ve got some code that does that, and depending on the scale you could even just use it, in principle (or, if you only have a SQL interface to the logs, we may need to bang on them).
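To make the “easy to pull off” part concrete, here’s a minimal Python sketch (my own illustration, not Frank’s code) of the Laplace mechanism applied to per-article vote counts. It assumes we’re protecting each individual vote, so each count has sensitivity 1; protecting a user’s entire voting history at once would require a larger noise scale.

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def dp_counts(counts, epsilon):
    """Add Laplace(1/epsilon) noise to each count.

    Assumes sensitivity 1 per counter, i.e. we are hiding the presence
    of a single vote, not a user's whole voting history.
    """
    scale = 1.0 / epsilon
    return {k: v + laplace_noise(scale) for k, v in counts.items()}
```

Repeated releases would of course compound the privacy budget; that bookkeeping is outside this sketch.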
The third query may not be complicated, depending on whether I’ve understood it correctly. My sed-fu is weak, but to the extent that the query asks only for the counts of pre-enumerable strings (e.g. POST /stories/X/upvote) it should be fine. If the query needs to discover which strings are important (e.g. POST /stories/X/*) then there is a bit of a problem. It is still tractable, but perhaps more of a mess than you want to casually wade into.
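For the pre-enumerable case, the mechanics are simple: declare the patterns up front, tally matches, add noise. A hypothetical sketch (the pattern names and regexes here are made up for illustration); fixing the pattern set in advance is exactly what sidesteps the harder “which strings exist?” discovery problem.

```python
import math
import random
import re

# Hypothetical request patterns, enumerated before looking at the log.
# Because this set is fixed up front, we never have to *discover*
# strings from the data itself, which is the genuinely tricky DP case.
PATTERNS = {
    "story_upvote": re.compile(r"^POST /stories/\S+/upvote"),
    "comment_create": re.compile(r"^POST /comments"),
}

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def noisy_request_counts(log_lines, epsilon):
    # Sensitivity 1 per counter if we hide a single request; a user who
    # issues many requests would need a proportionally larger scale.
    counts = {name: 0 for name in PATTERNS}
    for line in log_lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                counts[name] += 1
    scale = 1.0 / epsilon
    return {name: c + laplace_noise(scale) for name, c in counts.items()}
```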
Anyhow, I’m typing things up right now and should have a post with example code, data, analyses, etc. pretty soon. At that point, it should be clearer to say “ah, well let’s just do X then” or “I didn’t realize Y; that’s fatal, unfortunately”.
EDIT: I’ve put up a preliminary version of a post on the theory that info sooner rather than later is more helpful. I’m afraid I got pretty excited about the first two questions and didn’t really do much about the third. The short version of the post is that one could probably release information that leads to pretty accurate distributional information about the multiplicities of votes, by article and by user, without any of the binning. That could be handy, as (elsewhere in the thread) it looks like coarse binning kills much of the information. Take a read, and I’ll iterate on the clarity of the post too.
To follow up briefly on this: yes, it would be useful to avoid the binning so that we could feed more data to whatever regression we end up using to approximate the underlying distribution.
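As one illustration of what unbinned data buys: instead of least-squares on a coarse histogram, one could fit, say, a power-law tail directly by maximum likelihood on the raw multiplicities. This sketch uses the continuous approximation to the discrete MLE from Clauset, Shalizi, and Newman; whether a power law is actually the right family for vote counts is an assumption, not something the data has confirmed.

```python
import math

def fit_power_law_alpha(samples, xmin=1):
    """Estimate alpha in p(x) ~ x^(-alpha) for x >= xmin, using the
    continuous-approximation MLE of Clauset et al. It consumes raw,
    unbinned samples, which is precisely what binning throws away.
    """
    xs = [x for x in samples if x >= xmin]
    return 1.0 + len(xs) / sum(math.log(x / (xmin - 0.5)) for x in xs)
```

Noise in DP-released counts would bias this estimate a bit, so one would want to account for that before trusting the fitted exponent.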
For those who are curious, I’ve started implementing the workload generator here. It currently mostly does random requests, but once I have the statistics I’ll plug them in and it should generate more representative traffic patterns. It does require a minor [patch](https://github.com/jonhoo/trawler/blob/master/lobsters.diff) to the upstream lobste.rs codebase, but that’s mostly to enable automation.