Threads for jonhoo

  1. 6

    There’s (for once) some decent discussion over at the orange site.

    1. 7

      Yeah, because this stuff belongs on the orange site. Keep Lobsters weird.

    1. 3

      Nice talk, but the mic constantly auto-leveling was very distracting. oof.

      1. 5

        Ugh, yeah, I’m aware. I tried running it through a normalizer, but to no avail.

        1. 7

          It sounds a bit weird but running it through an expander (to get quiet parts as loud as the louder parts) followed by a really aggressive compressor (really high ratio, to smooth out volume spikes) might do the trick.

          1. 9

            I know this is getting off topic, and maybe this is a weird question, but: how did you learn the knowledge required to make a suggestion like this? I’ve tried somewhat half-heartedly learning more about audio, but always wind up baffled by the complexity of the interface for software tools in this domain.

            Tangentially, which tools would you personally use to implement your suggestion?

            1. 19

              There are plenty of people who are way more knowledgeable than me at this, but I’m happy to share what I know. I’m actually going to answer your questions out of order because one of the somewhat less interesting answers ended up being really long.

              Lots of this ended up being a ramble to get my thoughts out, but if you’ve got other questions, you’re welcome to ask them.

              Also, reasoning behind the expander/compressor (actually might be better to just severely boost the input signal and add a compressor, maybe with a limiter as well, but the effect will be similar) rather than a normalizer is because normalizers generally operate on the whole track at once (It’ll bring the highest peak up to a specific volume) while a compressor will operate on a slice of audio at once. Anyway the issue is with the dynamics, so the goal is to tweak them to make it better.

              EDIT: after playing with it, just boost the input signal and add a really heavy compressor with a limiter. The expander was actually not a good idea.
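A rough sketch of that difference in Python, assuming a toy track made of a quiet passage followed by a loud one. The sample values, threshold, ratio, and input gain are invented for illustration, and the compressor here works per sample with no attack/release smoothing, which real plugins would have.

```python
# Toy illustration only: sample values and settings are made up.

def peak_normalize(samples, target_peak=1.0):
    """Apply ONE gain to the whole track so its loudest sample hits target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def compress(samples, threshold=0.25, ratio=8.0, input_gain=1.0):
    """Boost the input, then shrink whatever exceeds the threshold by `ratio`."""
    out = []
    for s in samples:
        boosted = s * input_gain
        level = abs(boosted)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        out.append(level if boosted >= 0 else -level)
    return out


def limit(samples, ceiling=1.0):
    """Hard limiter: clamp anything that still exceeds the ceiling."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


def loud_to_quiet_ratio(samples, split):
    """Peak of the loud passage divided by peak of the quiet passage."""
    quiet = max(abs(s) for s in samples[:split])
    loud = max(abs(s) for s in samples[split:])
    return loud / quiet


# First three samples: quiet passage. Last three: loud passage.
track = [0.05, -0.04, 0.05, 0.8, -0.9, 0.85]

normalized = peak_normalize(track)
squashed = limit(compress(track, threshold=0.25, ratio=8.0, input_gain=5.0))

print("original  :", loud_to_quiet_ratio(track, 3))
print("normalized:", loud_to_quiet_ratio(normalized, 3))
print("compressed:", loud_to_quiet_ratio(squashed, 3))
```

Normalizing leaves the gap between the quiet and loud passages unchanged because every sample gets the same gain, while boosting into a heavy compressor narrows it.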

              always wind up baffled by the complexity of the interface for software tools in this domain.

              This was a huge problem for me starting off. It’s very hard to explain over text, but I think this is something you can only really learn by experimenting (though if you’re specifically looking for something to learn with, Mixing Audio: Concepts, Practices, and Tools and Mixing With Your Mind are both very good books for learning mixing theory). Most software tools end up being pretty close to their analog counterparts. DAWs usually have a set of faders and an EQ, with a place to add additional plugins.

              Same as software development, start with using only the portions you know and start adding more and more complexity as you learn it. Levels/gain and EQ should probably be first, followed by compression. Once you learn that, there’s a ton of other stuff out there.

              which tools would you personally use to implement your suggestion?

              In general I like using Logic Pro X. The built-in plugins are pretty good quality wise and it’s reasonably easy to use. Reaper would be another recommendation, especially if you need something cheaper and/or on multiple platforms.

              This one’s a little different because it deals with video as well. I’m not a fan of Final Cut Pro overall, but I’m pretty sure there’s a way to use that for the video and deal with the audio in Logic Pro X.

              I know open source tools exist (I think Ardour is the main one), but I haven’t found any that can measure up to the same quality of Logic or Pro Tools (or even Reaper). If you’re looking for something smaller to play around with, Reaper is fantastic, cheaper, and has really good platform support. I’ve tried moving back to Linux for my laptop (I ran the Linux Users Group in my college and it’s still something I’m passionate about) but because the quality of audio tools (or at least the audio tools I know - there’s also pro tools on Windows, but I’m not a fan of that either) is so much better on MacOS, I’m kind of stuck here.

              how did you learn the knowledge required to make a suggestion like this?

              1. Experimenting and playing around. Back in high school, I used to make super low quality recordings of a band I was in at the time - this consisted of hanging a microphone from a hook in my parents’ basement and manually setting levels by moving it closer or farther away from some instruments. This eventually moved on to getting a small mixer and managing separate microphones but recording the mixed down version rather than separate components.
              2. I never ended up finishing it (because I just couldn’t finish Music Theory II) but I have about 2/3 of a “Music Technology” minor (which is essentially Sound Engineering lite - I say lite specifically because there’s a lot of stuff I don’t know that a full Sound Engineer or Mixing Engineer would) along with my Software Engineering degree. Specifically, there was a class where every week or two they gave us a bunch of raw tracks and told us what to watch out for (this class was incredibly helpful - to be honest a lot of it was learning hacks to make things work). We had to come back with a rough mix.
              3. Experimenting and playing around. After college, I helped run sound for a few churches. It’s not always the most interesting (and I know plenty of people aren’t fans of churches in general - if you have ethical issues with this, there are other options, but they’re harder to find), but you learn quite a bit when you just have to make something work.
              4. For the last 2 to 3 years or so I’ve also been helping run live sound at a swing dance every week. This started off as just loaning some equipment I had lying around but turned into me filling in whenever other people couldn’t make it and eventually being one of the main 2 people who run sound there. I tend to be more interested in recordings where I can tweak things after the fact, so I try to focus on live multi-track recordings and the other people tend to focus on live sound.
              1. 8

                Really interesting! Thanks for the lovely response. I’ve added those two books to my list, so if I ever decide to dive back into this, I’ll have a better place to start. :-)

          2. 1

            Fyi: I’ve downloaded an mp4 of the video, threw it inside kdenlive, split audio and added “AutoCorrection->Normalize”. Now it’s equally loud, except a little too loud :P

            1. 2

              That’s interesting — when I tried normalizing, it didn’t help at all, probably due to what was discussed in the other comment thread about normalizing usually being applied uniformly across the entire audio. Maybe kdenlive does something smarter?

              Also FYI, I now have a lavalier mic that I’ll use for future recordings \o/

        1. 8

          Extra relevant to Lobsters was that @jonhoo (Jon Gjengset) also tested Noria on Lobsters app for a 5x speed-up. He also submitted a podcast and a video on it.

          1. 4

            Thanks for the mention! All the data for the Lobsters part of the paper came out of this thread.

            1. 3

              Hehe, this is actually how I found out about lobste.rs. I agree— excellent video! ^^

              1. 2

                Excellent video, I highly recommend it. Thanks for sharing it @nickpsecurity!

              1. 3

                This is pretty much exactly the kind of stuff we covered in the “data wrangling” lecture in our lecture series on programmer tools for anyone who’d like to learn more about this kind of “magic” :)

                1. 2

                  Hey @jonhoo,

                  If it’s not too much trouble, do you mind linking to our section on getting invited? We have been seeing an uptick in the number of flustered invitation seekers recently and I think it would help.

                  1. 1

                    Hey! Where were you thinking I would link to that from? There’s nothing in 6.HT that references Lobsters beyond the link to “see more discussion here” I think?

                    1. 1

                      Would a bullet below / an emdash next to the lobste.rs link work?

                      1. 1

                        I guess I’m confused about why exactly people following that link would need invitations? The links there are intended for people who want to read the comments that were made at time of posting. I suppose I could add something like (invite-only) with a link to that page, but that may have the opposite effect of what you’re after, since it might stop people from clicking onto Lobsters in the first place?

                        1. 2

                          Do you have any objections to me making a pull request on the site with the change this evening? I think that would clear up all of your confusion.

                          1. 2

                            Your post did so well (congrats!) that the prominent mention of Lobsters led to a wave of folks dropping by the irc channel looking for invites. There’s a big overlap between people who want to watch videos and read about being better developers and people who want to join a community that’s largely about that same topic.

                            Rather than cause for celebration at the great PR, it’s been really frustrating for both them and us because it’s clear they haven’t read /about#invitations or otherwise been on the site long enough to acculturate at all: they’re confused about why they have to ask for an invite and how to do so politely; the channel regulars feel swamped by rude, demanding randos. Slowing down onboarding is a feature rather than a bug of the invite system, but you sent enough attention our way that the deliberately-informal non-process is breaking down.

                            I think @355E3B asked for the link in the same spirit as you publishing these (great!) notes: there’s a wealth of powerful information and tools available to developers that it’s not obvious to go look for, so even a small pointer to what’s possible goes a long way. We’d love to have these folks participate in the community and linking to the invite explanation might remove the stumbling block.

                            (Also: wearing my online marketing hat, maybe make the page’s link to the video and notes more prominent? It doesn’t pop out when skimming, so the bullet list of links at the bottom is the most actionable thing a skimmer sees. Easiest thing to do would be to make the text “all lecture videos and lecture notes” the link so that it’s visually bigger, or add a button for design contrast. Or just pull the contents of /lectures into this page; I don’t think it would make it too long.)

                            1. 2

                              Ah, I see. I wasn’t aware that the effect had been that strong! I’ve added a link now — let me know if the text is along the lines of what you were thinking? I’ve also reorganized the front page a little to make it easier to spot the link to lecture contents. I’ll chat to the others about putting the lecture list on the front page.

                              1. 1

                                All of that looks great to me.

                    1. 3

                      There’s also some good discussion over at Hacker News.

                      1. 2

                        It’s a good article and a useful starting point. I would like to see follow-ups with analysis and approaches for when one or more of the security hardening flags are enabled.

                        In particular, it may not be possible to exploit the example vulnerable program at all with modern security hardening added. Programmers I talk to who have not studied exploitation do not really seem to understand this; all buffer overflows are thought of as the end of the world. It may need a more complex interactive vulnerable binary from which we can first extract secrets, then build an exploit payload.

                        1. 3

                          I wrote this as an accompanying text to MIT’s 6.858 Computer Security lab on buffer overflows, and in that lab we actually do have the students pull off a similar attack without -z execstack. Address randomization is harder, but doable through, for example, user-controlled format strings (though of course, there are also compiler checks that help mitigate that). The efficacy of stack canaries really depends on what kind of stack canaries are used, and the mechanism of attack. Terminator canaries for example are often fixed, and given a good attack vector you may be able to just blow right through them. Similarly, if you’re not doing just a classic buffer overflow, but something like a dangling pointer attack, then the canaries don’t help at all. -D_FORTIFY_SOURCE similarly only helps when the compiler can guess appropriate bounds.

                          Overall, I think I’d say that if all compiler mechanisms are turned on, and the code is well written, it’s pretty darn hard to pull off an exploit. However, it’s so rare that both of those are true, especially in older software, and that’s why we end up with so many working exploits. In some sense, this is due to simple math: the attacker only needs one vulnerability, whereas the developer needs to ensure there are no vulnerabilities.

                            1. 1

                              Ah, yes, that looks like another interesting take on how to defeat stack canaries! Thanks for sharing.

                          1. 3

                            A couple minor corrections.

                            gcc -g -fno-stack-protector -z execstack vulnerable.c
                            

                            should be

                            gcc -g -fno-stack-protector -z execstack vulnerable.c -o vulnerable
                            
                            Notice how it says rip at 0x7fffffffed18? That’s the address of the stored return address!
                            

                            value should be 68 not 18.


                            When I tried running it I got this:

                            $ python3 exploit.py | env - setarch x86_64 -R ./vulnerable
                            *** buffer overflow detected ***: ./vulnerable terminated
                            ls
                            Traceback (most recent call last):
                              File "exploit.py", line 30, in <module>
                                fp.flush()
                            BrokenPipeError: [Errno 32] Broken pipe
                            Aborted
                            

                            I’m on nixos. Adding the flag -D_FORTIFY_SOURCE=0 fixed it and the exploit works.

                            1. 3

                              Good catches, thanks! Fixed in https://github.com/jonhoo/thesquareplanet.com/commit/21347af0e2cd27692b96159c8dab695ed10415e6. I’ll also add a paragraph on -D_FORTIFY_SOURCE.

                            1. 14

                              My feelings are kind of mixed so far. The lightweight UI and responsive site are a breath of fresh air. What’s a little jarring is how much of the service is centered around email. I’ve never been part of a mailing list, and emailing code to other people sounds like something from 20 years ago, but maybe I’m just a young whippersnapper that doesn’t know what he’s talking about. Git is already a complicated tool, and adding email to the mix just increases the cognitive load. I’ll still learn how to use it because it sounds kind of interesting, but my preference would still be some kind of browser interface.

                              1. 19

                                I think you should give email a chance. Git has built-in tools for working with email and you can do your entire workflow effectively without ever leaving your terminal. Sending a patch along for feedback is just git send-email -1. (-1 meaning one commit, -2 meaning the last 2 commits, etc). Here’s a guide, which is admittedly terse and slated to be replaced with a more accessible tutorial:

                                https://man.sr.ht/git.sr.ht/send-email.md

                                That being said, web tools are planned to seamlessly integrate with this workflow from a browser.

                                1. 11

                                  That being said, web tools are planned to seamlessly integrate with this workflow from a browser.

                                  I would use that.

                                  1. 4

                                    I like the email workflow, but I also have to be realistic - it is unlikely that my colleagues or drive-by contributors would adopt it. So, in practice it will mean fewer contributions and less cooperation.

                                    The GitHub-like workflow is something that is ingrained now and has a relatively low barrier to entry. So, if something is going to take over, it’s something that is very similar, such as GitLab or Gitea.

                                    Of course, there will always be projects that cater to an audience that feels at home with an email workflow.

                                    It’s good to hear that there will be web tools as well.

                                    1. 4

                                      I like the email workflow, but I also have to be realistic - it is unlikely that my colleagues or drive-by contributors would adopt it.

                                      I think as this workflow proliferates, this will become less and less true. It’s remarkably easy to make a drive-by contribution with email if you already have git send-email working, easier even than drive-by GitHub pull requests.

                                      1. 4

                                        git send-email working

                                        that’s a big ask

                                        git send-email needs a bunch of perl packages, which often means you need to set up perl packaging.

                                        Depending on your distro/OS this can be tricky, especially because git send-email needs a bunch of network packages and they don’t always cleanly install and you have to figure out why (except you don’t know much about perl packaging, so you can’t).

                                        There have been multiple cases on different OSes (i think osx and some version of ubuntu) that i gave up after half an hour of various cpan commands trying to get things to work. I’m not even going to try setting that up on Windows.

                                        Furthermore, the UX of git-send-email is terrible. Sending followup patches is annoying, for one.

                                        All this has forced me to try and paste patches directly into an email client. But this is broken, too. GMail, for one, converts tabs to spaces in plaintext emails, breaking patches. I could use a local client, but setting up a client well is a lot of work and confusing (i could also rant for a while about why this is the case, but i won’t) and I don’t really want to switch my workflow over to using a client.

                                        Furthermore, half the patch mailing lists I’ve worked with have hard-to-figure-out moderation rules. They’ll outright reject some kinds of emails without telling you, and because many are human moderated it’s hard to know if your email setup worked especially if you’re using git-send-email (which you may not have invoked or set up correctly) because 90% of the time your patch won’t show up on the list and you have no idea which of the many possible reasons for that is the case.

                                        Despite all this I’ve submitted quite a few patches to a patch mailing list (hell, I’ve been involved enough in mailing-list-based project to have commit), either by lucking out on the perl setup for send-email, by temporarily setting up a client that doesn’t sync, or by sending patches through gmail with “ignore the whitespace please, that’s gmail’s fault, I’ll fix it when I commit”. It’s a chore each time.

                                        I’ve given email multiple chances. It doesn’t work. The activation energy of email for patch contributions is quite high.


                                        The web UI thing sounds like a good idea, especially if it can handle replies. It’s basically what I’ve been suggesting projects on mailing lists do for ages. Use email all you want, just give me a way to interact with the project that doesn’t involve setting up email.

                                        1. 2

                                          Almost no one has to package git’s perl dependencies themselves. Doesn’t your OS have a package for it already? And as someone who has packaged git before, it wasn’t really that bad.

                                          Also, the golden rule of emailing patches is never paste them into some other mail client.

                                          1. 2

                                            Also, the golden rule of emailing patches is never paste them into some other mail client.

                                            Paste not, but maybe attach? FreeBSD don’t like it, but it’s OK for Postgres.

                                            1. 2

                                              I generally prefer that people don’t attach patches, either. IMO the best way to send patches is send-email.

                                              1. 1

                                                “IMO” and “the best” is perfectly fine. But I was under the impression that it was unconditionally the only way to submit patches, when I wanted to improve sr.ht’s PG DB schemas.

                                                1. 2

                                                  Each project on sr.ht can have its own policies about how to accept patches. sr.ht itself does require you to submit them with send-email (i.e. for patches to the open source sr.ht repositories).

                                                  1. 1

                                                    Can you elaborate on what you dislike about sending patches with a normal MUA? It’s certainly a lot easier for someone who has spent the time to configure their MUA to be able to re-use the config they’ve already got rather than configuring a new tool they’ve never used before.

                                                    1. 3

                                                      The main issue is that nearly all MUAs will mangle your email and break your patches, which is annoying for the people receiving them and will be more work for you in the long run. Also, most end-user MUAs encourage the use of HTML emails, which are strictly forbidden on sr.ht. Also, code review usually happens by quoting your patch, trimming the fat, and replying inline. This is more annoying if you attach your patch to the email.
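To illustrate why mangling matters, here is a minimal Python sketch. It assumes a hypothetical two-line Makefile and a mail client that rewrites tabs as four spaces; `hunk_applies` is a made-up helper for this example, not how git or patch actually work internally.

```python
def hunk_applies(file_lines, hunk_lines):
    """Very rough check: do the hunk's context and removed lines match the file?

    Real tools (git apply, patch) also handle offsets and multiple hunks; this
    only compares against the start of the file, character for character.
    """
    expected = [line[1:] for line in hunk_lines if line[:1] in (" ", "-")]
    return file_lines[:len(expected)] == expected


# A Makefile, where the leading tab is significant.
makefile = ["all:", "\tcc -o hello hello.c"]

# One hunk of a unified diff against that file.
hunk = [
    " all:",
    "-\tcc -o hello hello.c",
    "+\tcc -O2 -o hello hello.c",
]

# Simulate a mail client that rewrites tabs as spaces.
mangled_hunk = [line.replace("\t", "    ") for line in hunk]

print(hunk_applies(makefile, hunk))
print(hunk_applies(makefile, mangled_hunk))
```

Once the tab is rewritten, the "removed" line in the patch no longer matches anything in the file, so the receiver has to repair the patch by hand.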

                                                      Setting up git send-email is pretty easy and will work every time thereafter. It’s also extremely convenient and fits rather nicely into the git workflow in general.

                                                      1. 1

                                                        I see; so it has more to do with the fact that you can’t trust most popular MUAs not to screw up the patch rather than any inherent problem with that flow. For a well-behaved MUA it should be fine, but assuming a MUA is well-behaved (or even assuming that a user knows whether theirs is or not) isn’t a good bet.

                                                        Thanks.

                                            2. 1

                                              Almost no one has to package git’s perl dependencies themselves. Doesn’t your OS have a package for it already?

                                              No, i don’t mean you have to package them, but you have to install them and the installation isn’t always smooth. It’s been a while since I did this so I don’t remember the precise issues but i think it has a lot to do with the TLS part of the net stack. Which kinda makes sense, openssl packaging/linking has issues pretty much everywhere (especially on OSX).

                                              Also, again, Windows. A lot of devs use Windows. I got involved in open source on Windows, back when I didn’t have my own computer. I could use Git and Github, but I’m pretty sure I’d have been unable to set up git-send-email if I had to at the time. Probably can now, but I’m an experienced programmer now.

                                              Also, the golden rule of emailing patches is never paste them into some other mail client.

                                              I know, except:

                                              • now handling replies is annoying
                                              • now i need to set up git-send-email, which doesn’t always work
                                              1. 1

                                                Windows devs aren’t in the target audience. I heard from a macOS user that they were able to get send-email working without too much trouble recently, maybe the situation has improved.

                                                now handling replies is annoying

                                                Not really?

                                                now i need to set up git-send-email, which doesn’t always work

                                                git send-email will always work if your email provider supports SMTP, which pretty much all of them do.

                                                1. 1

                                                  Windows devs aren’t in the target audience

                                                  If you’re wishing for email to be the future, you’re going to have to think about windows devs at some point.

                                                  (this choice is also even more hostile to new programmers, as if patch email workflows weren’t newbie-hostile enough already)

                                                  Not really?

                                                  You have to copy message ids and stuff to get lists to thread things properly

                                                  git send-email will always work

                                                  I just told you why it doesn’t always work :)

                                                  1. 3

                                                    I’m prepared to lose the Windows audience outright on sr.ht. Simple as that.

                                                    (edit)

                                                    Regarding message IDs, lists.sr.ht (and many other email archives) have a mailto: link that pre-populates the in-reply-to for you.
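For the curious, a link like that can be sketched in a few lines of Python. This is only an assumption about the general shape of such a link (mailto: URLs can carry header fields as percent-encoded query parameters); the list address, message ID, and `reply_mailto` helper are hypothetical, and lists.sr.ht may build its links differently.

```python
from urllib.parse import quote


def reply_mailto(list_address, message_id, subject):
    """Build a mailto: URL whose In-Reply-To header is already filled in."""
    if subject.lower().startswith("re:"):
        reply_subject = subject
    else:
        reply_subject = "Re: " + subject
    headers = {
        "In-Reply-To": message_id,
        "Subject": reply_subject,
    }
    # Percent-encode every header value, including "<", ">" and "@".
    query = "&".join(
        f"{name}={quote(value, safe='')}" for name, value in headers.items()
    )
    return f"mailto:{quote(list_address, safe='@')}?{query}"


# Hypothetical list address and message ID.
url = reply_mailto(
    "example-list@lists.example.org",
    "<20190101.abc@example.org>",
    "[PATCH] Fix typo",
)
print(url)
```

Clicking such a link opens the mail client with the threading header set, so nobody has to copy the message ID by hand.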

                                                    1. 1

                                                      I’m prepared to lose the Windows audience outright on sr.ht. Simple as that.

                                                      oh, sure, for your own tool it’s fine.

                                                      what I’m saying is that if you’re expecting this workflow to proliferate you will have to deal with this too.

                                                      Do whatever you want with your own tool: I’m just explaining why send-email proliferating is a tall order, and windows is a major draw here.

                                                      Regarding message IDs, lists.sr.ht (and many other email archives) have a mailto: link that pre-populates the in-reply-to for you.

                                                      ah, that’s nice. I may not have encountered lists with this (or been interacting only by email and not using the archive)

                                            3. 1

                                              git send-email needs a bunch of perl packages, which often means you need to set up perl packaging.

                                              Personally I’ve never once seen a unix machine where the perl stack wasn’t already installed for unrelated system-level stuff.

                                              1. 1

                                                to be clear, perl is usually installed, it’s the relevant packages (specifically, the networking/TLS stuff) that usually aren’t

                                                this is particularly bad on OSX which has its own openssl issues, so the Perl SSL packages refuse to compile

                                            4. 4

                                              if you already have git send-email working

                                              Sadly, I think this is extremely uncommon :’(

                                              1. 3

                                                Hence:

                                                I think as this workflow proliferates

                                                1. 5

                                                  I think I’d phrase this as if it proliferates, as if anything I think the number of people with sendmail (or equivalent) working on their computer is going down, not up. It’d be fun to see it rise again due to sr.ht, though I don’t know that I’m optimistic. But perhaps I’m just being overly pessimistic :)

                                                  I do worry about more casual developers though, who may not even really know how to use the command-line. I think an increasing number of developers interact with version control solely through their IDE, and only touch their command-line if they have to copy-paste some commands. It’d be interesting to see if that’s something this workflow can still cater to. Some simple web-based tooling may go a long way there!

                                                  1. 3

                                                    You don’t need to set up sendmail, you just need a mail server with SMTP - which nearly all of them support.

                                                    1. 3

                                                      Sorry, what I meant was more that you have to set up git for e-mail sending. I happen to have sendmail already set up, so all I needed was git config --global sendemail.smtpserver "/usr/bin/msmtp", but I think it’s very uncommon to already have it set up, or to even be comfortable following the instructions on https://man.sr.ht/git.sr.ht/send-email.md.

                                            5. 2

                                              I like the email workflow, but I also have to be realistic - it is unlikely that my colleagues or drive-by contributors would adopt it. So, in practice it will mean fewer contributions and less cooperation.

                                              The great thing about this is it’s not all-or-nothing.

                                              For https://fennel-lang.org we accept patches over the mailing list or from GitHub pull requests. Casual contributions tend to come from GitHub, while the core contributors send patches to the mailing list and discuss them there. Conveniently, casual contributions tend to require less back-and-forth review, (so GitHub’s poor UI for their review features is less frustrating) while the big meaty patches going to the mailing list benefit more from the nicer review flow.

                                            6. 3

                                              … that is if you’ve managed to set it up in the first place, probably without an opportunity to test it – that means that you have to send your commit, not knowing what will come out, to test your setup, your configuration and the command you chose in the first place, which puts quite a lot of pressure on people, especially those who have little experience with projects, let alone email-projects.

                                              That being said, web tools are planned to seamlessly integrate with this workflow from a browser.

                                              very nice.

                                              1. 2

                                                Nah, on sr.ht I have an open policy of “if you’re unsure about your setup, send the patch to me (sir@cmpwn.com) first and I’d be happy to make sure it’s correct”.

                                                1. 2

                                                  I wonder if it would make sense to set up a “lint my prospective patch” email address you could send your patch to first which could point out common mistakes, assuming that kind of thing is easy to write code to detect.

                                                  1. 2

                                                    I plan on linting all incoming emails to lists.sr.ht to find common mistakes like this and reject the email with advice on how to fix it.
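To make the idea concrete, here is a rough sketch of the kind of checks such a linter might run. It is not how lists.sr.ht works; the function name `lint_patch_email`, the three rules, and the 72-character limit are all assumptions picked for illustration.

```python
from email import message_from_string

MAX_WIDTH = 72  # assumed wrap width for the prose part of the message


def lint_patch_email(raw: str) -> list[str]:
    """Return a list of human-readable problems found in a raw email."""
    msg = message_from_string(raw)
    problems = []

    # Patches should be plain text, not HTML or multipart.
    content_type = msg.get_content_type()
    if content_type != "text/plain":
        problems.append(f"body is {content_type}, expected text/plain")

    # git send-email subjects normally start with a [PATCH ...] prefix.
    subject = msg.get("Subject", "")
    if not subject.lstrip().startswith("[PATCH"):
        problems.append("subject does not start with [PATCH")

    # Only check the prose before the diff; diff lines may be long.
    body = "" if msg.is_multipart() else msg.get_payload()
    for number, line in enumerate(body.splitlines(), start=1):
        if line.startswith("diff --git"):
            break
        if len(line) > MAX_WIDTH:
            problems.append(f"line {number} is longer than {MAX_WIDTH} characters")
    return problems
```

A rejection email could then simply list the returned problems along with a link to the setup guide.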

                                                    1. 1

                                                      If you can get this running well and cheaply, you could potentially do an end run around people’s send email setup related issues by hosting a “well formed, signed, patches-only” open email relay, and local git config instructions.

                                                  2. 1

                                                    Is there a way or a plan to have a patch-upload form for example? That might be helpful for beginners.

                                                    1. 3

                                                      Yes, I plan on having a web UI which acts as a frontend to git send-email.

                                              2. 4

                                                I like that it’s using e-mail so it’s “federated” and decentralized by default.

                                                The e-mail workflow has two problems though:

                                                • integrations: usually projects have a lot of checks that can be automated (“DCO present”, “builds correctly”), for e-mail workflow this kind of stuff needs to be built (check out how Postgres does it),
                                                • client configuration: to correctly use this workflow, one needs to configure git send-email (setting up credentials for example), the project configuration (correct sendemail.to and format.subjectprefix) and an e-mail client that sends plain text messages wrapped at 72 characters. Apparently not everyone does that.

                                                Mailing lists vs Github nicely summarizes the benefits of ML over Github but also highlights the number of things maintainers need to set up to run their projects on ML that Github gives them “for free”.

                                                From my point of view sr.ht looks like a great way to validate whether it’s possible to bring easy project collaboration from Github to MLs.

                                                1. 2

                                                  usually projects have a lot of checks that can be automated

                                                  This is planned on being addressed soon on sr.ht with dispatch.sr.ht, which is used today to allow GitHub users to run CI on builds.sr.ht. The same will be possible with patches that arrive on lists.sr.ht.

                                                  client configuration

                                                  There’s a guide for send-email:

                                                  https://man.sr.ht/git.sr.ht/send-email.md

                                                  As for other emails, I’m working on some more tools to detect incorrectly configured clients and reject emails with advice on how to fix it.

                                                  Thanks for the feedback!

                                                  1. 2

                                                    I’m really interested in how far one can push this model.

                                                    Would it build the patch and e-mail back build results? For example with a link to build results and a quick summary?

                                                    Are you also planning for some aggregation of patches? (Similar to what Postgres has). For example Gerrit uses Change-Id to correlate new patches that replace old ones. Would you for example use Message-Id and In-Reply-To with [Patch v2] to present a list on a web interface of patches that are new / accepted / rejected? This interface could be operated from e-mail too I think, e.g. mailing LGTM would switch a flag (with DKIM validation so that the vote is not spoofed).

                                                    By the way I really like how sr.ht is challenging status-quo of existing solutions that just want to mimic GitHub without thinking about basic principles.

                                                    Good luck!

                                                    1. 7

                                                      Would it build the patch and e-mail back build results? For example with a link to build results and a quick summary?

                                                      Yep, and a link to a full build log as well.

                                                      Would you for example use Message-Id and In-Reply-To with [Patch v2] to present a list on a web interface of patches that are new / accepted / rejected?

                                                      Yep!

                                                      This interface could be operated from e-mail too I think, e.g. mailing LGTM would switch a flag (with DKIM validation so that the vote is not spoofed).

                                                      Aye.
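For anyone curious what the header-based grouping could look like, here is a minimal sketch. It assumes each new revision is sent as a reply to an earlier message in the same thread; the function name and the returned data shape are made up for illustration and say nothing about how lists.sr.ht does it.

```python
from email import message_from_string


def group_patch_revisions(raw_emails: list[str]) -> dict[str, list[str]]:
    """Map each thread's root Message-Id to the subjects of its messages."""
    parent = {}   # Message-Id -> In-Reply-To (None for thread roots)
    subject = {}  # Message-Id -> Subject
    for raw in raw_emails:
        msg = message_from_string(raw)
        message_id = msg["Message-Id"]
        parent[message_id] = msg["In-Reply-To"]
        subject[message_id] = msg["Subject"]

    def root_of(message_id):
        # Walk up the reply chain until the parent is unknown or missing.
        seen = set()
        while parent.get(message_id) in parent and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = {}
    for message_id in parent:
        threads.setdefault(root_of(message_id), []).append(subject[message_id])
    return threads
```

A web interface could then show one row per root, with the latest `[PATCH vN]` subject as the current revision.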

                                                      1. 3

                                                        Great!

                                                        By the way I admire your pro-active approach of not only explaining the problem but also building beautiful software that solves the problem! 👍

                                                2. 3

                                                  emailing code to other people sounds like something from 20 years ago

                                                  At least OpenBSD is still doing that on the regular on the tech@ mailing list. It definitely still works.

                                                  1. 2

                                                    And I love it. It’s so damn easy to just email a one-off diff and watch someone land it. No accounts, no registration, no forking repos and dealing with fancy weird web UIs…

                                                  2. 3

                                                    One day, the current generation of “Email is SO 5 minutes ago!” kids are going to wake up and realize that e-mail is an amazing tool.

                                                    Or so I’d like to think :)

                                                    1. 1

                                                      I could be convinced. What’s your argument in favor of email?

                                                      1. 3
                                                        • Inherently de-centralized
                                                        • Can be tuned for nearly real time end to end response or low bandwidth batch processing for where network is at a premium
                                                        • Vendor neutral
                                                        • As rich or as minimal as you want it to be
                                                        • Arbitrary content types - you can send everything from 7 bit ASCII to arbitrarily complex HTML/CSS and varying payload types
                                                        • Readable with everything from /bin/cat to a nice client like Thunderbird and everything in between
                                                        • Rich capabilities for conversation threading
                                                        • Rich search capability client and server side
                                                        • Myriad archival and backup options

                                                        The list goes on.

                                                        For a more end user/business-centric version of this see In Defense of Email

                                                  1. 3

                                                    Nice overview of the concurrent map implementation, and it’s cool to see Lobsters used in a benchmark like this!

                                                    But, does the final benchmark strike anyone else as a bit unfair? Comparing the author’s in-memory database to MySQL and “unnamed other database”, the former and presumably the latter of which will persist writes to disk, doesn’t seem like an apples-to-apples comparison IMHO. Also, the presenter’s comment about “we ran these exact queries” suggests that they didn’t opt in to having a materialized view on the legacy system tests[1]. It might have been nice to compare against another in-memory system like Hekaton or a dataflow system like the aforementioned Naiad in order to better understand where the impressive speedup comes from, since there are a lot of variables in play here.

                                                    Basically, as a systems builder, I’m having trouble teasing out what the lessons are here: is the contribution here most strongly the concurrent map data structure, the by-default view materialisation, or simply that it isn’t fsync()ing the writes to disk like the syscall’s going out of style?

                                                    [1] edit: per alynpost’s comment, it appears they did, so mea culpa.

                                                    1. 4

                                                      Hi Nathan! I’m the presenter/author. The comparison is actually with the databases running entirely in-memory, with all persistence turned off, and with transactional guarantees turned as low as we can get them. We also compare against plain memcached, which approximates a cache system with no processing overheads. The paper also has a comparison with Naiad in a distributed setting.

                                                      This talk was intended to be a high-level overview, not a deep technical dive into the underlying research of Noria, which is why you feel like it’s light on details. If you’re looking for the research contributions, I think you’d find the paper the best place to look! The paper is about the parts of Noria that are actually novel, namely partially-stateful, dynamic data-flow, as opposed to the relatively high-level ideas of materialized views :)

                                                      1. 1

                                                        Hi Jon! Thanks again for the talk and for the clarification; that seems pretty reasonable. (I didn’t mean for the post to be a criticism; sorry if it was read that way! I thought the talk was good and was curious about particulars, and didn’t feel it was light on details. Will add the paper to the reading queue!)

                                                        1. 2

                                                          Haha, not at all, I understood what you meant :) If you’re interested in these kinds of systems, I think and hope you’ll find the paper a good read. Feel free to reach out if you have more questions as you read it!

                                                      1. 4

                                                        Great to see that data get used. One question I had after skimming the paper: how much of this performance improvement do you attribute to Noria’s data-flow computation? One of the frustrations comparing these things is that MySQL has a lot of features that Noria presumably doesn’t, which makes it sort of apples-and-oranges. But I really like this model of incremental computation; I see it as a larger popular trend in programming tools.

                                                        1. 3

                                                          Hmm, I’m not entirely sure I understand the question, but I think maybe there’s a misunderstanding about how the system works. Most of the performance improvement in Noria is due to the fact that basically all SELECTs are now direct cache hits. The data-flow is just the way that we ensure that the results for those SELECT queries remain updated as the underlying data changes (e.g., as new votes are added). This design basically decouples the performance of reads and writes: reads are always fast, and the write throughput is determined by how many things the write touches; we’re nowhere near the limit of how many reads per second we could do for the lobsters queries (see the vote results). As queries get more complicated (and again, we can support all the queries in the lobsters source code), reads do not get any slower, only writes do. It is true that now the writes are slower, but they are also rarer. In the Lobsters workload at the moment, the biggest bottleneck is the write path for updates to read_ribbons, and that’s what’s preventing us from scaling beyond 5x MySQL. Sharding that write path may be the way to resolve that issue down the line.
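To illustrate the compute-on-write idea with a toy example: the sketch below keeps a per-story vote count up to date on every write, so a read is just a lookup. This is not Noria's code or API; the class and method names are invented, and real data-flow operators are far more general than a single counter.

```python
from collections import defaultdict


class VoteCountView:
    """Toy materialized view: vote totals per story, updated on write."""

    def __init__(self):
        self.votes = []                         # base table: (user_id, story_id)
        self.count_by_story = defaultdict(int)  # materialized query result

    def insert_vote(self, user_id, story_id):
        # Write path: store the row, then update the affected view entry.
        self.votes.append((user_id, story_id))
        self.count_by_story[story_id] += 1

    def read_count(self, story_id):
        # Read path: a plain lookup, no aggregation at query time.
        return self.count_by_story.get(story_id, 0)

    def recompute_count(self, story_id):
        # What a compute-on-read system does: scan and aggregate per query.
        return sum(1 for _, s in self.votes if s == story_id)
```

Both `read_count` and `recompute_count` return the same answer; the difference is that the first does constant work per read while each write does a little extra, which is the trade-off described above.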

                                                          As for feature parity, I don’t know exactly what you’re referencing? Is there a particular feature you’re worried about Noria not having that you rely on for MySQL? Not sure if it came across in the paper, but you can take unmodified applications that just use mysql client libraries and just plug’n’play them with Noria. At least that’s the idea modulo our SQL parser and query planner still not being quite as mature as MySQL’s.

                                                          1. 4

                                                            So Noria maintains materialized views, sort of like flexviews but with automatic refreshing or like pipelinedb but base data is permanent (table) rather than ephemeral (stream). Also reminds me somewhat of ksql. And since it is the database, the application doesn’t need to handle complicated and error-prone cache invalidation (e.g. in the typical MySQL + memcache scenario). Pretty neat!

                                                            I had the same question about apples-to-oranges comparison though. For example, transaction support, foreign keys, different index types, triggers, rocksdb vs innodb implications.

                                                            1. 3

                                                              Yup, you are totally right that there are features of more traditional databases that we do not yet support. This is still a research prototype, so it’s focused on the research problems first and foremost. We don’t believe any of those additional features to be fundamentally impossible in the Noria paradigm though — for example, we’re designing a scheme for adding transactions, and we believe we can do it without adding much overhead to query execution in the common case!

                                                              Some of these other features are also really optimization details. For instance, since Noria knows the application’s queries, it could automatically choose indexes that fit the query load (even though currently it only uses hash indexes). Similarly, RocksDB vs InnoDB shouldn’t matter to the application. We use RocksDB only for storing the base table data, not for storing anything else, so it’s mostly just there for persistence, and rarely affects performance.

                                                              As for foreign keys and triggers, those should be pretty easy to add, and mostly just need engineering, not research. In a sense, triggers are really just additional operators in the graph, so they’re almost a non-feature in Noria.

                                                              1. 3

                                                                You may also find the discussion on Reddit interesting.

                                                              2. 2

                                                                My question isn’t about how the system works, it’s about the breadth of MySQL, which pays a performance cost for lots of features I presume Noria doesn’t have. Multi-master setups, sharding, charset collations, many more data types, support for at least five operating systems, date and time functions, multiple storage formats, a million things. Even if Lobsters doesn’t use them, some of those are going to result in conditionals on the hot path to serving even very simple, performant queries like select * from users where id = 123 and account for some of the performance difference. I say it’s sort of an apples-and-oranges comparison because Noria and MySQL have such different featuresets - if it were possible to compile a version of MySQL that dropped support for every feature Lobsters doesn’t use, I wonder if that wouldn’t be in the neighborhood of 5x faster. I have so little intuition for it I wouldn’t be surprised at 1.01x or 20x.

                                                                Edit: ah, and after I hit post I reloaded the page to see @tobym made this point and you already responded to it. I’ll check out the reddit link. :)

                                                                1. 3

                                                                  In addition to my response to @tobym, let me try to address some of your specific concerns too. First, Noria already supports multi-machine distribution and sharding, and replication is already nearly done. Noria is also more flexible than MySQL in its data types, since it doesn’t have strict column typing. If we did apply the same schema strictness as MySQL, that would improve our performance, since we could specialize data-structures to known types. While it is true that we don’t support as many data types as MySQL, adding new ones is pretty straightforward, and we already support quite a few. Similarly, adding date and time functions should be straightforward – they are just new projection and filter operations. Noria should also run without modifications on Linux, macOS, and Windows.

                                                                  As for multiple storage formats, Noria is, in a sense, arguing that you as the developer shouldn’t have to think about that. You should tell the database what your queries are, and it should determine how best to persist and cache the data and the query results. Are there particular features associated with the storage systems that you had in mind?

                                                                  You are right though that MySQL does more than Noria does, and that that adds overheads that Noria does not have in some cases. However, most of Noria’s performance advantage comes from the model — computing on write instead of read — as opposed to implementation. MySQL fundamentally has to compute things on reads, whereas Noria does not, and with most operations being reads, that translates to speed-ups that MySQL cannot recover. It would be great to disable lots of MySQL features, but it is unlikely to change the picture much due to this fundamental design difference.

                                                                  The one exception to this is transactions: it could be that transactions are just so expensive to provide that the MySQL way is just way faster than anything you could achieve in the Noria paradigm. We don’t believe this to be the case though, as we already have a design sketch that adds transactions and strongly consistent reads to Noria while introducing nearly no overhead in the common case.

                                                            1. 2

                                                              Dude, that’s awesome! Great work! I see you published the source. Were there any patents on this, or is that not a concern?

                                                              1. 3

                                                                Thanks! We certainly haven’t filed for any patents, and I’m not aware of any patents relevant to this. As you can see from the paper’s related work section, this does build on a lot of insights from other fields and systems, but I think it nonetheless carves out its own little nook of data-flow and database research that others haven’t really explored before. As for publishing it as open-source, that has been my goal all along. I don’t personally have any desire for commercializing this, though I also think it’s something that would work very well under a Redis-like or Postgres-like model where it could be a serious open-source production system with enterprise support.

                                                              1. 5

                                                                To everyone who’s curious about the results of this, and to @pushcx in particular, the paper is now published! https://jon.tsp.io/papers/osdi18-noria.pdf

                                                                1. 1

                                                                  Thanks for publishing it. This is a really old thread, though. I suggest you submit it as its own story with authored by and a reference to this thread in the text field. Then it will get more attention.

                                                                  1. 2

                                                                    Yeah, @pushcx made the same suggestions privately. That was the plan all along, but I wanted to wait until we’d held the conference talk (which we just did!) and polished up the codebase + docs a little. Submitted it now as https://lobste.rs/s/uxgjha/noria_dynamic_partially_stateful_data !

                                                                1. 7

                                                                  There seem to be more “intermediate” level Rust posts recently. OP is a good example.

                                                                  1. 8

                                                                    It’s true, and great to see! It might be in response to the Rust 2018 roadmap, which specifically highlights:

                                                                    One of the strongest messages we’ve heard from production users, and the 2017 survey, is that people need more resources to take them from understanding Rust’s concepts to knowing how to use them effectively. The roadmap does not stipulate exactly what these resources should look like — probably there should be several kinds — but commits us as a community to putting significant work into this space, and ending the year with some solid new material.

                                                                    <plug shameless=true> For example, I’ve been doing Rust live-coding sessions where we build non-trivial (and useful) Rust crates (e.g., I recently did one on writing an asynchronous SSH client), and those have been pretty well received. It seems like we’re getting more and more Rustaceans who want to try Rust for “real things”, and that’s pretty exciting! It’s good to see the supply meeting that demand.

                                                                    1. 6

                                                                      (OP) I think it’s a virtuous cycle. I got into Rust because I read some good entry level posts that talked about how it was ready for prime time.

                                                                    1. 6

                                                                      Results here. Thanks for your patience, @jonhoo, I was traveling.

                                                                      SELECT ROUND(upvotes + downvotes, -2) AS bucket, COUNT(id), RPAD('', LN(COUNT(id)), '*') FROM stories GROUP BY bucket
                                                                      
                                                                      +--------+-----------+------------------------------+
                                                                      | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                                                                      +--------+-----------+------------------------------+
                                                                      |      0 |     40286 | ***********                  |
                                                                      |    100 |       360 | ******                       |
                                                                      |    200 |         2 | *                            |
                                                                      |    300 |         2 | *                            |
                                                                      +--------+-----------+------------------------------+
                                                                      4 rows in set (0.30 sec)
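For readers unfamiliar with the trick in these queries: `ROUND(x, -2)` rounds to the nearest hundred, and `RPAD('', LN(n), '*')` draws a bar whose length is the natural log of the count, rounded to an integer. A rough Python equivalent, assuming non-negative integer inputs (the function names are illustrative, not part of any schema):

```python
import math
from collections import Counter


def bucket(value: int) -> int:
    # Round a non-negative integer to the nearest hundred, halves rounding up.
    return (value + 50) // 100 * 100


def log_bar(count: int) -> str:
    # Bar length is ln(count) rounded to the nearest integer.
    return "*" * int(math.log(count) + 0.5)


def histogram(values):
    # One (bucket, count, bar) row per bucket, ordered by bucket.
    counts = Counter(bucket(v) for v in values)
    return [(b, n, log_bar(n)) for b, n in sorted(counts.items())]
```

With the counts above, `log_bar(40286)` gives 11 stars and `log_bar(360)` gives 6, matching the first table; a bucket with a single entry gets an empty bar because ln(1) is 0.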
                                                                      

                                                                      SELECT ROUND(comments_count, -2) AS bucket, COUNT(id), RPAD('', LN(COUNT(id)), '*') FROM stories GROUP BY bucket;

                                                                      +--------+-----------+------------------------------+
                                                                      | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                                                                      +--------+-----------+------------------------------+
                                                                      |      0 |     40482 | ***********                  |
                                                                      |    100 |       165 | *****                        |
                                                                      |    200 |         3 | *                            |
                                                                      +--------+-----------+------------------------------+
                                                                      3 rows in set (0.29 sec)
                                                                      

                                                                      SELECT ROUND(counted.cnt, -2) AS bucket, COUNT(user_id), RPAD('', LN(COUNT(user_id)), '*') FROM ( SELECT user_id, COUNT(id) AS cnt FROM votes GROUP BY user_id) AS counted GROUP BY bucket;

                                                                      +--------+----------------+-----------------------------------+
                                                                      | bucket | COUNT(user_id) | RPAD('', LN(COUNT(user_id)), '*') |
                                                                      +--------+----------------+-----------------------------------+
                                                                      |      0 |           4439 | ********                          |
                                                                      |    100 |            562 | ******                            |
                                                                      |    200 |            206 | *****                             |
                                                                      |    300 |            131 | *****                             |
                                                                      |    400 |             92 | *****                             |
                                                                      |    500 |             45 | ****                              |
                                                                      |    600 |             28 | ***                               |
                                                                      |    700 |             31 | ***                               |
                                                                      |    800 |             29 | ***                               |
                                                                      |    900 |             32 | ***                               |
                                                                      |   1000 |             14 | ***                               |
                                                                      |   1100 |             23 | ***                               |
                                                                      |   1200 |             18 | ***                               |
                                                                      |   1300 |             18 | ***                               |
                                                                      |   1400 |             13 | ***                               |
                                                                      |   1500 |             10 | **                                |
                                                                      |   1600 |              3 | *                                 |
                                                                      |   1700 |              7 | **                                |
                                                                      |   1800 |              8 | **                                |
                                                                      |   1900 |              6 | **                                |
                                                                      |   2000 |              3 | *                                 |
                                                                      |   2100 |              3 | *                                 |
                                                                      |   2200 |              4 | *                                 |
                                                                      |   2300 |              6 | **                                |
                                                                      |   2400 |              4 | *                                 |
                                                                      |   2500 |              2 | *                                 |
                                                                      |   2600 |              3 | *                                 |
                                                                      |   2700 |              2 | *                                 |
                                                                      |   2800 |              2 | *                                 |
                                                                      |   2900 |              3 | *                                 |
                                                                      |   3000 |              2 | *                                 |
                                                                      |   3100 |              3 | *                                 |
                                                                      |   3200 |              2 | *                                 |
                                                                      |   3300 |              2 | *                                 |
                                                                      |   3400 |              1 |                                   |
                                                                      |   3600 |              1 |                                   |
                                                                      |   3800 |              1 |                                   |
                                                                      |   3900 |              1 |                                   |
                                                                      |   4000 |              1 |                                   |
                                                                      |   4100 |              3 | *                                 |
                                                                      |   4200 |              1 |                                   |
                                                                      |   4300 |              6 | **                                |
                                                                      |   4400 |              1 |                                   |
                                                                      |   4500 |              2 | *                                 |
                                                                      |   4600 |              1 |                                   |
                                                                      |   4800 |              1 |                                   |
                                                                      |   5000 |              1 |                                   |
                                                                      |   5100 |              1 |                                   |
                                                                      |   5400 |              1 |                                   |
                                                                      |   5500 |              1 |                                   |
                                                                      |   5800 |              2 | *                                 |
                                                                      |   5900 |              1 |                                   |
                                                                      |   6000 |              1 |                                   |
                                                                      |   6400 |              2 | *                                 |
                                                                      |   6800 |              1 |                                   |
                                                                      |   7100 |              1 |                                   |
                                                                      |   7700 |              1 |                                   |
                                                                      |   8400 |              1 |                                   |
                                                                      |   8700 |              1 |                                   |
                                                                      |   8900 |              1 |                                   |
                                                                      |  10100 |              1 |                                   |
                                                                      |  10900 |              1 |                                   |
                                                                      |  13500 |              1 |                                   |
                                                                      |  14500 |              1 |                                   |
                                                                      |  15100 |              1 |                                   |
                                                                      +--------+----------------+-----------------------------------+
                                                                      65 rows in set (0.15 sec)
                                                                      

                                                                      nginx grep:

                                                                      This log started 2018-02-11 04:40:31 UTC. I edited some tokens and login information out, replacing with [elided].

                                                                      1133494 GET /stories/X/Y
                                                                       590907 GET /
                                                                        71829 GET /comments
                                                                        19150 GET /comments.rss
                                                                        12020 POST /comments/X/upvote
                                                                         9033 POST /stories/X/upvote
                                                                         6126 POST /comments
                                                                         4591 GET /login
                                                                         3681 GET /comments/X/reply
                                                                         1660 POST /login
                                                                         1429 GET /comments/X/edit
                                                                         1301 POST /comments/X/2.0"
                                                                         1174 POST /stories/X/save
                                                                         1019 POST /stories
                                                                          929 POST /stories/X/hide
                                                                          637 POST /comments/X/downvote
                                                                          535 GET /comments/X/2
                                                                          390 POST /stories/X/2.0"
                                                                          224 POST /comments/X/unvote
                                                                          202 GET /comments/X/3
                                                                          198 GET /comments/X/2.0"
                                                                          190 POST /stories/X/downvote
                                                                          174 GET /login/forgot_password
                                                                          150 POST /stories/X/suggest
                                                                          148 POST /stories/X/unvote
                                                                          148 GET /login/2fa
                                                                          132 POST /login/2fa_verify
                                                                          109 POST /login/reset_password
                                                                           96 GET /comments/X/4
                                                                           94 POST /stories/X/unsave
                                                                           63 POST /comments/X/delete
                                                                           55 POST /logout
                                                                           54 GET /comments/X/5
                                                                           37 POST /stories/X/unhide
                                                                           34 GET /comments.rss?token=[elided]
                                                                           29 GET /comments/X/6
                                                                           27 POST /login/set_new_password
                                                                           19 POST /comments/X/1.1"
                                                                           19 GET /comments/X/7
                                                                           19 GET /comments/X/1.1"
                                                                           12 POST /
                                                                           12 GET /comments/X/3170
                                                                           10 POST /comments/X/undelete
                                                                            9 GET /comments/X/8
                                                                            8 GET /comments/X/9
                                                                            8 GET /comments/X/10
                                                                            7 GET /logout
                                                                            5 GET /login/reset_password
                                                                            5 GET /comments/X/11
                                                                            4 GET /comments/X/
                                                                            2 POST /stories/X/1.1"
                                                                            2 POST /comments?RzVI%3D3496%20AND%201%3D1%20UNION%20ALL%20SELECT%201%2CNULL%2C%27%3Cscript%3Ealert%28%22XSS%22%29%3C%2Fscript%3E%27%2Ctable_name%20FROM%20information_schema.tables%20WHERE%202%3E1--%2F%2A%2A%2F%3B%20EXEC%20xp_cmdshell%28%27cat%20..%2F..%2F..%2Fetc%2Fpasswd%27%29%23
                                                                            2 GET /comments/X/5000
                                                                            2 GET /comments/X/12
                                                                            1 POST /stories/X/do_we_need_move_away_from_elm
                                                                            1 POST /login?utf8=%E2%9C%93&authenticity_token=[elided]&email=[elided]&password=[elided]&commit=Login&referer=https%3A%2F%2Flobste.rs%2Fs%2Fcgqz3p%2Fdo_we_need_move_away_from_elm
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /login/set_new_password?token=[elided]
                                                                            1 GET /comments/X/7000
                                                                            1 GET /comments/X/6000
                                                                            1 GET /comments/X/5700
                                                                            1 GET /comments/X/5600
                                                                            1 GET /comments/X/5570
                                                                            1 GET /comments/X/5559
                                                                            1 GET /comments/X/5558
                                                                            1 GET /comments/X/5555
                                                                            1 GET /comments/X/5550
                                                                            1 GET /comments/X/5500
                                                                            1 GET /comments/X/20000
                                                                            1 GET /comments/X/200
                                                                            1 GET /comments/X/20
                                                                            1 GET /comments/X/14
                                                                            1 GET /comments/X/13
                                                                            1 GET /comments/X/10000
                                                                            1 GET /comments.rss?token=[elided]
                                                                      
                                                                      1. 1

                                                                        This is fantastic, thank you!

                                                                        Looks like my log grepping wasn’t perfect, but we can clean up the remainder pretty easily. Feel free to edit out any lines from the nginx log with a count < 10. I posted a slightly updated command in a comment a few days ago that includes /recent and /u and makes some other fixes – if you could run that one (mostly for the /recent numbers), that’d be awesome.

                                                                        It looks like the comment and count numbers are a little smaller than I expected, which makes the distribution hard to infer from the few resulting data points. Any chance you could re-run the first two commands with ROUND(..., -1) instead (rounds to the nearest 10 instead of nearest 100)?
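
For anyone following along, here is a rough Python sketch of what the bucketed histogram computes. It is not the query that was actually run: `bucket`, `log_bar`, and `histogram` are names made up for illustration. It also assumes the bar length is LN(COUNT(id)) rounded to the nearest integer, which is consistent with the star columns in the tables in this thread.

```python
import math
from collections import Counter


def bucket(value, width=10):
    """Round a non-negative integer to the nearest multiple of `width`.

    Halves round up, which is what ROUND(value, -1) does for
    non-negative integers when width is 10.
    """
    return (value + width // 2) // width * width


def log_bar(count):
    """Bar of round(ln(count)) stars, so each extra star is roughly a factor of e."""
    return "*" * round(math.log(count))


def histogram(values, width=10):
    """Return (bucket, count, bar) rows ordered by bucket."""
    counts = Counter(bucket(v, width) for v in values)
    return [(b, n, log_bar(n)) for b, n in sorted(counts.items())]
```

Going from `width=100` to `width=10` only changes how finely the values are grouped. The bar stays log-scaled, which is why a bucket with a single row shows no stars at all (ln 1 = 0).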

                                                                        1. 2

                                                                          The updated nginx grep didn’t work. My sed (identifies itself as sed (GNU sed) 4.4) errored on the latter three instances of //. I took a stab at correcting it but couldn’t immediately spot what it should’ve been.

                                                                          And, sure, here’s the first two with the smaller buckets:

                                                                          +--------+-----------+------------------------------+
                                                                          | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                                                                          +--------+-----------+------------------------------+
                                                                          |      0 |     16724 | **********                   |
                                                                          |     10 |     16393 | **********                   |
                                                                          |     20 |      4601 | ********                     |
                                                                          |     30 |      1707 | *******                      |
                                                                          |     40 |       680 | *******                      |
                                                                          |     50 |       281 | ******                       |
                                                                          |     60 |       128 | *****                        |
                                                                          |     70 |        60 | ****                         |
                                                                          |     80 |        35 | ****                         |
                                                                          |     90 |        16 | ***                          |
                                                                          |    100 |         4 | *                            |
                                                                          |    110 |         4 | *                            |
                                                                          |    120 |        10 | **                           |
                                                                          |    130 |         1 |                              |
                                                                          |    140 |         2 | *                            |
                                                                          |    160 |         1 |                              |
                                                                          |    210 |         1 |                              |
                                                                          |    250 |         1 |                              |
                                                                          |    290 |         1 |                              |
                                                                          +--------+-----------+------------------------------+
                                                                          19 rows in set (0.32 sec)
                                                                          
                                                                          
                                                                          +--------+-----------+------------------------------+
                                                                          | bucket | COUNT(id) | RPAD('', LN(COUNT(id)), '*') |
                                                                          +--------+-----------+------------------------------+
                                                                          |      0 |     33974 | **********                   |
                                                                          |     10 |      4831 | ********                     |
                                                                          |     20 |      1029 | *******                      |
                                                                          |     30 |       401 | ******                       |
                                                                          |     40 |       193 | *****                        |
                                                                          |     50 |       103 | *****                        |
                                                                          |     60 |        50 | ****                         |
                                                                          |     70 |        24 | ***                          |
                                                                          |     80 |        18 | ***                          |
                                                                          |     90 |        10 | **                           |
                                                                          |    100 |         5 | **                           |
                                                                          |    110 |         1 |                              |
                                                                          |    120 |         5 | **                           |
                                                                          |    130 |         2 | *                            |
                                                                          |    140 |         1 |                              |
                                                                          |    150 |         2 | *                            |
                                                                          |    170 |         1 |                              |
                                                                          +--------+-----------+------------------------------+
                                                                          17 rows in set (0.31 sec)
                                                                          
                                                                          1. 1

                                                                            Huh, seems like something weird happened when I copied it. Here, try this one?

                                                                            grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password" lobsters.log \
                                                                             | sed -e 's/.*\(GET\|POST\)/\1/' \
                                                                                   -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                                                                                   -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                                                                                   -e '/^GET / s/X\/.\+/X/' \
                                                                                   -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                                                                             | awk '/^(GET|POST)/ {print $1" "$2}' \
                                                                             | grep -E ' /(stories|comments|login|logout|u/|$)' \
                                                                             | sort | uniq -c | sort -rnk1,1
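
In case the sed steps are hard to read, here is a rough Python sketch of the same normalization, step for step. It is only an illustration, not what was run on the server: `normalize` and `tally` are made-up names, and tie ordering in the final counts may differ from `sort -rnk1,1`.

```python
import re
from collections import Counter

# Same exclusions as the grep -vE at the start of the pipeline.
EXCLUDE = re.compile(
    r"assets|fetch_url_attributes|check_url_dupe|"
    r"set_new_password|reset_password|forgot_password"
)
# Same filter as the final grep -E, applied to "METHOD PATH".
KEEP = re.compile(r" /(stories|comments|login|logout|u/|$)")


def normalize(line):
    """Return 'METHOD PATH' with IDs replaced by X, or None if the line is dropped."""
    if EXCLUDE.search(line):
        return None
    # Strip everything before the (last) GET or POST.
    line = re.sub(r".*(GET|POST)", r"\1", line, count=1)
    # /s/<id> and /stories/<id> both become /stories/X.
    line = re.sub(r"/(s|stories)/[^/ ]*", "/stories/X", line, count=1)
    # /u/<name> becomes /u/X.
    line = re.sub(r"/u/[^/ ]*", "/u/X", line, count=1)
    # For GETs, drop whatever follows X/ (for example the story slug).
    if line.startswith("GET "):
        line = re.sub(r"X/.+", "X", line, count=1)
    # /comments/<id> becomes /comments/X; this runs after the GET
    # truncation, so suffixes such as /reply are kept.
    line = re.sub(r"/comments/[^/ ]*", "/comments/X", line, count=1)
    if not re.match(r"(GET|POST)", line):
        return None
    fields = line.split()
    if len(fields) < 2:
        return None
    key = fields[0] + " " + fields[1]
    return key if KEEP.search(key) else None


def tally(lines):
    """Count normalized requests, most frequent first."""
    keys = (normalize(line.rstrip("\n")) for line in lines)
    return Counter(k for k in keys if k is not None).most_common()
```

Called as `tally(open("lobsters.log"))`, this should give the same kind of table as the pipeline, as (request, count) pairs.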
                                                                            
                                                                            1. 2
                                                                              1139580 GET /stories/X
                                                                               594323 GET /
                                                                               159959 GET /u/X
                                                                                72320 GET /comments
                                                                                19287 GET /comments.rss
                                                                                12088 POST /comments/X/upvote
                                                                                 9081 POST /stories/X/upvote
                                                                                 6160 POST /comments
                                                                                 4597 GET /login
                                                                                 3704 GET /comments/X/reply
                                                                                 1663 POST /login
                                                                                 1433 GET /comments/X/edit
                                                                                 1322 POST /comments/X
                                                                                 1188 POST /stories/X/save
                                                                                 1022 POST /stories
                                                                                  932 POST /stories/X/hide
                                                                                  641 POST /comments/X/downvote
                                                                                  542 GET /comments/X/2
                                                                                  395 POST /stories/X
                                                                                  224 POST /comments/X/unvote
                                                                                  219 GET /comments/X
                                                                                  207 GET /comments/X/3
                                                                                  192 POST /stories/X/downvote
                                                                                  153 GET /login/2fa
                                                                                  150 POST /stories/X/suggest
                                                                                  149 POST /stories/X/unvote
                                                                                  137 POST /login/2fa_verify
                                                                                  101 GET /comments/X/4
                                                                                   94 POST /stories/X/unsave
                                                                                   63 POST /comments/X/delete
                                                                                   55 POST /logout
                                                                                   55 GET /comments/X/5
                                                                                   37 POST /stories/X/unhide
                                                                                   34 GET /comments.rss?token=[elided]
                                                                                   29 GET /comments/X/6
                                                                                   19 GET /comments/X/7
                                                                                   12 POST /
                                                                                   12 GET /comments/X/3170
                                                                                   10 POST /comments/X/undelete
                                                                              
                                                                              1. 1

                                                                                Oops, looks like I forgot /recent again in that query :( Could you run:

                                                                                grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password|token" lobsters.log \
                                                                                 | sed -e 's/.*\(GET\|POST\)/\1/' \
                                                                                       -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                                                                                       -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                                                                                       -e '/^GET / s/X\/.\+/X/' \
                                                                                       -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                                                                                 | awk '/^(GET|POST)/ {print $1" "$2}' \
                                                                                 | grep -E ' /(stories|comments|login|logout|u/|recent$)' \
                                                                                 | sort | uniq -c | sort -rnk1,1
                                                                                

                                                                                And also let me know the start time of the log and the time you run the query (so I have an idea of the interval)?

                                                                                The workload generator is also shaping up over here :)

                                                                                1. 2

                                                                                  Timestamp at the start is 11/Feb/2018:04:40:31 +0000, end is 27/Mar/2018:01:26:49 +0000. I’m glad to see these stats going into your work, and I look forward to seeing your finished work submitted as a story.

                                                                                  1615623 GET /stories/X
                                                                                   193915 GET /u/X
                                                                                   105680 GET /comments
                                                                                    27731 GET /comments.rss
                                                                                    23155 GET /recent
                                                                                    18235 POST /comments/X/upvote
                                                                                    13730 POST /stories/X/upvote
                                                                                     9141 POST /comments
                                                                                     7137 GET /login
                                                                                     5572 GET /comments/X/reply
                                                                                     2512 POST /login
                                                                                     2233 GET /comments/X/edit
                                                                                     2055 POST /comments/X
                                                                                     1735 POST /stories/X/save
                                                                                     1522 POST /stories
                                                                                     1422 POST /stories/X/hide
                                                                                     1159 POST /comments/X/downvote
                                                                                      801 GET /comments/X/2
                                                                                      600 POST /stories/X
                                                                                      391 POST /comments/X/unvote
                                                                                      347 POST /stories/X/downvote
                                                                                      340 GET /comments/X
                                                                                      314 GET /comments/X/3
                                                                                      261 POST /stories/X/unvote
                                                                                      212 GET /login/2fa
                                                                                      203 POST /stories/X/suggest
                                                                                      191 POST /login/2fa_verify
                                                                                      156 GET /comments/X/4
                                                                                      125 POST /stories/X/unsave
                                                                                       99 POST /comments/X/delete
                                                                                       94 POST /logout
                                                                                       79 GET /comments/X/5
                                                                                       60 POST /stories/X/unhide
                                                                                       45 GET /comments/X/6
                                                                                       30 GET /comments/X/7
                                                                                       15 POST /comments/X/undelete
                                                                                       14 GET /comments/X/8
                                                                                       13 GET /comments/X/9
                                                                                       12 GET /comments/X/3170
                                                                                       12 GET /comments/X/10
                                                                                        9 GET /comments/X/
                                                                                  
                                                                                  1. 3

                                                                                    Yeah, the workload generator is shaping up pretty nicely! FWIW, it looks like the next scaling bottleneck lobste.rs is likely to experience is Ruby at ~100x current load on an 8-core server, followed by MySQL scaling of the transactional update to the traffic stats around ~2000x current load. So still quite a bit of headroom!

                                                                                    Also, I made (yet another) stupid mistake in the query above: it doesn’t include the frontpage, which is arguably quite important, because of a missing |. Try again? O:)

                                                                                    grep -vE "assets|fetch_url_attributes|check_url_dupe|set_new_password|reset_password|forgot_password|token" lobsters.log \
                                                                                     | sed -e 's/.*\(GET\|POST\)/\1/' \
                                                                                           -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                                                                                           -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                                                                                           -e '/^GET / s/X\/.\+/X/' \
                                                                                           -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                                                                                     | awk '/^(GET|POST)/ {print $1" "$2}' \
                                                                                     | grep -E ' /(stories|comments|login|logout|u/|recent|$)' \
                                                                                     | sort | uniq -c | sort -rnk1,1
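
                                                                                    In case the sed is hard to follow, here is roughly the same pipeline as a Python sketch. It assumes one request per line, with the method and path appearing the way nginx's default access log writes them; normalize and count_endpoints are names made up for this example, not anything from the Lobsters code base.

```python
import re
from collections import Counter

# Same exclusions as the grep -vE stage.
SKIP = re.compile(
    r"assets|fetch_url_attributes|check_url_dupe|set_new_password"
    r"|reset_password|forgot_password|token"
)
# Same whitelist as the final grep -E stage; the trailing |$ keeps "GET /".
KEEP = re.compile(r" /(stories|comments|login|logout|u/|recent|$)")


def normalize(line):
    """Return e.g. 'GET /stories/X' for one log line, or None if filtered out."""
    line = line.rstrip("\n")
    if SKIP.search(line):
        return None
    # sed: drop everything before the HTTP method.
    line = re.sub(r".*(GET|POST)", r"\1", line, count=1)
    # sed: collapse story and user identifiers to X.
    line = re.sub(r"/(s|stories)/[^/ ]*", "/stories/X", line, count=1)
    line = re.sub(r"/u/[^/ ]*", "/u/X", line, count=1)
    if line.startswith("GET "):
        # sed: for GETs, drop whatever follows the X (e.g. the story slug).
        line = re.sub(r"X/.+", "X", line, count=1)
    # sed: collapse comment identifiers to X.
    line = re.sub(r"/comments/[^/ ]*", "/comments/X", line, count=1)
    # awk: keep only "<method> <path>".
    if not line.startswith(("GET", "POST")):
        return None
    fields = line.split()
    key = fields[0] + " " + (fields[1] if len(fields) > 1 else "")
    return key if KEEP.search(key) else None


def count_endpoints(lines):
    """Equivalent of sort | uniq -c: a Counter keyed by normalized endpoint."""
    return Counter(k for k in map(normalize, lines) if k is not None)
```

                                                                                    Calling count_endpoints(open("lobsters.log")).most_common() should give the same kind of listing as the shell version. Note that the comment rule swaps whatever follows /comments/ for X, so a path like /comments/page/2 comes out as GET /comments/X/2.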
                                                                                    
                                                                                    1. 2

                                                                                      Nice to hear we should have smooth sailing for some time.

                                                                                       870987 GET /
                                                                                       193915 GET /u/X
                                                                                       105680 GET /comments
                                                                                        27731 GET /comments.rss
                                                                                        23155 GET /recent
                                                                                        18235 POST /comments/X/upvote
                                                                                        13730 POST /stories/X/upvote
                                                                                         9141 POST /comments
                                                                                         7137 GET /login
                                                                                         5572 GET /comments/X/reply
                                                                                         2512 POST /login
                                                                                         2233 GET /comments/X/edit
                                                                                         2055 POST /comments/X
                                                                                         1735 POST /stories/X/save
                                                                                         1574 GET /recent/page/2
                                                                                         1522 POST /stories
                                                                                         1422 POST /stories/X/hide
                                                                                         1159 POST /comments/X/downvote
                                                                                          801 GET /comments/X/2
                                                                                          606 GET /recent/page/3
                                                                                          600 POST /stories/X
                                                                                          391 POST /comments/X/unvote
                                                                                          347 POST /stories/X/downvote
                                                                                          340 GET /comments/X
                                                                                          316 GET /recent/page/4
                                                                                          314 GET /comments/X/3
                                                                                          261 POST /stories/X/unvote
                                                                                          212 GET /login/2fa
                                                                                          203 POST /stories/X/suggest
                                                                                          191 POST /login/2fa_verify
                                                                                          187 GET /recent/page/5
                                                                                          156 GET /comments/X/4
                                                                                          125 POST /stories/X/unsave
                                                                                          125 GET /recent/page/6
                                                                                          110 GET /recent/
                                                                                           99 POST /comments/X/delete
                                                                                           98 GET /recent/page/7
                                                                                           94 POST /logout
                                                                                           83 GET /recent/page/8
                                                                                           79 GET /comments/X/5
                                                                                           73 GET /recent/page/9
                                                                                           60 POST /stories/X/unhide
                                                                                           59 GET /recent/page/10
                                                                                           47 GET /recent/page/11
                                                                                           45 GET /comments/X/6
                                                                                           31 GET /recent/page/13
                                                                                           30 GET /recent/page/22
                                                                                           30 GET /comments/X/7
                                                                                           29 GET /recent/page/14
                                                                                           28 GET /recent/page/33
                                                                                           28 GET /recent/page/21
                                                                                           27 GET /recent/page/28
                                                                                           26 GET /recent/page/12
                                                                                           25 POST /
                                                                                           25 GET /recent/page/24
                                                                                           25 GET /recent/page/23
                                                                                           25 GET /recent/page/18
                                                                                           25 GET /recent/page/17
                                                                                           25 GET /recent/page/115
                                                                                           24 GET /recent/page/40
                                                                                           24 GET /recent/page/27
                                                                                           24 GET /recent/page/19
                                                                                           24 GET /recent/page/15
                                                                                           22 GET /recent/page/80
                                                                                           22 GET /recent/page/60
                                                                                           22 GET /recent/page/16
                                                                                           21 GET /recent/page/37
                                                                                           21 GET /recent/page/25
                                                                                           20 GET /recent/page/95
                                                                                           20 GET /recent/page/31
                                                                                           19 GET /recent/page/96
                                                                                           19 GET /recent/page/39
                                                                                           19 GET /recent/page/30
                                                                                           19 GET /recent/page/112
                                                                                           19 GET /recent/page/104
                                                                                           18 GET /recent/page/35
                                                                                           18 GET /recent/page/20
                                                                                           18 GET /recent/page/134
                                                                                           17 GET /recent/page/42
                                                                                           16 GET /recent/page/67
                                                                                           16 GET /recent/page/124
                                                                                           15 POST /comments/X/undelete
                                                                                           15 GET /recent/page/92
                                                                                           15 GET /recent/page/84
                                                                                           15 GET /recent/page/77
                                                                                           15 GET /recent/page/68
                                                                                           15 GET /recent/page/66
                                                                                           15 GET /recent/page/57
                                                                                           15 GET /recent/page/36
                                                                                           15 GET /recent/page/131
                                                                                           15 GET /recent/page/125
                                                                                           15 GET /recent/page/122
                                                                                           15 GET /recent/page/121
                                                                                           15 GET /recent/page/116
                                                                                           15 GET /recent/page/114
                                                                                           15 GET /recent/page/101
                                                                                           15 GET /recent/page/100
                                                                                           14 GET /recent/page/87
                                                                                           14 GET /recent/page/59
                                                                                           14 GET /recent/page/45
                                                                                           14 GET /comments/X/8
                                                                                           13 GET /recent/page/98
                                                                                           13 GET /recent/page/82
                                                                                           13 GET /recent/page/51
                                                                                           13 GET /recent/page/135
                                                                                           13 GET /recent/page/133
                                                                                           13 GET /recent/page/128
                                                                                           13 GET /recent/page/120
                                                                                           13 GET /recent/page/108
                                                                                           13 GET /comments/X/9
                                                                                           12 GET /recent/page/94
                                                                                           12 GET /recent/page/93
                                                                                           12 GET /recent/page/89
                                                                                           12 GET /recent/page/69
                                                                                           12 GET /recent/page/65
                                                                                           12 GET /recent/page/53
                                                                                           12 GET /recent/page/52
                                                                                           12 GET /recent/page/49
                                                                                           12 GET /recent/page/46
                                                                                           12 GET /recent/page/38
                                                                                           12 GET /recent/page/34
                                                                                           12 GET /recent/page/136
                                                                                           12 GET /recent/page/132
                                                                                           12 GET /recent/page/130
                                                                                           12 GET /recent/page/117
                                                                                           12 GET /recent/page/102
                                                                                           12 GET /recent/page/
                                                                                           12 GET /comments/X/3170
                                                                                           12 GET /comments/X/10
                                                                                           11 GET /recent/page/90
                                                                                           11 GET /recent/page/88
                                                                                           11 GET /recent/page/79
                                                                                           11 GET /recent/page/71
                                                                                           11 GET /recent/page/64
                                                                                           11 GET /recent/page/129
                                                                                           11 GET /recent/page/123
                                                                                           11 GET /recent/page/118
                                                                                           11 GET /recent/page/109
                                                                                           11 GET /recent/page/105
                                                                                           10 GET /recent/page/85
                                                                                           10 GET /recent/page/78
                                                                                           10 GET /recent/page/74
                                                                                           10 GET /recent/page/70
                                                                                           10 GET /recent/page/63
                                                                                           10 GET /recent/page/61
                                                                                           10 GET /recent/page/55
                                                                                           10 GET /recent/page/43
                                                                                           10 GET /recent/page/41
                                                                                           10 GET /recent/page/26
                                                                                           10 GET /recent/page/126
                                                                                           10 GET /recent/page/119
                                                                                      
                                                                                      1. 1

                                                                                        Hehe, yeah, I think you should be good ;)

                                                                                        Hmm, now there’s no entry there for GET /stories/X? Did you perhaps miss the first line when copying? Looks like the data is all from the same logging period though, so I can just manually combine them :) Thanks!

                                                                                2. 1

                                                                                  I also just realized that we’re missing the vote distribution for comments, which’d be super handy!

                                                                                  SELECT ROUND(counted.cnt, -1) AS bucket,
                                                                                         COUNT(comment_id),
                                                                                         RPAD('', LN(COUNT(comment_id)), '*')
                                                                                  FROM (
                                                                                      SELECT comment_id, COUNT(id) AS cnt
                                                                                      FROM votes WHERE comment_id IS NOT NULL
                                                                                      GROUP BY comment_id
                                                                                  ) AS counted
                                                                                  GROUP BY bucket
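
                                                                                  For anyone reading the query cold, here is a small Python sketch of what the bucketing does. ROUND(cnt, -1) rounds the per-comment vote count to the nearest ten, so bucket 0 holds comments with 1 to 4 vote rows, bucket 10 holds 5 to 14, and so on; the star bar is just a log-scale marker for the bucket size. The function names are invented for this example, and the bar rounding is my assumption about how MySQL converts the LN() value to a length.

```python
import math
from collections import Counter


def bucket(votes):
    """Mirror ROUND(votes, -1) for non-negative integers: nearest 10, halves up."""
    return ((votes + 5) // 10) * 10


def bar(n):
    """Mirror RPAD('', LN(n), '*'): one star per unit of ln(n), rounded (assumed)."""
    return "*" * int(math.floor(math.log(n) + 0.5))


def histogram(votes_per_comment):
    """Map bucket -> number of comments whose vote count rounds to that bucket."""
    counts = Counter(bucket(v) for v in votes_per_comment)
    return dict(sorted(counts.items()))
```

                                                                                  For example, histogram([1, 4, 5, 14, 15]) puts two comments in bucket 0, two in bucket 10 and one in bucket 20.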
                                                                                  
                                                                                  1. 2
                                                                                    +--------+-------------------+--------------------------------------+
                                                                                    | bucket | COUNT(comment_id) | RPAD('', LN(COUNT(comment_id)), '*') |
                                                                                    +--------+-------------------+--------------------------------------+
                                                                                    |      0 |             89907 | ***********                          |
                                                                                    |     10 |             27614 | **********                           |
                                                                                    |     20 |              2825 | ********                             |
                                                                                    |     30 |               637 | ******                               |
                                                                                    |     40 |               177 | *****                                |
                                                                                    |     50 |                61 | ****                                 |
                                                                                    |     60 |                29 | ***                                  |
                                                                                    |     70 |                 6 | **                                   |
                                                                                    |     80 |                 5 | **                                   |
                                                                                    |     90 |                 5 | **                                   |
                                                                                    |    100 |                 3 | *                                    |
                                                                                    |    150 |                 1 |                                      |
                                                                                    +--------+-------------------+--------------------------------------+
                                                                                    
                                                                        1. 18

                                                                          Hey folks,

                                                                          Jon messaged me a day or two ago. I gave him the standard answer about these sorts of inquiries: I’m happy to run queries that don’t reveal personal info like IPs, browsing, and voting or create “worst-of” leaderboards celebrating most-downvoted users/comments/stories, that sort of thing. I can’t volunteer to write queries for people, but the schema is on GitHub and the logs are MySQL and nginx, so it’s straightforward to do.

                                                                          A couple years ago jcs ran some queries for me and I wanted to continue that especially as the public stats that answered some popular questions have been gone for a while. It’s useful for transparency and because it’s just fun to investigate interesting questions. I’ve already run a few queries for folks in the chat room (the only one I can remember off the top of my head is how many total stories have been submitted; we passed 40k last month).

                                                                          I asked Jon to post publicly about this because it sounded like he had significantly more than one question he was idly curious about, and also to help spread awareness that I’ll run queries like this and to get the community’s thoughts on his queries and the general policy. I’ll add a note to the about page after this discussion.

                                                                          I’m going offline for a couple hours for a prior commitment before I’ll have a chance to run any of these, but it’ll leave plenty of time for a discussion to get started or folks to think up their own queries to post as comments.

                                                                          1. 4

                                                                            Wasn’t the concept of Differential Privacy developed for exactly this purpose: querying databases containing personal data while maintaining as much privacy as possible? Maybe this could be employed?

                                                                            1. 5

                                                                              In this particular case I don’t think the counts are actually sensitive, so it’s unclear that applying DP is even necessary. But I’ll ping Frank McSherry who’s one of the primary proponents of DP in academia nowadays and see what he thinks :) Maybe with DP we could extract what is arguably more sensitive information (e.g., by reducing the bin widths).

                                                                              1. 2

                                                                                That seems doable, but I have the strong suspicion that if I wing it I’ll screw something up and leak personal info. So hopefully Frank can chime in with some good advice.

                                                                                1. 6

                                                                                  I’m here! :D I’m writing up some text in more of a long-form “blog post” format, to try and explain what is what without the constraints of trying to fit everything in-line here. But, some high-level points:

                                                                                  1. Operationally queries one and two are pretty easy to pull off with differential privacy (the “how many votes per article” and “how many votes per user” queries). I’ve got some code that does that, and depending on the scale you could even just use it, in principle (or if you only have a SQL interface to the logs, we may need to bang on them).

                                                                                  2. The third query is possibly not complicated, depending on my understanding of it. My sed-fu is weak, but to the extent that the query asks only for the counts of pre-enumerable strings (e.g. POST /stories/X/upvote) it should be good. If the query needs to discover what strings are important (e.g. POST /stories/X/*) then there is a bit of a problem. It is still tractable, but perhaps more of a mess than you want to casually wade into.

                                                                                  3. Probably the biggest question mark is about the privacy guarantee you want to provide. I understand that you have a relatively spartan privacy policy, which is all good, but clearly you have some interest in doing right by the users with respect to their sensitive data. The most casual privacy guarantee you can give is probably “mask the presence / absence of individual votes/views”, which would be “per log-record privacy”. You might want to provide a stronger guarantee of “mask the presence / absence of entire individuals”, which could have a substantial deleterious impact on the analyses; I’m not super-clear on which guarantee you would prefer, or even the best rhetorical path to take to try and discover which one you prefer.
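
                                                                                  To make the “per log-record privacy” option in point 3 concrete, here is a minimal sketch using the Laplace mechanism (an assumption for illustration; it is not the code mentioned in point 1). Adding or removing one vote changes exactly one count by 1, so Laplace noise of scale 1/epsilon on every count masks any single vote, provided the set of articles is fixed in advance and includes the zero-count ones. The `noisy_counts` name, the `count key` input format, and the optional seed argument are all hypothetical.

```shell
# noisy_counts EPSILON [SEED]
# Reads "count key" lines on stdin (key is a single token) and prints each
# count with Laplace(1/EPSILON) noise added. SEED is optional and only there
# to make runs repeatable; 0 or unset seeds from the clock.
noisy_counts() {
  awk -v eps="$1" -v seed="${2:-0}" '
    BEGIN { if (seed) srand(seed); else srand() }

    # Inverse-CDF sample from a zero-mean Laplace distribution.
    function laplace(scale,    u, s) {
      do { u = rand() - 0.5 } while (u == -0.5)   # avoid log(0)
      s = (u < 0) ? -1 : 1
      return -scale * s * log(1 - 2 * s * u)
    }

    { printf "%.2f %s\n", $1 + laplace(1 / eps), $2 }
  '
}
```

                                                                                  Something like `noisy_counts 0.5 < votes_per_article.txt` (file name made up) would then release the noisy per-article counts. This does not give the stronger per-individual guarantee, since one user can contribute many votes.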

                                                                                  Anyhow, I’m typing things up right now and should have a post with example code, data, analyses, etc. pretty soon. At that point, it should be clearer to say “ah, well let’s just do X then” or “I didn’t realize Y; that’s fatal, unfortunately”.

                                                                                  EDIT: I’ve put up a preliminary version of a post under the idea that info sooner rather than later is more helpful. I’m afraid I got pretty excited about the first two questions and didn’t really do much about the third. The short version of the post is that one could probably release information that leads to pretty accurate distributional information about the multiplicities of votes, by articles and by users, without all of the binning. That could be handy as (elsewhere in the thread) it looks like binning coarsely kills much of the information. Take a read and I’ll iterate on the clarity of the post too.

                                                                                  1. 2

                                                                                    To follow up briefly on this: yes, it would be useful to avoid the binning so that we could feed more data to whatever regression we end up using to approximate the underlying distribution.

                                                                              2. 2

                                                                                For those who are curious, I’ve started implementing the workload generator here. It currently mostly does random requests, but once I have the statistics I’ll plug them in and it should generate more representative traffic patterns. It does require a minor [patch](https://github.com/jonhoo/trawler/blob/master/lobsters.diff) to the upstream lobste.rs codebase, but that’s mostly to enable automation.
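
                                                                                As a rough illustration of what “plugging the statistics in” could look like (this is not how the generator itself works, just a sketch under assumptions): given `uniq -c`-style lines of `count METHOD PATH`, draw requests in proportion to the observed counts. The `sample_requests` name and the seed argument are hypothetical.

```shell
# sample_requests N [SEED]
# Reads "count METHOD PATH" lines on stdin and prints N requests, each drawn
# with probability proportional to its count.
sample_requests() {
  awk -v n="$1" -v seed="${2:-1}" '
    BEGIN { srand(seed) }

    # Build a cumulative-count table, one entry per input line.
    { total += $1; cum[NR] = total; req[NR] = $2 " " $3 }

    END {
      if (total <= 0) exit
      for (i = 1; i <= n; i++) {
        r = rand() * total                 # uniform in [0, total)
        for (j = 1; cum[j] <= r; j++) ;    # first entry whose cumulative count exceeds r
        print req[j]
      }
    }
  '
}
```

                                                                                Entries with a count of 0 are never drawn, and the output order is random, so this only models the request mix, not timing or per-user behaviour.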

                                                                              1. 1

                                                                                @pushcx seems I missed /recent and /u in that list, which would be a little unfortunate. Here’s a fixed command:

                                                                                grep -vE "assets|fetch_url_attributes|check_url_dupe" lobsters.log \
                                                                                 | sed -e 's/.*\(GET\|POST\)/\1/' \
                                                                                       -e 's/\/\(s\|stories\)\/[^\/ ]*/\/stories\/X/' \
                                                                                       -e 's/\/u\/[^\/ ]*/\/u\/X/' \
                                                                                       -e '/^GET / s/X\/.\+/X/' \
                                                                                       -e 's/\/comments\/[^\/ ]*/\/comments\/X/' \
                                                                                 | awk '/^(GET|POST)/ {print $1" "$2}' \
                                                                                 | grep -E ' /(stories|comments|login|logout|u/|$)' \
                                                                                 | sort | uniq -c | sort -rnk1,1
                                                                                
                                                                                1. 4

                                                                                  From a sysadmin POV, those kinds of metrics are very interesting for monitoring purposes (I’d just add latency to them).

                                                                                  1. You can quickly see when there is an issue after a deployment (or something else), for example if the number of upvotes/hour has been 0 since your last deployment.
                                                                                  2. You can see which routes are the slowest/most used, so you can improve them.
                                                                                  3. You can see when a new deployment kills performance.
                                                                                  4. You can correlate with other metrics …
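
                                                                                  Point 1 can be sketched straight from the access log. This assumes the nginx “combined” log format (timestamp in field 4, method and path in fields 6 and 7), which may not match the real lobsters.log, and uses the `POST /stories/X/upvote` route mentioned earlier in the thread; the `upvotes_per_hour` name is made up.

```shell
# upvotes_per_hour
# Reads combined-format access log lines on stdin and prints
# "DD/Mon/YYYY:HH count" for every hour that saw at least one story upvote.
upvotes_per_hour() {
  awk '
    # $6 is the opening quote plus the method, $7 is the request path.
    $6 == "\"POST" && $7 ~ "^/stories/[^/]+/upvote$" {
      hour = substr($4, 2, 14)              # "[10/Feb/2018:13:55:36" -> "10/Feb/2018:13"
      if (!(hour in n)) order[++k] = hour   # remember first-seen order
      n[hour]++
    }
    END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }
  '
}
```

                                                                                  Hours with no upvotes simply don’t appear in the output, so a gap right after a deployment is the signal to look for.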

                                                                                  In my previous and current jobs we’ve been using Prometheus for that and it’s SUPER useful, although I reckon it’s probably overkill for just a Lobsters app.

                                                                                  1. 1

                                                                                    It’s a little bit trickier to extract latency measurements from the nginx logs + the db without additional instrumentation. I agree with you though, those kinds of numbers would be very useful for system monitoring. For my purposes, they’re not particularly interesting, as I’d use the resulting workload generator to generate load and then measure system performance.

                                                                                  1. 5

                                                                                    It’s 3 AM here so I haven’t thought super hard about the data you’re suggesting retrieving, but offhand it seems okay. I do have one clarifying question; this:

                                                                                    It could also be extremely useful to other researchers who may want to build solutions for “real” web applications.

                                                                                    seems to imply that the data might be shared. How? Will it be fully public? From what I see here I’m pretty sure I’d be fine with that, I’m just wondering :)

                                                                                    Also, as a side note, I assume you’ll submit your paper to Lobste.rs when you’re done ;)

                                                                                    Those performance results sound super interesting, checking out the links you posted is on my todo list now.

                                                                                    1. 5

                                                                                      I was imagining that the results of running these queries would be posted publicly by @pushcx, or whoever ends up running them. If you or others are concerned about that, I could just use them directly myself, but I think it’d be advantageous to just make them public :)

                                                                                      Hehe, yes, the paper will appear here when it is (hopefully) published.