Threads for simonw

  1. 12

    The lesson here sounds more like “bad protocols will make your client/server system slow and clumsy”, not “move all of your system’s code to the server.” The OP even acknowledges that GraphQL would have helped a lot. (Or alternatively something like CouchDB’s map/reduce query API.)

    I don’t really get the desire to avoid doing work on the client side. Your system includes a lot of generally-quite-fast CPUs provided for free by users, and the number of these scales 1::1 with the number of users. Why not offload work onto them from your limited and costly servers? Obviously you’re already using them for rendering, but you can move a lot of app logic there too.

    I’m guessing that the importance of network protocol/API design has been underappreciated by web devs. REST is great architecturally but if you use it as a cookie-cutter approach it’s non-optimal for app use. GraphQL seems a big improvement.

    1. 16

      Your system includes a lot of generally-quite-fast CPUs provided for free by users

      Yes, and if every site I’m visiting assumes that, then pretty quickly, I no longer have quite-fast CPUs to provide for free, as my laptop is slowly turning to slag due to the heat.

      1. 8

        Um, no. How many pages are you rendering simultaneously?

        1. 3

          I usually have over 100 tabs open at any one time, so a lot.

          1. 5

            If your browser actually keeps all those tabs live and running, and those pages are using CPU cycles while idling in the background and the browser doesn’t throttle them, I can’t help you… ¯\_(ツ)_/¯

            (Me, I use Safari.)

            1. 3

Yes, but assuming three monitors you likely have three or four windows open. That’s four active tabs; Chrome puts the rest of them to sleep.

              And even if you only use apps like the one from the article, and not the well-developed ones like the comment above suggests, it’s maybe five of them at the same time. And you’re probably not clicking frantically all over them at once.

              1. 2

                All I know is that when my computer slows to a crawl the fix that usually works is to go through and close a bunch of Firefox tabs and windows.

                1. 4

                  There is often one specific tab which for some reason is doing background work and ends up eating a lot of resources. When I find that one tab and close it my system goes back to normal. Like @zladuric says, browsers these days don’t let inactive tabs munch resources.

        2. 8

          I don’t really get the desire to avoid doing work on the client side.

          My understanding is that it’s the desire to avoid some work entirely. If you chop up the processing so that the client can do part of it, that carries its own overhead. How do you feel about this list?

          Building a page server-side:

          • Server: Receive page request
          • Server: Query db
          • Server: Render template
          • Server: Send page
          • Client: Receive page, render HTML

          Building a page client-side:

          • Server: Receive page request
          • Server: Send page (assuming JS is in-page. If it isn’t, add ‘client requests & server sends the JS’ to this list.)
          • Client: Receive page, render HTML (skeleton), interpret JS
          • Client: Request data
          • Server: Receive data request, query db
          • Server: Serialize data (usu. to JSON)
          • Server: Send data
          • Client: Receive data, deserialize data
          • Client: Build HTML
          • Client: Render HTML (content)

Compare the paper Scalability! But at what COST?, which found that the overhead of many parallel processing systems gave them a high COST (“Configuration that Outperforms a Single Thread”).
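
For concreteness, here is a minimal sketch of the two server roles in the lists above, one endpoint that ships finished HTML and one that only serves JSON for a client-side renderer. Flask, the route names and the data are illustrative assumptions, not anything from this discussion:

```python
# Sketch contrasting the two flows above. Flask and the routes are assumptions.
from flask import Flask, jsonify, render_template_string

app = Flask(__name__)
ITEMS = ["apple", "banana", "cherry"]  # stand-in for a real database query

@app.route("/items")
def items_page():
    # Server-side flow: query, render the template, send finished HTML.
    return render_template_string(
        "<ul>{% for item in items %}<li>{{ item }}</li>{% endfor %}</ul>",
        items=ITEMS,
    )

@app.route("/api/items")
def items_api():
    # Client-side flow: serialize data to JSON; the browser fetches this,
    # deserializes it, and builds/renders the HTML itself.
    return jsonify(ITEMS)
```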

          1. 4

That’s an accurate list… for the first load! One attraction of doing a lot more client-side is that after the first load, the server has the same list of actions for everything you might want to do, while the client side looks more like:

            • fetch some data
            • deserialize it
            • do an in-place rerender, often much smaller than a full page load

(Edit: on rereading your post, your summary actually covers all requests, but misses how the request, response, and client-side rerender can be much smaller this way. But credit where due!)

            That’s not even getting at how much easier it is to do slick transitions or to maintain application state correctly across page transitions. Client side JS state management takes a lot of crap and people claim solutions like these are simpler but… in practice many of the sites which use them have very annoying client side state weirdness because it’s actually hard to keep things in sync unless you do the full page reload. (Looking at you, GitHub.)

            1. 6

              When I’m browsing on mobile devices I rarely spend enough time on any single site for the performance benefits of a heavy initial load to kick in.

Most of my visits are one page long - so I often end up loading heavy SPAs when a lighter, single page optimized to load fast from an uncached blank state would have served me much better.

              1. 4

                I would acknowledge that this is possible.

But that’s almost exactly what the top comment said. People use the framework of the day for a blog, not flattening it or remixing it or whatever.

SPAs that I use are things like Twitter, where the tab is likely always there. (And on desktop I have those CPU cores.)

                It’s like saying, I only ride on trains to work, and they’re always crowded, so trains are bad. Don’t use trains if your work is 10 minutes away.

                But as said, I acknowledge that people are building apps where they should be building sites. And we suffer as the result.

                What still irks me the most are sites with a ton of JavaScript. So it’s server-rendered, it just has a bunch of client-side JavaScript that’s unused, or loading images or ads or something.

            2. 4

              You’re ignoring a bunch of constant factors. The amount of rendering to create a small change on the page is vastly smaller than that to render a whole new page. The most optimal approach is to send only the necessary data over the network to create an incremental change. That’s how native client/server apps work.

              1. 5

In theory yes, but if in practice the “most optimal approach” requires megabytes of JS code to be transmitted, parsed, and executed through 4 levels of interpreters, culminating in JIT compiling the code to native machine code, all the while performing millions of checks to make sure that this complex system doesn’t result in a weird machine that can take over your computer, then maybe sending a “whole new page” consisting of 200 kb of static HTML upon submitting a form would be more optimal.

                1. 4

In theory yes, but if in practice the “most optimal approach” requires megabytes of JS code to be transmitted, parsed, and executed through 4 levels of interpreters, culminating in JIT compiling the code to native machine code, all the while performing millions of checks to make sure that this complex system doesn’t result in a weird machine that can take over your computer

                  This is hyperbole. Sending a ‘“whole new page” of 200 kb of static HTML’ has your userspace program block on the kernel as bytes are written into some socket buffer, NIC interrupts the OS to grab these bytes, the NIC generates packets containing the data, userspace control is then handed back to the app which waits until the OS notifies it that there’s data to read, and on and on. I can do this for anything on a non-embedded computer made in the last decade.

Going into detail for dramatic effect doesn’t engage with the original argument nor does it elucidate the situation. Client-side rendering makes you pay a one-time cost in extra CPU time and potentially extra network bandwidth in exchange for lower incremental CPU and bandwidth afterwards. That’s all. Making the tradeoff wisely is what matters. If I’m loading a huge Reddit or HN thread for example, it might make more sense to load some JS on the page and have it adaptively load comments as I scroll or request more content. I’ve fetched large threads on these sites from their APIs before and they can get as large as 3-4 MB when rendered as a static HTML page. Grab four of these threads and you’re looking at 12-16 MB. If I can pay a bit more on page load then I can end up transiting a lot less bandwidth through adaptive content fetching.

                  If, on the other hand, I’m viewing a small thread with a few comments, then there’s no point paying that cost. Weighing this tradeoff is key. On a mostly-text blog where you’re generating kB of content, client-side rendering is probably silly and adds more complexity, CPU, and bandwidth for little gain. If I’m viewing a Jupyter-style notebook with many plots, it probably makes more sense for me to be able to choose which pieces of content I fetch to not fetch multiple MB of content. Most cases will probably fit between these two.

                  Exploring the tradeoffs in this space (full React-style SPA, HTMX, full SSR) can help you come to a clean solution for your usecase.

                  1. 1

                    I was talking about the additional overhead required to achieve “sending only the necessary data over the network”.

            3. 4

              I don’t really get the desire to avoid doing work on the client side.

My impression is that it is largely (1) to avoid the JavaScript ecosystem and/or* (2) to avoid splitting app logic in half/duplicating app logic. Ultimately, your validation needs to exist on the server too because you can’t trust clients. As a rule of thumb, SSR then makes more sense when you have lower interactivity and not much more logic than validation. CSR makes sense when you have high interactivity and substantial app logic beyond validation.

              But I’m a thoroughly backend guy so take everything that I say with a grain of salt.


              Edit: added a /or. Thought about making the change right after I posted the comment, but was lazy.

              1. 8

                (2) avoid splitting app logic in half/duplicating app logic.

This is really the core issue.

                For a small team, a SPA increases the amount of work because you have a backend with whatever models and then the frontend has to connect to that backend and redo the models in a way that makes sense to it. GraphQL is an attempt to cut down on how much work this is, but it’s always going to be some amount of work compared to just creating a context dictionary in your controller that you pass to the HTML renderer.

                However, for a team that is big enough to have separate frontend and backend teams, using a SPA decreases the amount of communication necessary between the frontend and backend teams (especially if using GraphQL), so even though there’s more work overall, it can be done at a higher throughput since there’s less stalling during cross team communication.

                There’s a problem with MPAs that they end up duplicating logic if something can be done either on the frontend or the backend (say you’ve got some element that can either be loaded upfront or dynamically, and you need templates to cover both scenarios). If the site is mostly static (a “page”) then the duplication cost might be fairly low, but if the page is mostly dynamic (an “app”), the duplication cost can be huge. The next generation of MPAs try to solve the duplication problem by using websockets to send the rendered partials over the wire as HTML, but this has the problem that you have to talk to the server to do anything, and that round trip isn’t free.

                The next generation of JS frameworks are trying to reduce the amount of duplication necessary to write code that works on either the backend or the frontend, but I’m not sure they’ve cracked the nut yet.

                1. 4

                  For a small team, a SPA increases the amount of work because you have a backend with whatever models and then the frontend has to connect to that backend and redo the models in a way that makes sense to it

                  Whether this is true depends on whether the web app is a client for your service or the client for your service. The big advantage of the split architecture is that it gives you a UI-agnostic web service where your web app is a single front end for that service.

                  If you never anticipate needing to provide any non-web clients to your service then this abstraction has a cost but little benefit. If you are a small team with short timelines that doesn’t need other clients for the service yet then it is cost now for benefit later, where the cost may end up being larger than the cost of refactoring to add abstractions later once the design is more stable.

                  1. 1

                    If you have an app and a website as a small team, lol, why do you hate yourself?

                    1. 4

                      The second client might not be an app, it may be some other service that is consuming your API.

                2. 4

                  (2) avoid splitting app logic in half/duplicating app logic.

The other thing is to avoid duplicating application state. I’m also thoroughly a backend guy, but I’m led to understand that the difficulty of maintaining client-side application state was what led to the huge proliferation of SPA frameworks. But maintaining server-side application state is easy, and if you’re doing a pure server-side app, you expose state to the client through hypertext (HATEOAS). What these low-JS frameworks do is let you keep that principle — that the server state is always delivered to the client as hypertext — while providing more interactivity than a traditional server-side app.

                  (I agree that there are use-cases where a more thoroughly client-side implementation is needed, like games or graphics editors, or what have you.)

                  1. 1

                    Well, there’s a difference between controller-level validation and model-level validation. One is about not fucking up by sending invalid data, the other is about not fucking up by receiving invalid data. Both are important.

                  2. 4

                    Spot on.

                    this turns out to be tens (sometimes hundreds!) of requests because the general API is very normalized (yes we were discussing GraphQL at this point)

                    There’s nothing about REST I’ve ever heard of that says that resources have to be represented as separate, highly normalized SQL records, just as GraphQL is not uniquely qualified to stitch together multiple database records into the same JSON objects. GraphQL is great at other things like allowing clients to cherry-pick a single query that returns a lot of data, but even that requires that the resolver be optimized so that it doesn’t have to join or query tables for data that wasn’t requested.

                    The conclusion, which can be summed up as, “Shell art is over,” is an overgeneralized aesthetic statement that doesn’t follow from the premises. Even if the trade-offs between design choices were weighed fully (which they weren’t), a fundamentally flawed implementation of one makes it a straw man argument.

                    1. 1

                      The Twitter app used to lag like hell on my old Thinkpad T450. At the very least, it’d kick my fan into overdrive.

                      1. 1

                        Yay for badly written apps :-p

                        Safari will notice when a page in the background is hogging the CPU, and either throttle or pause it after a while. It puts up a modal dialog on the tab telling you and letting you resume it. Hopefully it sends an email to the developer too (ha!)

                    1. 2

Dumb question, but why can’t we do something about the GIL if it hurts parallelism? Maybe an option to remove/disable it? I think it must’ve been done somewhere.

                      1. 14

One reason it is hard technologically is that, at the moment, any operation that involves only a single Python bytecode op, OR any call into a C extension which doesn’t release the GIL or re-enter the Python interpreter, is atomic. (Re-entering the Python interpreter may release the GIL.)

                        This means all kinds of things are atomic operations in Python. Like dict reads/writes and list.append(), either of which may call malloc or realloc in the middle.

                        You can write many data race-y programs in Python that have well-defined (messy, but still well defined) semantics. I think nobody in the world has an idea of how much code there might be in the wild that (possibly accidentally) abuses this. So making data races be undefined behaviour would be quite a large backwards compatibility break, in my opinion.
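
A small illustration of that kind of accidental reliance (a generic CPython sketch, nothing from this thread): list.append mutates the list inside a single C-level call and is effectively atomic under the GIL, while n += 1 compiles to several bytecodes and can lose updates even with the GIL held.

```python
# Sketch: list.append is effectively atomic under the GIL; "counter += 1" is not,
# because it is several bytecode ops (load, add, store) and a thread switch can
# land in between them.
import threading

results = []
counter = 0

def work():
    global counter
    for _ in range(100_000):
        results.append(1)   # mutation happens inside one C call: effectively atomic
        counter += 1        # read-modify-write: updates can be lost

threads = [threading.Thread(target=work) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # always 800000
print(counter)       # usually less than 800000 (exact behaviour varies by interpreter)
```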

You don’t want to “just” slap a mutex on every object because then the lock/unlock calls would kill performance.

                        I believe the PyPy developers are/were looking at shipping an STM implementation and the GILectomy fork involves a lot of cleverness of which I can remember no details.

                        1. 6

                          There have been (more than) a few experiments to remove the GIL in the past 20 years. To my knowledge they end up performing worse or being less safe.

                          There’s a new PEP to get a more granular GIL.

                          1. 10

                            There is an exciting new approach by Sam Gross (https://github.com/colesbury) who has made an extremely good NOGIL version of Python 3.9 (https://github.com/colesbury/nogil) It performs almost without any overhead on my 24 core MacPro test machine.

It is a sensational piece of work, especially as you mention there have been so many other experiments. I know Sam has been approached by the PSF. I am crossing my fingers and hoping they will merge his code.

                            1. 9

                              I’ve been struggling with a Python performance issue today that I suspected might relate to the GIL.

                              Your comment here inspired me to try running my code against that nogil fork… and it worked! It fixed my problem! I’m stunned at how far along it is.

                              Details here: https://simonwillison.net/2022/Apr/29/nogil/

                            2. 6

                              They tend to perform worse on single threaded workloads. Probably not all, but I’m quite sure that several attempts, even rather naive ones, produced multi-threaded speed ups, but at the cost of being slower when running on a single thread.

Even ideas that succeeded in improving multi-threaded performance got shot down because the core team believes this (slower single-core for faster multi-core) is not an acceptable trade-off.

                              1. 4

IIRC the position was taken fairly early on by Guido that proposals to remove the GIL would not be accepted if they imposed slowdowns on single-threaded Python on the order of… I think a cutoff of about 5% or 10% might have been suggested?

                                1. 1

                                  That’s kind of what I remember too.

                            3. 4

                              There are experiments underway, e.g. https://lukasz.langa.pl/5d044f91-49c1-4170-aed1-62b6763e6ad0/, and there have been previous attempts that failed.

                              1. 3

Because allegedly the gain in safety is greater than the gain in concurrency efficiency.

                                It is a reliable, albeit heavy handed, way of ensuring simple threaded code generally works without headaches. But yes, it does so by eroding the gains of multithreading to the point of questioning if it should exist at all. Arguably.

Some async libraries mimic the threading API while resorting to lower-level async primitives. Eventlet and gevent come to mind.

                                1. 2

                                  No, it’s about performance and a little bit about compatibility.

Most Python programs are single-threaded, and removing the GIL would not cause most of those to want to become multi-threaded, since the average Python program’s workload is not something that benefits from being multi-threaded. And basically every GIL removal attempt has caused a performance regression for single-threaded Python programs. This has been declared unacceptable.

                                  Secondarily, there would be a compatibility issue for things which relied on the GIL and can’t handle having the acquire/release turned into no-ops, but the performance issue is the big one.

                                  1. 2

                                    And basically every GIL removal attempt has caused performance regression for single-threaded Python programs. This has been declared unacceptable.

                                    Why does this happen?

                                    1. 5

                                      Most of the time when a GIL removal slows down single-threaded code, it’s because of the GC. Right now Python has a reference-counting GC that relies on the GIL to make incref/decref effectively atomic. Without a GIL they would have to be replaced by more cumbersome actually-atomic operations, and those operations would have to be used all the time, even in single-threaded programs.

                                      Swapping for another form of GC is also difficult because of the amount of existing extension code in C that already is built for the current reference-counting Python GC.
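
To see how pervasive those reference-count updates are, here is a tiny illustration with plain CPython (the exact numbers vary between interpreter versions):

```python
# Every binding, argument pass, and container insertion bumps a refcount;
# making each of those bumps an atomic operation is what costs single-threaded speed.
import sys

obj = object()
print(sys.getrefcount(obj))   # e.g. 2: "obj" plus the temporary argument reference

alias = obj                   # one more reference
container = [obj, obj]        # two more references
print(sys.getrefcount(obj))   # e.g. 5
```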

                                2. 2

                                  Because significant portions of the Python ecosystem are built with a GIL in mind, and would probably break the moment that GIL is removed. You’d essentially end up with another case of Python 2 vs Python 3, except now it’s a lot more difficult to change/debug everything.

                                  1. 2

                                    A heavy-handed approach is to use multiprocessing instead of multithreading. Then each subprocess gets its own independent GIL, although that creates a new problem of communicating across process boundaries.
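
For CPU-bound work that pattern can look something like this (a generic sketch, not tied to anything in the thread); each worker process has its own interpreter and its own GIL, and arguments and results are pickled across the process boundary, which is the communication cost mentioned above:

```python
# Sketch: sidestep the GIL with processes instead of threads.
from multiprocessing import Pool

def cpu_bound(n: int) -> int:
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(cpu_bound, [1_000_000] * 8))
```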

                                  1. 5

                                    I wonder if they were aware of systemd’s existing dash encoding scheme.

                                      1. 1

                                        That’s it! Or at least that’s a tool to work with them.

                                        It’s domain specific, but for paths, having to escape dashes is a real pain in the bum :-)

                                    1. 3

Funny thing, the only time we tried this, ca. 2014-2015 for OpenStack, we simply didn’t find anyone. IIRC the offer was pretty open-ended, like “we feel there are a few bugs or edge-cases in this component, or things that are simply undocumented, and we’ve been poking at it, but we’re a bit lost now. Anyone got an hour or two of paid support or consulting?” No takers, but I think in the end someone could point us in the right direction via some logfiles and descriptions anyway. But if your org is willing to do that, yes, worth a try.

I do think it really depends on the project. OpenStack at that time was very much one of these mostly corporate-backed open source projects, where folks worked on it during their work hours. Then there’s also the tax thing; it’s kind of a meme by now, but I know enough people of several nationalities who’d say “I prefer helping you for free for a few hours instead of complicating my taxes twofold by taking a sum less than 1000 EUR for consulting”. Sad as it is.

                                      1. 8

                                        That feels a bit different to me.

If you email me and say “I will pay you $XX money for a one hour commitment via Zoom giving a short talk and answering questions from our engineers” it’s very easy for me to say yes.

                                        If you put out a call for pitches for paid support the terms of the engagement are a lot more open ended - now I have to think about negotiating a rate, and having an introductory conversation/interview to see if this is a good fit for both sides, and then doing custom work, and tracking hours and suchlike.

On the surface an hour-long Zoom Q&A and a short consultancy/support gig may seem similar, but my hunch is that there is a meaningful difference between them in terms of friction.

                                        1. 2

                                          Agreed. Additionally, there is a status difference – the distinction between “expert” and “hired laborer”:

                                          “This company paid me $500 for me to be the speaker of honor in front of their devs for an hour”

                                          vs

                                          “This company paid me $500 to do janitorial cleanup work in their codebase”

                                          These are the extreme versions, but I think this difference is important, if only unconsciously.

                                          1. 4

Especially if you also have a day job that prohibits you from working for other companies. Giving a talk would likely fly with your day-job boss; moonlighting probably not.

                                            1. 1

                                              “This company paid me $500 to do janitorial cleanup work in their codebase”

                                              If it’s for the same hour I’d take it. That’s quite a bit more than I’m making currently :)

                                          2. 2

                                            The tax thing probably depends on the country. In the UK, consulting income is easy to add if you are already completing a self-assessment form, but if your only income is from a salary then you don’t need to do this. And once you’ve completed a self-assessment form one year, it’s incredibly hard to stop (though last year they finally implemented a feature that I asked for 10 years ago and now pre-populate the form with the information that your employer provided to them, so you don’t need to copy things from pieces of papers into a web form). If you’re already doing some consulting, it’s trivial.

                                            The bigger problem may be non-compete clauses. If you’re an employee (this situation where taxes would be a problem) then your employer may not like you going and giving training courses to a potential competitor.

                                            1. 2

                                              I wonder if you could elaborate on a couple of your points:

                                              once you’ve completed a self-assessment form one year, it’s incredibly hard to stop

                                              I’m not sure what you mean here. As I understand it there are specific conditions under which you need to file a self assessment. If you don’t meet those conditions you don’t need to file (though you still can), regardless of what happened last year (although you might have to tell them you’re not filing).

                                              they finally implemented a feature that I asked for 10 years ago and now pre-populate the form with the information that your employer provided to them

                                              Interesting. I noticed this when I filed online for my latest return. If this didn’t exist before and they added it because you asked, then thanks, that’s very helpful!

                                              1. 2

                                                If you don’t meet those conditions you don’t need to file (though you still can), regardless of what happened last year (although you might have to tell them you’re not filing).

                                                The rules might have changed, it was quite a few years since I checked. Originally if you did SA once you had no way off. I think then if you filed SA without anything other than PAYE income for a few years then you could stop. You might be able to stop more easily now.

                                                Interesting. I noticed this when I filed online for my latest return. If this didn’t exist before and they added it because you asked, then thanks, that’s very helpful!

                                                Given that it took ten years, I doubt very much that there was a causal relationship between my complaint and their implementing the feature.

                                            2. 1

                                              In the Netherlands at least, there’s a rule that if you’re not considered to be an entrepreneur by the tax agency, you don’t have to file separate taxes. You just list it as “extra income” in the same part of your form where your regular employment income goes and you’re done.

                                              But the rules are super murky and it’s very unclear when exactly you qualify and when you don’t. And there’s no feedback, only fines when you accidentally/unknowingly break the rules. This is something that definitely intimidated me when I just got started with freelancing.

                                            1. 13

                                              tl;dr security is hard, authorization is hard, authentication is hard, authenticating non-humans is hard. Don’t despair. Use least-privileges, apply policies on both clients and resources, monitor usage of credentials, if you are forced to use hard-coded credentials let them only be able to call sts:AssumeRole and rotate them frequently.

                                              There’s a lot of discussion here about hard-coding AWS credentials into files, and how to authenticate and authorize non-human tools for AWS access. Here are my two cents.

                                              Firstly when it comes to AWS resources there are always 2 sides to the coin - the client and the resource. Your blog post talks about how to authorize a client to access an S3 bucket, but never mentions the IAM resource policy on the S3 bucket itself. This is because when you call boto3’s create_bucket function, which calls S3’s CreateBucket API, for backwards-compatibility reasons it creates a publicly-readable S3 bucket, which is why the news is full of “omg the cloud is insecure everyone’s personal information got leaked”. Please avoid creating publicly-readable S3 buckets; the console makes it extremely difficult to do this any more, but the API is just an API and will let you do it.

AWS users who use S3 buckets should enable and run the Access Analyzer for S3, identify publicly accessible S3 buckets in their account, and take appropriate action. They may just be e.g. static websites, which is fine, or may store personally identifiable information (PII), which is not fine. Also consider whether your data should be encrypted at rest, with a KMS key that you own and control, or even a straight-up symmetric key you own. This is again defense-in-depth in case someone gets access to the S3 bucket.
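
A hedged boto3 sketch of that kind of hardening; the bucket name and KMS key ARN are made-up placeholders, and this is illustrative rather than a complete checklist:

```python
# Sketch: block public access and set default KMS encryption on a bucket.
# Bucket name and KMS key ARN are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_public_access_block(
    Bucket="my-private-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

s3.put_bucket_encryption(
    Bucket="my-private-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example",
                }
            }
        ]
    },
)
```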

Secondly, if your goal is “give a user access to a particular bucket/prefix and only be able to use a subset of APIs”, instead of creating credentials you could use S3 access points. An S3 access point creates a completely new DNS endpoint https://[access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com and when users hit this DNS endpoint to perform operations, S3 enforces policies like “You can only call GetObject and PutObject” on your behalf. This is an easy way to enforce IAM resource policies on the S3 side instead, and you can create multiple S3 access points for a single bucket (unlike an S3 bucket policy).
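
A rough boto3 sketch of that approach; the access point name, bucket, account ID, role, and policy are all illustrative assumptions:

```python
# Sketch: create an S3 access point and attach a policy that only allows
# GetObject/PutObject under a given prefix. All identifiers are hypothetical.
import json
import boto3

s3control = boto3.client("s3control")
account_id = "123456789012"

s3control.create_access_point(
    AccountId=account_id,
    Name="backups-ap",
    Bucket="my-backup-bucket",
)

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/backup-writer"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:us-east-1:{account_id}:accesspoint/backups-ap/object/backups/*",
        }
    ],
}

s3control.put_access_point_policy(
    AccountId=account_id,
    Name="backups-ap",
    Policy=json.dumps(policy),
)
```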

                                              That just simplifies the authorization story, what am I allowed to do. But authentication, who am I, is always tricky. This DNS endpoint for an S3 access point is not a secret nor should it be, anyone can call it. If I am a human I can call e.g. AWS STS AssumeRoleWithWebIdentity, assume some role, and then on the S3 access point only allow access from that role. OK. But how do I know e.g. the backup script that is running nightly on your VPS is the backup script?

This is simple to answer for e.g. an AWS EC2 instance. If you attach an IAM role to an AWS EC2 instance via an instance profile, you delegate this problem to AWS. But this doesn’t help you: you want to access an S3 bucket from a VPS host. That’s fine, but then you need to solve the problem on the VPS host.

                                              • If you hard-code credentials on the VPS host that e.g. gives sts:AssumeRole permission for this new role, what happens if you accidentally version-control those credentials or the VPS provider steals those credentials? Maybe this is OK if you rotate these hard-coded credentials regularly and monitor AWS CloudTrail logs for who uses the credentials.
• You could enforce that only clients from certain IP ranges are allowed to assume a role, but this is not a sufficient control.
                                              • Aha! Maybe an EC2 instance has permissions and you SSH to it before uploading files or something. But how do you provision and control access to the SSH private/public key pair?
                                              • Set up some secure service that you call over HTTP to get temporary credentials. But how do you authenticate your script with the secure service?

                                              It’s at this point that people throw up their hands and say “Authentication is hard! I’ll just put the credentials in a file and set a calendar reminder to rotate them once every 30 days, and set up some tools to analyze CloudTrail access logs”. For your threat model this may be fine.

                                              But maybe you can dive deeper into your threat model and think…hmmm. I’m backing up data from a VPS to S3. Surely this can be an append-only backup, and I only need to grant s3:PutObject permission to some role that the VPS can assume using STS. That way if someone steals the credentials the worst they can do is put more data in. I’d have to pay for it which sucks but they can’t read my data. Hmm, they could overwrite it? But I can set up object locks to prevent overwrites. etc.
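
As a concrete sketch of that append-only idea with boto3 (the bucket, user, and policy names are hypothetical placeholders):

```python
# Sketch: grant a backup identity s3:PutObject on one bucket and nothing else.
# Bucket, user, and policy names are hypothetical placeholders.
import json
import boto3

iam = boto3.client("iam")

write_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-backup-bucket/*",
        }
    ],
}

iam.put_user_policy(
    UserName="vps-backup-writer",
    PolicyName="write-only-backups",
    PolicyDocument=json.dumps(write_only_policy),
)
```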

                                              1. 2

                                                This is because when you call boto3’s create_bucket function, which calls S3’s CreateBucket API, for backwards-compatibility reasons it creates a publicly-readable S3 bucket, which is why the news is full of “omg the cloud is insecure everyone’s personal information got leaked”.

                                                In the code I’m using s3.create_bucket(...) without any extra options - https://github.com/simonw/s3-credentials/blob/0.3/s3_credentials/cli.py#L92-L100 - and, as far as I can tell, the resulting buckets are not public. I just tried creating one, uploaded a file to it and then attempted to access the file by URL and got a permission error: https://simonw-test-bucket-is-this-public.s3.amazonaws.com/yourfilename.csv - am I missing something here? I definitely don’t want to be creating public buckets by default!

                                                If you hard-code credentials on the VPS host that e.g. gives sts:AssumeRole permission for this new role, what happens if you accidentally version-control those credentials or the VPS provider steals those credentials?

                                                This is why I wanted separate per-bucket credentials in the first place: I want to minimize the damage someone could do with stolen credentials should they access them. Much rather my VPS provider steals credentials for a single bucket than for my entire account!

                                                But maybe you can dive deeper into your threat model and think…hmmm. I’m backing up data from a VPS to S3. Surely this can be an append-only backup, and I only need to grant s3:PutObject permission to some role that the VPS can assume using STS.

                                                That’s essentially what my s3-credentials create name-of-bucket --write-only option does - it creates a brand new user and applies this inline policy to them so that they can only write (with PutObject) to the specified bucket: https://github.com/simonw/s3-credentials/blob/0.3/s3_credentials/policies.py#L38-L48

                                                Object locks are interesting - I hadn’t seen those! I like the idea of using them to prevent leaked write-only credentials from being used to over-write previously written paths.

                                                Secondly, if your goal is “give a user access to a particular bucket/prefix and only be able to use a subset of APIs”, instead of creating credentials you could use S3 access points.

                                                Whoa, I had not seen those before - looks like they were only added in 2019. Shall investigate, thank you!

                                                I’d love to provide a link in the README to material people can read that has solid, easy-to-follow recommendations for the absolute best practices on this kind of stuff- but I’ve been hoping to run into useful information like that for years, and the best material still seems to show up in comments like this one!

                                                1. 1

                                                  am I missing something here? I definitely don’t want to be creating public buckets by default!

                                                  No, then I am wrong, when you call CreateBucket the bucket is not publicly-readable by default.

                                                  I’d love to provide a link in the README to material people can read that has solid, easy-to-follow recommendations for the absolute best practices on this kind of stuff- but I’ve been hoping to run into useful information like that for years, and the best material still seems to show up in comments like this one!

                                                  Maybe a blog post idea for me! It’s always tough putting your neck out giving prescriptive best practice advice because you may be wrong, or things may change. Also it’s a broad topic, it’s hard to focus and come up with narrow objectives.

                                                2. 1

                                                  As someone who spent a lot of this year on compliance audits and thinking about “least privilege in AWS”, this is such a detailed but also clear write up on this topic. Thank you for taking the time to write and post it.

                                                  PS: at first I wondered “who are you, who are so wise in the ways of science?”, but then I checked and saw you work at AWS, https://asim.ihsan.io/about/, and went 💡

                                                  1. 1

                                                    You’re welcome!

                                                    My views don’t represent AWS. And maybe I’m missing something obvious about authenticating non-human tools. Please correct or enlighten me, I’m always learning.

                                                  2. 1

                                                    This is a fantastically useful comment, thank you!

                                                    Do you mind if I quote bits of it in this issue thread? https://github.com/simonw/s3-credentials/issues/7

                                                    1. 2

Yes, you can quote parts of it, just add “These views don’t represent AWS” at the end.

                                                  1. 6

                                                    A word of caution: embedding the secrets in your apps is a big NO NO in aws land. I am afraid this solution goes exactly against what you are supposed to do, in the real case:

                                                    • you’ll create a role to access the bucket
• your app will call the STS (Security Token Service) assume_role_with_web_identity or similar (after successful authentication; you can use OpenID or SAML, or other federation if that’s your thing).
                                                    • THAT will give you a set of tokens you can use to deal with the bucket.

Note that Amplify is likely the easiest way to deal with this currently.
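
A minimal boto3 sketch of that flow; the role ARN, token source, bucket, and file names are assumptions, and in practice the web identity token comes from your OIDC/SAML provider:

```python
# Sketch: exchange a web identity token for temporary credentials via STS,
# then use those short-lived credentials to talk to the bucket.
# Role ARN, token file, and bucket are hypothetical placeholders.
import boto3

sts = boto3.client("sts")
response = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/bucket-access-role",
    RoleSessionName="my-app-session",
    WebIdentityToken=open("web_identity_token.txt").read(),
)

creds = response["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("backup.tar.gz", "my-backup-bucket", "backups/backup.tar.gz")
```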

                                                    1. 4

                                                      came here to post basically the same thing - storing credentials is the wrong approach, assume a role that’s scoped to exactly what it needs and leave secret issuance to STS.

                                                      1. 4

                                                        I don’t understand how I can build my projects against this. If I’m going to call assume role I need to have credentials that let me call that, right? So something needs to be stored somewhere.

                                                        Here are some examples of things I have built or want to build with S3:

                                                        • A backup script that runs nightly on a VPS via cron and sends data to S3. I want to set this up once and forget about it.
                                                        • A GitHub Actions workflow that runs on every commit, builds an asset of some sort and stores that in S3. This needs to work from stable credentials that are stored in GitHub’s secrets mechanism.
                                                        • A stateless web application deployed to Vercel that needs to be able to access assets in a private S3 bucket.
                                                        • Logging configuration: I want to use a tool like Papertrail and give it the ability to write gathered logs to an S3 bucket that I own

None of these cases feature an authenticated user session of any type - they all require me to create long-lived credentials that I store in secrets.

                                                        Can I use assume role for these? If so, how?

                                                        1. 1

                                                          For GitHub Actions you can now use OIDC to assume a role rather than long-lived credentials: https://docs.github.com/en/actions/deployment/security-hardening-your-deployments

                                                          1. 1

That does look like a good option for GitHub Actions - where my code is running in a context that has an authenticated session I can exchange for another token - but it doesn’t help for cron scripts or anything where I want my code to run in a situation that doesn’t have access to credentials that can be exchanged in that way.

Confession: I’ve read that GitHub documentation on OIDC a couple of times now and it gives me the impression that actually implementing it would take me the best part of a day to figure out (mostly at the AWS end) - it seems really hard! I wish it wasn’t.

                                                        2. 1

                                                          There was an article that went around the other week about using AWS IoT to get temporary credentials for machines in a home lab: https://ideas.offby1.net/posts/automating-letsencrypt-route53-using-aws-iot.html

                                                        1. 2

                                                          I looked at the source code (before even reading the post) and I was extremely happy: more code should be written like this. Granted it’s a simple tool, but the code was simple and to the point. Most of my difficulties in understanding it came from not knowing much about S3 bucket credentials; I was never confused because the author used too many abstractions or tried to be clever in how they wrote the code. Congrats!

                                                          1. 1

Thanks! I code for myself in six months’ time, assuming I’ve been working on other things and have forgotten every detail of the current project.

                                                          1. 2

                                                            Does it have configurable endpoints? I’m using wasabi s3, and might try this tool out.

                                                            1. 2

                                                              It doesn’t but it could do - adding that to the ticket: https://github.com/simonw/s3-credentials/issues/2#issuecomment-959554514

                                                              1. 1

                                                                Nice, thanks!

                                                            1. 1

                                                              I’m reading more and more about different use cases for datasette, which is what the author is using this for, mostly.

On the one hand, I want to install it and play with my data. I’m kind of busy though, and I can’t find the time to integrate datasette with my existing workflows. For example, I don’t want to deal with transferring my photos from Google Photos to S3.

                                                              On the other hand, I also want to just start small and make my own little set of personal data mining tools scoped precisely to myself. But that would take even more time.

                                                              In the end I just never dig into it in any way and use the precious little free time to read a book.

                                                              1. 2

                                                                This is a constant challenge for me - I’m really keen on driving down the friction involved in trying out Datasette, because I have exactly the same problem exploring other projects myself.

                                                                The biggest improvements I’ve made here are the following:

                                                                I’m always looking for new ways I can make getting started easier! I have a beta version of a SaaS hosted version too, which I’m currently making some major upgrades to.

                                                                Would definitely welcome suggestions!

                                                              1. 3

                                                                There’s an absolutely brilliant hack in here.

                                                                Part of the goal is to allow existing poorly written third-party scripts - analytics stuff for example - to run in a sandbox in a web worker rather than jamming up the main thread.

                                                                Problem is, those scripts need to access the page DOM API.

                                                                Partytown uses a devious trick where a proxy object provides getters and setters which then make blocking API calls to a separate service worker, using the mostly-forgotten xhr.open(…, false) parameter.

                                                                It’s so clever!

                                                                1. 2

                                                                  Neat! Any reason not to put it in the app store so you don’t have to worry about updates etc?

                                                                  1. 2

                                                                    Great question! I haven’t even started looking into the feasibility of that yet. The app does some pretty weird things in order to support installable plugins, and I’m not even sure where to start on checking if they are App Store compatible or not.

                                                                    1. 2

                                                                      if it can (and does) run sandboxed and does not call any private APIs (maybe electron does internally? I don’t know), it’ll be fine.

                                                                      1. 2

                                                                        I need to fully understand what “runs sandboxed” means I think. I have it signed and notarized, but do I also need to opt into the “com.apple.security.app-sandbox” entitlement? I haven’t done that yet.

                                                                        I have a few research notes on the sandbox here: https://github.com/simonw/datasette-app/issues/31

                                                                        1. 3

                                                                          Yes. You have to opt into the sandbox. If it still runs fine (activity monitor has a column to indicate whether the process is sandboxed or not) and if electron doesn’t use any private API, you’re good

                                                                  1. 12

                                                                    I wrote up a bunch more details of how I bundled a full copy of Python inside my Electron app here: https://til.simonwillison.net/electron/python-inside-electron

                                                                    1. 9

                                                                      I’ve made a ton of progress since I wrote this post on Sunday. Most crucially, I’ve settled on a way of bundling Python that looks like it might work well: I’m now including a full Python 3.9 (using the standalone Python project) inside the macOS application bundle, then on first startup of the app I use that to create a virtual environment in ~/.datasette-app/venv into which I then install the Python web application.

Crucially, this let me build an “install plugin” menu item which can install additional code into that virtual environment.
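
A rough sketch of that first-run bootstrap pattern; the paths and package names are illustrative assumptions, not the app’s actual code:

```python
# Sketch of the first-run pattern described above: create a virtualenv with the
# bundled Python, then pip-install the web app (and, later, plugins) into it.
# Paths and package names are hypothetical, not datasette-app's actual code.
import subprocess
import sys
from pathlib import Path

venv_dir = Path.home() / ".datasette-app" / "venv"

if not venv_dir.exists():
    # sys.executable stands in for the bundled standalone Python inside the app bundle
    subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
    pip = venv_dir / "bin" / "pip"
    subprocess.run([str(pip), "install", "datasette"], check=True)

def install_plugin(name: str) -> None:
    """What an 'install plugin' menu item could call under the hood."""
    pip = venv_dir / "bin" / "pip"
    subprocess.run([str(pip), "install", name], check=True)
```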

                                                                      More on this Twitter thread: https://twitter.com/simonw/status/1432871047691333634

                                                                      Next step: figure out how to sign it all!

                                                                      1. 3

                                                                        As one of your downstream maintainers in nixpkgs, I’m fascinated and horrified in equal parts by how much bundling you’ve done. We undo some bundling when possible, like Python-specific package management, so nixpkgs has to separately package each plugin. (Edit: And I just packaged another one.)

                                                                        But your attention to detail pays off; the bundles work very well. I just used datasette-publish-fly to publish a database on Fly. Within the bundling mechanism of Docker and pip, everything worked as documented, even on NixOS.

                                                                        I’m obligated somewhat to point out that nixpkgs has a very nice mechanism for bundling declarative multi-language packages into single outputs, from dockerTools to nix-bundle. It would be a lot of effort to rewrite what you’ve already built, but it might be worth examining anyway.

                                                                        1. 7

                                                                          This is great - I hadn’t seen the nix efforts before! Thanks so much for working on this.

                                                                          My goal with the Datasette desktop app stuff I’m working on is pretty much exclusively to solve the “I use macOS or windows and I want to use this software without having to learn any alternative form of packaging - or even what that word means” - the bundle I’m producing is pretty enormous (with both a full Python and a full Electron/Node/Chromium) so I fully expect more advanced users to pick alternative installation methods!

                                                                          I should figure out how to use Nix myself and get that added to the documentation at https://docs.datasette.io/en/stable/installation.html

                                                                      1. 3

                                                                        I mainly use reader mode as the “your sodding web fonts haven’t loaded yet and I’m impatient” button, or the “you implemented your paywall exclusively using JavaScript, didn’t you?” button.

                                                                        1. 18

                                                                          I’m so excited about this trend.

                                                                          Something I think the JavaScript development world doesn’t really take into account is the burden all of their build tooling puts on us occasional JavaScript developers.

                                                                          I mainly work in Python, but I work on JavaScript maybe once a month - which is just infrequently enough that I’ve never managed to build a permanent mental model of how all of the JS build tooling works.

                                                                          As a result, it feels like every single time I have to interact with Webpack (or its many peers) something has changed or I’ve forgotten a step and as a result stuff breaks.

                                                                          This makes revisiting older projects an exercise in frustration!

                                                                          I imagine if you are a full-time JavaScript developer immersed in that ecosystem this problem fades, since you work with it every day.

                                                                          1. 3

                                                                            I imagine if you are a full-time JavaScript developer immersed in that ecosystem this problem fades, since you work with it every day.

                                                                            It fades if you keep working in the same stack of tools atop JS. A new project often means the same hurdles, just with a contextual leg up. Even a new major version of a stack component like Next.JS can mean you need to cope with changes in other components they bake in but still affect you.

                                                                            And to bring it full circle, the unusually high depth of packages importing packages in NPM has to do with the fact that baseline JS offered very little in terms of standard features, so small libraries like left-pad would be imported and depended on instead of deemed trivial. Improve the baseline and you can cut out a lot.

                                                                            1. 2

                                                                              Yep. I have a personal site that I work on. It’s mostly static, with a few APIs and a few key generated files, with hand-written HTML and CSS and vanilla JS. It works fine, and it’s basically unbreakable, but I would like to add some new features. Those new features want new libraries. And those libraries are very “oh, just add this to your packer”. I don’t have a build process or any of the stuff that they expect me to have, I’ve been out of mainstream web dev for 10 years, and I don’t have the time to learn how to get from my current state (“works great”) to a new state of “works hopefully almost as good but is usable as a base for all this new shit” because it’s an unpaid side project.

                                                                            1. 2

                                                                              I implemented the equivalent of that trigger solution for SQLite in my sqlite-utils Python library back in January: https://sqlite-utils.datasette.io/en/stable/python-api.html#python-api-cached-table-counts
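
For reference, usage looks roughly like this, based on my reading of the sqlite-utils docs linked above; treat it as a sketch and the database filename as a placeholder:

```python
# Sketch of the cached-counts feature linked above (see the sqlite-utils docs
# for the authoritative API; the database filename here is made up).
import sqlite_utils

db = sqlite_utils.Database("data.db")

# Creates a _counts table plus insert/update/delete triggers that keep it current.
db.enable_counts()

# Reads the cached values instead of running COUNT(*) table scans.
print(db.cached_counts())
```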

                                                                              1. 3

                                                                                SQLite special-cases “count(*)” to avoid a full scan. It still has to read each page of the table, but it just adds up the row-count stored in each leaf page’s header, so the CPU cost is near zero.

                                                                                1. 2

Huh, I hadn’t seen that! My experience is that a count(*) on a table with a few million rows in it still takes long enough that it’s useful to have an optimization though.

                                                                                  1. 1

                                                                                    I drew a blank looking it up in the SQLite docs just now, but I swear I read about this in some docs on the query optimizer.

                                                                                    (If it kept a count field with the pointers on the first-level interior nodes it would be much faster because it’d only have to read about 1/100 as many pages, but it probably wasn’t worth allocating ~100 bytes of the page for this specific optimization.)

                                                                              1. 15

                                                                                Python 3.9.1 and PostgreSQL 13 - those should do nicely for a few years for my projects that use them.

                                                                                Also SQLite 3.34.1: https://packages.debian.org/testing/database/sqlite3 - that’s pretty recent, from January this year.

                                                                                1. 2

This is what I’m looking forward to the most. We couldn’t resist upgrading to pg13, but py3.9 is a much-anticipated upgrade!

                                                                                1. 6

                                                                                  I am SO excited about this.

I’ve run the team responsible for maintaining the development environment tooling at a large (100+ software engineers) company, and the amount of time and money lost to engineers with broken local environments - or getting new hires spun up - or helping an engineer spin up a project they hadn’t worked with before - was astronomical.

Being able to provision a fresh, working development environment in just a few seconds is absolutely game-changing for a large engineering team like that.

                                                                                  1. 5

                                                                                    This is such a clever extension of that brilliant hack to get WASM SQLite working against big static files using HTTP Range headers from a few months ago.

                                                                                    Really interesting performance numbers too, about how much better it performs than IndexedDB even though it wraps IndexedDB. Reminds me of https://www.sqlite.org/fasterthanfs.html

                                                                                    1. 1

                                                                                      It would be really interesting to combine these and use the IndexedDB back end for offline editing and local storage and something that uses HTTP range requests for server-side persistence.

                                                                                    1. 3

                                                                                      I love this library. I built a little web app that wraps it a while ago - it’s a really fun tool to play around with: https://ftfy.vercel.app/