Threads for skunkwerks

    1. 1

      Multicore OCaml is huge news; we’ve waited a long time for this.

    2. 2

      Accepting a restricted cipher set, ed25519 and newer key types only, and IPv6- or VPN-only connections makes a massive difference in reducing log spam. But putting spiped (https://www.tarsnap.com/spiped.html) in front is the clear winner, and on most *nix setups it’s transparent with a simple config.

      1. 2

        Not sure why one would pick spiped over wireguard today, tbh. (the choice between spiped and ipsec is/was a different matter).

    3. 1

      Cannot find the package for some reason, but excited to hear this.

      root@fbsd1@i7R32G:/usr/home/v # uname -a
      FreeBSD fbsd1@i7R32G 13.1-RELEASE FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2232 GENERIC amd64

      root@fbsd1@i7R32G:/usr/home/v # pkg install podman-suite
      Updating FreeBSD repository catalogue...
      FreeBSD repository is up to date.
      All repositories are up to date.
      pkg: No packages available to install matching 'podman-suite' have been found in the repositories

      1. 3

        You may need to be using the /latest/ package branch, not /quarterly/; they’re definitely available here.

        # mkdir -p /usr/local/etc/pkg/repos/
        # sed -e s/quarterly/latest/ -e '/#/d' /etc/pkg/FreeBSD.conf | tee /usr/local/etc/pkg/repos/FreeBSD.conf
        # pkg update -r FreeBSD
        # pkg install -r FreeBSD podman-suite
        
    4. 1

      I think the scale and timeframe is fascinating.

      Mid October the threshold is 45 TERABYTES for unpaid accounts. Terabytes. Agreed, the schedule is aggressive, but I think we can all agree this is taking the piss for a free tier service. Hell, my ebook collection of 20 years is still under 3GiB. All my repos together take up at most 10GiB, which includes 7GiB of BSD project clones.

      45TiB is somebody taking the piss. Even 5 TiB is salty tears still. How many “real users” fall into this category? Maybe OpenOffice is a legit example if they store every build artefact for every platform.

    5. 1

      Are there plans for the Elixir compiler to leverage these optimizations at some point?

      1. 1

        Doesn’t Elixir generate BEAM bytecode? If so, I’d expect it to also benefit from this. That said, I was a bit confused by reading the announcement because a lot of the things that they were talking about sounded a lot like things from the original HiPE paper, and I thought HiPE had been merged around 15 years ago.

        1. 1

          HiPE has been deprecated in favour of this approach.

      2. 1

        Anything generating BEAM bytecode benefits at runtime.

    6. 2

      Anyone see any benchmarks for Hotwire? Mention of performance is conspicuously absent from DHH’s article and the Hotwire home page. This other article idealizes performance by implying updates take only 18ms, but that’s not under realistic traffic conditions and doesn’t include the DOM update itself, only the HTTP overhead.

      Generally speaking, SPAs will perform better for anything but mostly static content, especially under heavy traffic. Hotwire sends requests to the server for every DOM update and waits for the server to render HTML. SPAs send requests to the server only as needed and perform DOM updates in the browser itself.

      LiveView takes a similar approach. I hear it performs OK except for memory scaling issues. But being built on Erlang gives it the benefit of being designed from the ground up to manage stateful connections concurrently. I suspect Hotwire performs more like Blazor (i.e. not very well). It seems like it might actually perform worse under heavy traffic since Hotwire doesn’t compile application logic to wasm so it can be run client-side like Blazor does.

      1. 2

        I think a big part of the answer is that although in theory a carefully written SPA could out-perform HTML over the wire, other constraints take things very far away from optimal. Compare, for example, Jira vs Github issues (or a blast from the past like Trac, which is still around, for example Django’s bug tracker https://code.djangoproject.com/query). The latter two both feel much lighter and you spend far less time waiting, despite both being server rendered HTML.

        Another example would be my current client’s custom admin SPA (Django backend). Some pages are very slow for a whole bunch of reasons. I reimplemented a significant fraction using the Django admin (server rendered HTML, almost no javascript), in a tiny fraction of the development time, and the result felt 10 times lighter. Unfortunately this project is too far gone to change track now.

        Some of the reasons are:

        • app structure means you do far more HTTP requests, and then effectively do a client side join of data that could have been processed on the server.
        • js libraries or components encourage loading all data so you can sort tables client side, for example, slowing down page load
        • browsers are really good at rendering HTML fast, and SPAs end up downgrading this performance.
        • visibility and optimisation of server side rendering performance is massively simpler.
        • a single dev is typically responsible for a page loading speed, as opposed to split frontend/backend teams, which makes a massive difference.
        1. 2

          Interesting. I didn’t realize Github used HTML over the wire. What’s their implementation? Hotwire? Something custom? I’m digging through their blog, but the only article I’ve found that’s remotely related is their transition from jQuery to the Web Components API, which only relates to relatively small interactions in widgets.

          I’m working on a .NET project now that uses async partials in a similar manner, but user interactions are noticeably slower than a comparable app I’d previously written in Vue. The more dynamic content in the partial, the longer it takes the server to render it. There may be some performance optimizations I’m missing and I admit to being a relative novice to C#. But, in general, SPA bashing is rarely supported by evidence.

          Let’s take your Jira example. I’m looking at a Lighthouse report and a performance flamegraph of the main issue page. To chalk their performance problems up to “client side join” doesn’t tell the whole story. It takes half a second just for the host page to load, never mind the content in it. They also made the unfortunate choice of implementing a 2.2 MB Markdown WYSIWYG editor in addition to another editor for their proprietary LWML. Github sensibly has only one LWML (GFM) and you have to click on a preview tab to see how it will be rendered. I think it’s fair to say that if you rewrote all of Jira’s features, it’d be a pig no matter how you did it.

          1. 3

            Meant to reply to this earlier, then the weekend happened!

            GitHub is basically a classic Ruby on Rails app - see https://github.blog/2019-09-09-running-github-on-rails-6-0/ and https://github.blog/2020-12-15-encapsulating-ruby-on-rails-views/ - using Web Components where they need it for enhanced UI. Open up web tools and you’ll see on most pages, the bulk of the page arrives as HTML from the server, and a few parts then load afterwards, also as HTML chunks. I’m guessing they have a custom JS implementation of this; it’s not that hard to do.

            I completely agree that the comparison I made is far from the whole story, but part of my point is that other decisions and factors often dominate. Also, once you’ve gone down the SPA route, slapping in another chunk of JS for extra functionality is the path of least resistance, and justified on the basis of “they’ll only need to download this once”. While if you have HTML over the wire, where every page load has to stand on its own, I think you are more cautious about what you add.

            I can’t comment on .NET technologies. I also agree that there are times when you simply must have the low latency of Javascript in the browser with async HTTP. I have an app exactly like this - one page is very demanding on the UI front, and complex too. It’s currently about 7000 lines of Elm, but I imagine that React would have probably worked OK too. But it would be terrible, both in terms of complexity and performance, with server-rendered HTML. But in my experience quite a lot of apps really don’t need the SPA. For that app, I just have the one page that is SPA-style (and it’s a critical page, users will spend 80% of their time there), but the rest of the site, which contains a long tail of important functionality, is server-side HTML with some smatterings of JS and HTMX.

      2. 2

        I read that LiveView postmortem, and found it odd that none of the “lessons learned” included load testing (which AFAICT would have completely prevented the outage). Also, unbounded queues with no natural backpressure (aka Erlang mailboxes) are land mines waiting to explode.
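
        To make the backpressure point concrete, here is a minimal Python sketch (not Erlang, and not code from the postmortem) of a mailbox with a hard capacity that refuses messages when full rather than growing without limit. `BoundedMailbox` and the capacity of 3 are invented for illustration.

```python
import queue


class BoundedMailbox:
    """A mailbox that refuses new messages once it holds `capacity` items."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def send(self, message):
        """Try to enqueue; return False (and count a drop) when full."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def receive(self):
        """Return the oldest message, or None when the mailbox is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


# Hypothetical usage: five sends into a mailbox that holds three messages.
mailbox = BoundedMailbox(capacity=3)
results = [mailbox.send(n) for n in range(5)]
```

        With a capacity of 3, the fourth and fifth sends are refused, so the sender learns straight away that the consumer is behind and can slow down, shed load, or retry.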

      3. 2

        Looks like you skimmed the excellent post mortem there; LiveView is already used at scale elsewhere. This is more of a pubsub design issue than an Erlang VM or LiveView issue per se. If you’re streaming updates faster than they can be consumed then you have problems anyway in any system. You need to find an alternative approach, which they did.

        1. 1

          Used at scale where? I’d genuinely like to see some data.

          The article I linked to was the first really in-depth one I’ve seen. I didn’t skim it, but it’s a fair point: If you want to provide live updates without refreshing, you’re going to have scaling challenges regardless.

          1. 1

            https://smartlogic.io/podcast/elixir-wizards/s7e2-jose/ 12:00 on - but you can’t take this as a cargo cult example of success. They are using Liveview “at scale” but with what problems? ~19:00 I don’t know the exact details of where the friction happened.

            Stress test already showed millions (2015) on a single server. https://fly.io/blog/how-we-got-to-liveview/ But I think this is micro/in-theory. I tweeted at Angel, I’m curious too.

    7. 8

      I’ve been working with a similar paradigm through Elixir + Phoenix + LiveView, and it works extremely well for my one-person side projects. I’m excited to see Hotwire become the default in Rails.

      I wonder, though, does anyone have experience working with technologies like these on very large apps? I can imagine it’s challenging to scale it to larger teams – even for my small projects, I’ve found myself missing the concept of full-featured UI components.

      1. 3

        I don’t think it’s quite there yet for UI components, but it is a very active area of development; see https://surface-ui.org/

    8. 5

      And finally the native gleam to beam compiler and tool is included! Free from the shackles of rebar!

    9. 5

      This data structure looks like the same one used in constructing the Merkle tree used by the Dat protocol.

      I hadn’t heard of Segment Trees as a general data structure before, but I’ve seen them used in CouchDB’s implementation: the b-tree backing a map/reduce view stores the intermediate reduced (aggregated) values inside its interior nodes, so the reduction of a subrange can be computed quickly. I always thought this was super clever. (I think Couchbase’s CouchStore does this too, but it was only a TBD feature in the early days of 2012 when I was still working on it.)
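
      For anyone who wants to see the trick in code, here is a rough Python sketch of an array-backed segment tree over sums. It is not CouchDB’s b-tree code, just the same idea in miniature: interior nodes store the reduction of their children, so a subrange reduction touches O(log n) nodes. `SumTree` and its method names are made up for this example.

```python
class SumTree:
    """Array-backed segment tree; interior nodes hold the sum of their children."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at indices n..2n-1; node i has children 2i and 2i+1.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, index, value):
        """Set one leaf and refresh the aggregates on the path to the root."""
        i = index + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def range_sum(self, lo, hi):
        """Sum of values[lo:hi] (half-open), visiting O(log n) nodes."""
        total = 0
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and step past it
                total += self.tree[lo]
                lo += 1
            if hi & 1:  # hi is exclusive: step back onto a left sibling
                hi -= 1
                total += self.tree[hi]
            lo //= 2
            hi //= 2
        return total


tree = SumTree([3, 1, 4, 1, 5, 9, 2, 6])
before = tree.range_sum(2, 6)
tree.update(3, 10)
after = tree.range_sum(2, 6)
```

      Swapping `+` for another associative, commutative reduction (min, max, count) gives the other common variants.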

      1. 5

        Wow that Dat protocol document has amazingly good diagrams. It looks like they’re also using a range aggregation tree / segment tree structure, although it doesn’t look like the document specifies if they store it with pointers or some implicit layout.

        1. 3

          Dat is implemented in JavaScript, so it’s less likely they’re doing low-level optimizations like contiguous array storage. In their case the tree layout is more abstract, as a structure to generate deterministic digests, not as an optimization.

          You mentioned B-trees as being more efficient but harder to implement; I wonder if you could improve efficiency by simply increasing the fanout of your tree? For example, by making the branching factor 4 instead of 2. The algorithms would stay pretty similar, you just need to consider two bits at a time instead of one.
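
          One way to read the fanout suggestion, as a hedged Python sketch: the same range-aggregation idea with four children per node, where alignment is checked on the low two bits (`% 4`) rather than one. `QuadSumTree` is a made-up name, and whether this is actually faster depends on memory layout and workload; the sketch only shows that the algorithm keeps the same shape.

```python
class QuadSumTree:
    """Implicit 4-ary aggregation tree, stored as one list per level."""

    FANOUT = 4

    def __init__(self, values):
        # Pad the leaf count up to a power of 4 so every level is full.
        self.size = 1
        while self.size < len(values):
            self.size *= self.FANOUT
        leaves = list(values) + [0] * (self.size - len(values))
        # levels[0] is the leaf level; each higher level is 4x shorter.
        self.levels = [leaves]
        while len(self.levels[-1]) > 1:
            below = self.levels[-1]
            self.levels.append(
                [sum(below[i:i + self.FANOUT])
                 for i in range(0, len(below), self.FANOUT)]
            )

    def range_sum(self, lo, hi):
        """Sum of values[lo:hi] (half-open)."""
        total = 0
        level = 0
        while lo < hi:
            # Peel off entries until lo and hi are aligned to a group of
            # four (low two bits zero), then move up one level.
            while lo < hi and lo % self.FANOUT:
                total += self.levels[level][lo]
                lo += 1
            while lo < hi and hi % self.FANOUT:
                hi -= 1
                total += self.levels[level][hi]
            lo //= self.FANOUT
            hi //= self.FANOUT
            level += 1
        return total


quad = QuadSumTree(range(1, 11))  # the values 1..10
middle = quad.range_sum(3, 9)
everything = quad.range_sum(0, 10)
```

          The tree is half as tall as the binary version, at the cost of up to three extra additions per level on each side of the range.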

        2. 2

          I believe the merkle tree in-memory layout with parent hashes alternating with child hashes is called a binmap. I have seen these in a couple of places, https://scattered-thoughts.net/blog/2012/01/03/binmaps-compressed-bitmaps/ sadly link-rotted away, and https://github.com/gritzko/swift/blob/master/doc/binmaps-alenex.pdf which suggests the provenance was via TU-Delft.
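
          As I understand the layout (I can’t vouch that this matches the binmap paper or Dat’s code exactly), leaves sit at even indices and each parent sits between its two subtrees, so the tree needs no pointers. A small Python sketch of that in-order numbering, with function names of my own choosing:

```python
def depth(index):
    """Height above the leaf level: the number of trailing 1 bits."""
    d = 0
    while index & 1:
        index >>= 1
        d += 1
    return d


def offset(index):
    """Position of the node among the nodes at its own depth."""
    return index >> (depth(index) + 1)


def node_index(d, o):
    """Flat index of the node at depth d, offset o."""
    return (2 * o + 1) * (1 << d) - 1


def parent(index):
    """Flat index of the parent node."""
    return node_index(depth(index) + 1, offset(index) // 2)


def children(index):
    """Flat indices of the two children, or None for a leaf."""
    d = depth(index)
    if d == 0:
        return None
    o = offset(index)
    return node_index(d - 1, 2 * o), node_index(d - 1, 2 * o + 1)


# Four leaves occupy indices 0, 2, 4, 6; their parents are 1 and 5; the root is 3.
leaf_parents = [parent(i) for i in (0, 2, 4, 6)]
root_children = children(3)
```

          For four leaves the array reads leaf, parent, leaf, root, leaf, parent, leaf, which is the alternating pattern described above.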

    10. 1

      Quick question, what is $http_server_library?

      To be fair python is slow as hell and no one expects it to be fast. Between 2011 and 2017 there was this weird push to make fast python servers using uvloop like falcon, sanic, and other projects that don’t make sense from a performance perspective since doing anything besides a hello world with them means destroying performance.

      Using a server with python or any slow language is implicitly stating that you prefer some quality - say developer friendliness - over performance. That’s ok and all but you can’t assume that your system is better than anything else except for that quality.

      I use an asynchronous C++ single threaded server that I can run multiple procs of and which listen to the same port. They talk to FoundationDB asynchronously because it’s hooked into the event loop making batched calls. They call other services using libcurl asynchronously because it’s also hooked into the event loop. I get ~380,000 requests per second on a vm on my laptop with 4 cores, benchmarked with wrk on the same vm.

      Is it developer friendly? Not as friendly as django or ror, but with spdlog and proper logging it’s easy enough to track everything while building upon what I have. I could probably run all of Discord’s text based systems off of this setup with as many or fewer servers than they have.

      1. 1

        Final note: I didn’t mention what language or what library I used for this. To make my point, a friend did a similar test and got similar numbers with a different language/library. Then, just to make things even more ridiculous, he shoved it onto his Windows box and ran it from there. Even then, it did a stupid amount of traffic without breaking a sweat.

        I think the point of this was that, for the most part, you’re unlikely to need complex tooling to get decent performance out of any tool. You might not need Puma on top of a Ruby server, you might not need gunicorn/gevent on a Python server. Ruby/Python just by itself would likely be “good enough” without needing to tune it, or use external libraries.

      2. 1

        I’d be interested in hearing more about this setup. I’ve started dabbling in FDB so seeing some more advanced usage would be great. What can you share wrt code and experiences?

    11. 2

      https://packet.com/ are excellent bare metal hosters with BGP support and cloud style provisioning, so you get the best of both worlds: zero to tin in a couple of minutes and billing by the hour. I’ve also had good experience with netactuate.com, who sometimes have older h/w servers available at lower rates, but their provisioning isn’t as slick as packet.

      Also, cool product idea; good luck with the launch.

    12. 31

      The Dell XPS series has a firmware so bad that its engineers should be strung up in the town square for building it

      Perhaps this is nitpicking, but language like this really rubs me the wrong way. It’s short sighted because it assumes it’s all the engineers’ fault. It’s the kind of language I might expect from somebody with zero people skills and new in the industry, not from somebody who has been around for a while. There’s no place and time where suggesting we hang people because of their work should be acceptable.

      Setting that aside, I don’t understand what the point of this post is. It’s literally just a rant about laptops, but there’s no conclusion or anything. That’s of course fine for a personal blog, but I think such content does not belong on lobste.rs. I flagged the post for this reason.

      In terms of laptops, the X1 Carbon series is pretty good. Support is a bit iffy here and there (e.g. the microphone does not work until Linux 5.5), but this is true for pretty much any laptop that came out in the last two years or so. I had an X1 Carbon 3rd generation that worked perfectly, and recently replaced it with a Gen 7 since my Gen 3 was due for a replacement. They’re a bit expensive, but the X1 series is a good series.

      1. 5

        language like this really rubs me the wrong way

        Oh, please! This is obviously an over the top exaggeration used as a rhetorical device. Nobody is asking to kill anybody here. This is a common device in the English language, used often for fun, that even a non-native speaker as me was not confused about.

        1. 13

          This is essentially the same as saying “It’s just a prank!”, which is about the worst excuse for anything.

          1. 5

            No. It is just colorful language, and perfectly appropriate for a personal, light-hearted, blog post.

            1. 3

              No. The 90s wants its Torvalds back. This is never appropriate. Even if you’re joking. It’s a personal attack whether it’s a joke or not. Imagine being on the receiving end of this. Imagine walking up to one of the XPS engineers and saying this to their face!

              This blog post isn’t light-hearted - it’s full of spite - and “personal” is at its limits when you’re a high-profile developer publishing something on the Internet. So, overall, no.

              1. 4

                No. The 50s wants its censorship back. Fortunately, Monty Python showed us that it is OK to say “fuck” on TV, even at a funeral, and to mock religion, regardless of whether some people are offended.

                1. 5

                  There is a huge difference between saying “fuck”, mocking religion, and suggesting that we hang people (and for laptops out of all things).

                  I’m also unsure where you see the censorship here. Nobody is telling Drew he can’t share his opinion. But just as Drew is free to share his opinion, so are others free to hold him accountable for that; especially when he suggests we physically attack a group of people.

                  This brings me to something important and often misunderstood: the right to free speech does not give you the right to say whatever you want without repercussions. Instead, it simply means the government can’t prosecute you for expressing an opinion within the boundaries of the law. I’m pretty sure that suggesting we hang people is not only tasteless, but potentially also outside of the boundaries of free speech.

            2. 1

              how is this blog post light-hearted, it’s called “fuck laptops”

              1. 6

                how is this blog post light-hearted, it’s called “fuck laptops”

                It is light-hearted precisely because it is titled “fuck laptops”. The profanity right at the title is a clear indicator that the content of the post is not going to be extremely serious, and it will use a certain amount of hyperbole. When you say that “you are dying to go to that restaurant” nobody in their right mind is going to call a suicide line. Likewise, if I say that you should be tarred and feathered for misunderstanding such an obvious joke, nobody is going to accuse me of hate crime, death threat or intimidation.

                1. 6

                  Do you find it in the least bit strange that, in the face of multiple commenters disagreeing with your disagreement with one of the most upvoted comments on this post, your argument consists of statements like “Oh, please! This is obviously . . .”, “even a non-native speaker as me was not confused about”, “the title is a clear indicator that the content of the post is not going to be extremely serious”, and an analogy to “such an obvious joke”?

                  Doesn’t it seem like your argument that “it’s obvious” isn’t likely? If the case you’re stating was as obvious to others as it is to yourself, you wouldn’t have to make the case to so many different commenters as well as upvoters.

                  Just to be clear, I’m not saying that Drew should or should not use the rhetorical style that he did. I think he has a fair point when he says that he doesn’t post this kind of thing to lobsters and he’s just writing for himself. tptacek made a similar point about his writing on HN – he feels limited in what he can write since any random thought he posts to his blog will make it to HN.

                  1. 3

                    Isn’t it obvious in this case that your argument that “it’s obvious” cannot possibly be correct?

                    I guess everybody understood the joke, including some people who just wanted to make a fuss about it.

                2. 1

                  Swearing in a blog post is not a universally-understood signal that its contents are not supposed to be taken seriously.

        2. 4

          The point was also that the language was used to make engineers look bad without knowing the circumstances.

          Overall the tone in the post is unfriendly and offensive, a bit more than necessary for a rant.

      2. 3

        My 2016 or 2017 era XPS13 model 9360 no touchscreen is perfect.

        • kensington lock so I can take a pee at a conference without needing to carry my laptop in like a weirdo
        • sleep on screen shut and resume works and has done since day 1
        • 2 USB-A ports & a USB-C port that can drive an external display and GB network
        • onsite repair warranty: seriously, this was amazing when they came round and replaced the keyboard
        • all-day battery use while coding and sysadmin if I don’t crank brightness to full
        • dreaded coil whine never bothered me, and has gone completely after a BIOS and video driver update
        • all of the above works on FreeBSD, it’s my daily laptop, except the SD card
        • I replaced whatever wifi it came with with an Intel 8265, which is adequate

        Pity the whiners are banging on Drew. Write your own display drivers then. It’s his blog, so whatever; it’s hardly controversial and the exaggeration is not, imo, excessive.

      3. 2

        Perhaps this is nitpicking, but language like this really rubs me the wrong way. It’s short sighted because it assumes it’s all the engineers’ fault. It’s the kind of language I might expect from somebody with zero people skills and new in the industry, not from somebody who has been around for a while. There’s no place and time where suggesting we hang people because of their work should be acceptable.

        This. For what it’s worth I agree.

    13. 0

      Elixir finally learned how to sort dates. Maybe they’ll implement strftime in a few minor versions next.

      1. 11

        I know you were trying for a chuckle (which rings a bit empty, considering the complexity of the topic), but here you go, it’s scheduled for Elixir 1.11: https://github.com/dashbitco/nimble_strftime

        The discussions that birthed it: https://elixirforum.com/t/proposal-strftime-based-calendar-datetime-formatting/18734/35 and https://elixirforum.com/t/how-to-support-multiple-week-calendars-in-elixir/18783

        1. 1

          Ah, awesome! From the mailing list discussions back in the day I got the impression that core didn’t care.

      2. 2

        A tough call - having proper macros and native distributed programming functionality without risk of buffer overflow is a reasonable tradeoff. Let me know when c stdlib supports that.

        1. 4

          I think you may misunderstand what I’m getting at. That routine exists in many forms in almost all languages, and has amazing utility whenever time is involved.

          Programmers use time more often than they use macros or build distributed systems.
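
          For anyone who hasn’t met the routine, this is roughly what it looks like in Python’s standard library. It is shown only as a familiar example of the format-string style, not as Elixir’s API, and the timestamp is arbitrary.

```python
from datetime import datetime

# An arbitrary fixed timestamp, so the formatted strings are predictable.
moment = datetime(2020, 1, 27, 14, 5, 9)

# %Y, %m, %d, %H, %M and %S are numeric directives, so the results do not
# depend on the machine's locale.
iso_date = moment.strftime("%Y-%m-%d")
clock = moment.strftime("%H:%M:%S")
```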

    14. 3

      Can any lobsters using HTTPie explain what drew them away from curl or what about curl pushed them to HTTPie?

      1. 8

        I haven’t been using it for long but for me the nicest thing so far is being able to see the whole response: headers, body, and all of it syntax-highlighted by default. The command-line UI is a little nicer as well, more clear and intuitive.

        It will probably not replace my use of curl in scripts for automation, nor will it replace my use of wget to fetch files.

        Now if someone took this and built an insomnia-like HTTP client usable from a terminal window, then we’d really have something cool.

        1. 1

          I’m guessing you mean this Insomnia. Looks cool. Good example of an OSS product, too, given most features people would want are in the free one.

      2. 4

        I use both depending on circumstance (more complex use cases are better suited for curl IMO), but the significantly simpler, shortened syntax for HTTPie as well as the pretty printing + colorization by default for JSON APIs is pretty nice.

      3. 3

        I wouldn’t say I’d been ‘pushed away’ from curl, I still use curl and wget regularly, but httpie’s simpler syntax for request data and automatic coloring and formatting of JSON responses makes it a great way to make quick API calls.

      4. 3

        I like the short :8080 syntax for localhost.

      5. 3

        It’s all in how you like to work. Personally I enjoy having an interactive CLI with help and the like, and the ability to build complex queries piecemeal in the interactive environment.

      6. 3

        Sensible defaults and configurability.

      7. 2

        I need a command line HTTP client rarely enough that I never managed to learn curl command line flags. I always have to check the manual page, and it always takes me a while to find what I want there. I can do basic operations with HTTPie without thinking twice and the bits I need a refresher on — usually the syntaxes for specifying query parameters, form fields or JSON object fields — are super fast to locate in http --help.

      8. 1

        curl is the gold standard for displaying almost anything, including TLS and cert negotiation. I mostly use bat now, though, for coloured output and reasonable JSON support. https://github.com/astaxie/bat

    15. 3

      The irony is that many crustaceans desperately need to read this article and bear it in mind.

      1. 3

        Yeah we’re not HackerNews, but I’ve seen plenty of right over kind here. Definitely should be read by everyone. Repeatedly.

      2. 2

        I still think we’re better than some other people in the same space:

        https://news.ycombinator.com/item?id=21494483

        I personally try to call out people who are more interested in being right than being kind.

    16. 1

      Would be interested to hear why they decided to whip up a new language; it must have some salient features.

      1. 1

        I think Dfinity is supposed to compete against Ethereum, so presumably it needs distributed features. And anything is better than Solidity.

        1. 2

          We’re not competing with Ethereum. But you’re right about the distributed features part – we needed an approachable language that has the right semantics for the model the DFINITY network exposes.

    17. 3

      Who generates the key in this case - the host or the token?

      1. 4

        Token.

        1. 4

          Thx. I’d much prefer to generate it on an (offline) computer.

          1. 3

            On the contrary, generating on the token is safer (if it’s implemented correctly — YubiKey had a bug in a chip once), since the key can’t be extracted from it.

            1. 3

              YubiKey had a bug in a chip once

              Not just once.

              That’s why I think an air-gapped computer running an open source crypto implementation is better.

            2. 3

              I think one does not exclude the other. E.g. ESP has eFuses for storing encryption keys, which can be read-protected (only readable by the hardware encryption support): https://github.com/espressif/esptool/wiki/espefuse

              There are probably more secure elements that support this mode of operation.

            3. 3

              There is some concern that one has to “trust” Infineon, the maker of the cpu/chip, that:

              • there is no NIST/NSA style backdoor for their generated keys
              • there is no way to exfiltrate or extract a private key without your knowledge (e.g. confiscated & copied by evil agent at airport security, but on-key validation doesn’t show that this happened)

              That said, I’m happier with an ECC yubikey than a filesystem password protected private key.

    18. 1

      Epic hacking! How long did this take you to do?

      1. 2

        Uh, good question. I don’t track time :) but roughly

        • debugging the i2c-hid driver bug and the screen brightness thing took a few days of occasional poking at things
        • the TPM i2c driver took a couple days maybe
        • the little ACPI things (keyboard backlight, tablet mode switch) took a couple hours, trivial stuff
    19. 37

      Because I’d rather admin a CA, manage cert signing, handle revocation (how does this get pushed out to servers?), and all that jazz, more than running some ansible scripts? Wait.. No, I wouldn’t.

      1. 11

        Hah. I thought about this a lot when I read this article.

        I think plenty of companies grow organically from a couple of dudes and as many servers, and before you know it you have 3 branch offices and 2 datacenters and a bunch of contractors, and it’s all well and good when everyone sort of trusts each other but then you get purchased and SOX’d and you have to scramble to make sure Larry who quit 3 years ago doesn’t have root on production still…

        I assume your ansible scripts are well documented, and are run when you’re on vacation? ;)

        I thought this article made a bunch of good points. Of course it’s an advertorial, but there’s enough meat in there to be interesting.

        1. 6

          I think plenty of companies grow organically from a couple of dudes and as many servers, and before you know it you have 3 branch offices and 2 datacenters and a bunch of contractors, and it’s all well and good when everyone sort of trusts each other but then you get purchased and SOX’d and you have to scramble to make sure Larry who quit 3 years ago doesn’t have root on production still…

          Precisely this. My team went from 2 DCs with maybe a few dozen machines between them to 6 DCs in various stages of commission/decommission/use and hundreds (probably just over 1000) machines to manage. Running an ansible script to update creds on hundreds of machines takes a very long time even on a powerful runner. We’re moving to a cert-based setup and for the machines where it’s enabled it’s incredibly quick, lets us do key rotation more efficiently, and is just generally a huge improvement. It’s an economy of scale problem, as most are: ansible was fine when it was a couple of us, but not even at our relatively small Xe3 scale. I can’t imagine trying to do that on larger scales. Managing a few servers for CA and so on is a dream comparatively.

          1. 3

            What do you do with hundreds of machines?

            1. 2

              Currently? We wait.

              In the hopefully near future – something like OP

              EDIT: I feel like the brevity may be interpreted as snark, so I’m going to add some details to mitigate that as it wasn’t intended. :)

              Right now it takes a weekend or so to fully update everything. We mitigate some of it by running the job in stages (running only on pre-prod environments by product, only legacy machines, etc.). It works out to running the same job a couple dozen times. That bit is automated. The real killer is the overhead of executing that many SSH connections from a single machine, basically. Running it in smaller chunks does mean we have a not entirely consistent environment for a while, but it’s pretty quick to run the job on a single machine if it fails or was missed. The runner has got flames painted on the side which helps, but it’s still quite slow.

              I think this is probably representative of a big disadvantage that Ansible has compared to something agent-based like Chef or Puppet. On some level I’m okay with that, though, because I think Chef/Puppet would just hide the underlying issue that direct key management is a little fraught.

              1. 3

                This is why I switched from Ansible to Saltstack - deploys are fast and it has a similar feel and structure as Ansible.

                1. 1

                  So to piggy back on SaltStack, it’s also neat because you can do a distributed setup of multiple Masters.

                  That makes it even faster to roll out changes to large fleets: each master manages a subset of the fleet, with one salt master farming out tasks to the other masters, which in turn farm them out to the minions/hosts.

              2. 2

                Another option may be to use a PAM module that updates the user’s authorized_keys file (from a central repo, such as LDAP) on attempts to look up an account.

                I’ve done this in the past and it worked out okay for largish deployments.

                1. 2

                  You don’t need to update the key file on disk from ldap, you can use ldap to produce the contents of the key file directly.

                  https://man.openbsd.org/sshd_config#AuthorizedKeysCommand

                  https://github.com/AppliedTrust/goklp
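
                  A rough sketch of what such a helper could look like, in Python. sshd runs the program named by AuthorizedKeysCommand and reads authorized_keys-format lines from its stdout. The path /usr/local/bin/lookup-keys, the KEY_STORE dict (a stand-in for a real LDAP query), the example key and the function name are all made up for illustration.

```python
#!/usr/bin/env python3
# Hypothetical AuthorizedKeysCommand helper. Assumed sshd_config wiring:
#   AuthorizedKeysCommand /usr/local/bin/lookup-keys %u
#   AuthorizedKeysCommandUser nobody
import sys

# Stand-in for the directory; a real helper would query LDAP here.
# The key below is a placeholder, not a usable public key.
KEY_STORE = {
    "alice": ["ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIExampleKeyOnly alice@laptop"],
}


def keys_for(user, store=KEY_STORE):
    """Return authorized_keys-format text for user, empty if unknown."""
    lines = store.get(user, [])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # sshd passes the login name as the first argument (%u).
    if len(sys.argv) == 2:
        sys.stdout.write(keys_for(sys.argv[1]))
```

                  Printing nothing for an unknown user simply means no keys are offered from this source.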

                  1. 1

                    Also an option, but you need to ensure that there is a timeout, caching, etc. as well. Updating the on-disk copy makes these trivial and built-in, respectively.
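
                    To make the caching point concrete, here is a minimal sketch of a TTL cache around a key lookup. The class and parameter names are hypothetical, and the fetch callable is assumed to enforce its own network timeout and raise on failure. Since sshd starts the command anew for each login, a real cache would have to live in a daemon or on disk rather than in process memory.

```python
import time


class CachedKeyLookup:
    """Wrap a slow key lookup with a TTL cache (hypothetical helper)."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch = fetch      # callable: user -> list of key lines; may raise
        self._ttl = ttl          # seconds a cached answer counts as fresh
        self._clock = clock      # injectable so the cache can be tested
        self._cache = {}         # user -> (fetched_at, keys)

    def keys_for(self, user):
        now = self._clock()
        hit = self._cache.get(user)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        try:
            keys = self._fetch(user)
        except Exception:
            # Directory unreachable: serve the stale copy if there is one.
            return hit[1] if hit is not None else []
        self._cache[user] = (now, keys)
        return keys
```

                    Serving stale keys during an outage is a judgment call: it keeps logins working, but a key removed from the directory stays usable until the directory is reachable again.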

                    1. 2

                      sssd does all that, and more

              3. 1

                Gah, sorry, let me rephrase: what sort of workload is it?

                (also, why not kerberos or something similar?)

                1. 2

                  I added an edit. As for kerberos, I just found this idea first – there was a FB article about it I came across a while ago (last year sometime, before this became a real problem), and started pushing for it. I work for an International BeheMoth, so changing things can be slow.

          2. 1

            I’ve reached this point too - considering moving the base stuff to either an os pkg and/or to use something like cfengine to distribute these faster than what ansible does. As an interim stage, I have a git pull-based ansible run on each box for the core, but I would prefer something that is more “reportable” than manually collating the status of packages on each system. Either way, I’m keen to store the CA info in an OS package, as a faster way to get boxes set up and updated.

          3. 1

            Precisely this. My team went from 2 DCs with maybe a few dozen machines between them to 6 DCs in various stages of commission/decommission/use and hundreds (probably just over 1000) machines to manage. Running an ansible script to update creds on hundreds of machines takes a very long time even on a powerful runner.

            this is why you can keep your public key in a kind of centralised store, say, an LDAP server, and disable local storage of public keys entirely; sssd supports this model very nicely.

            (what irks me a bit about the advertorial above is that it conflates using host certificates and user certificates; and you can have one without another)
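
            To illustrate the distinction, a small sketch that builds the ssh-keygen command line for either kind of certificate. The flags (-s, -I, -h, -n, -V) are ssh-keygen's own; the function name, file names and default validity are assumptions for the example, and nothing is executed here.

```python
def sign_command(ca_key, pubkey, identity, principals, validity="+52w", host=False):
    """Build an ssh-keygen argv that signs pubkey with ca_key."""
    argv = ["ssh-keygen", "-s", ca_key, "-I", identity]
    if host:
        argv.append("-h")  # -h requests a host certificate instead of a user one
    argv += ["-n", ",".join(principals), "-V", validity, pubkey]
    return argv


# User certificate: principals are login names.
user_cmd = sign_command("user_ca", "id_ed25519.pub", "alice", ["alice"])

# Host certificate: principals are the host names clients connect to.
host_cmd = sign_command("host_ca", "ssh_host_ed25519_key.pub", "web1",
                        ["web1.example.com"], host=True)
```

            The two are also trusted in different places: servers accept user certificates via TrustedUserCAKeys in sshd_config, while clients accept host certificates via a @cert-authority line in known_hosts, so either can be rolled out without the other.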

        2. 3

          I’ve managed ldap systems to handle distributed ssh / user authentication. I have less fear of that than anything CA related. I think it’s because OpenSSL taught me that all the tooling around it is terrible. Though I feel that Vault and other tooling is changing that slowly.

        3. 2

          Probably about as well as CRLs get pushed out to server fleets, and accounts are actually deleted along with certificates revoked. I.e., not bloody likely. ;)

          1. 1

            I think for every sysadmin who knows their sh*t, there are 10 who don’t. This article is meant for them.

            1. 2

              Fair enough; this probably also makes more sense for large (or very large) companies with a full team of ops/secops managing fleets of servers, coupled with some type of SSO solution (as mentioned in the article).

              1. 3

                I estimate that this becomes a problem once more than 3 users need SSH access and you have more than 30 machines accepting SSH connections.

                Below that, it’s probably not worth the effort, but the moment you reach those numbers you will probably continue to grow beyond that rapidly and it’s still possible to make the change with relative ease.