Threads for kelp

    1. 9

      So I ran into almost exactly the same thing at Square around 2012. Our whole production network, including our firewalls, was getting hammered and we didn’t know why. The growth in traffic was alarming and we thought it was due to huge growth in business.

      We even did a substantial network upgrade to mitigate it. But I did the math again and the traffic growth was on track to overwhelm our upgrades in a few months. We didn’t have great instrumentation at the time but we also saw that Redis (we only had one server and a replica, IIRC) was basically saturating its network.

      Me and another engineer finally sat down in a big conference room to figure it out. After a bit of tcpdump we realized there was a set that kept getting items added to it, and not cleaned up. It was several MB in size and we were pulling it for every API call. Napkin math and it added up to like 99% of the redis traffic we were seeing.
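
      The napkin math above can be sketched roughly as follows. The set size, call rate, and the amount of other traffic are made-up placeholders, not figures from the incident; the function names are hypothetical too.

```python
# Back-of-the-envelope estimate of Redis egress caused by reading one
# oversized set on every API call. All numbers here are hypothetical.


def redis_egress_mbps(payload_bytes: int, calls_per_second: float) -> float:
    """Return network egress in megabits per second."""
    return payload_bytes * calls_per_second * 8 / 1_000_000


def share_of_traffic(big_key_mbps: float, other_mbps: float) -> float:
    """Fraction of total Redis traffic attributable to the big key."""
    return big_key_mbps / (big_key_mbps + other_mbps)


if __name__ == "__main__":
    set_size = 3 * 1024 * 1024  # assumed: a 3 MiB set
    api_rate = 40               # assumed: 40 API calls per second
    big = redis_egress_mbps(set_size, api_rate)
    print(f"big key: {big:.0f} Mbit/s")
    print(f"share of Redis traffic: {share_of_traffic(big, other_mbps=10):.1%}")
```

      Even a modest call rate multiplied by a multi-megabyte payload lands in the range of a saturated gigabit link, which is why a single key can dominate the traffic.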

      We manually truncated it and the traffic instantly dropped. Then a small PR and deploy later. All fixed.

      This was a particularly crazy case. But I always feel that any unoptimized software system has at least one 10x perf improvement to be found, if you just look.

    2. 4

      I’m responsible for approving cpu/ram/storage increase requests from developers and stories like these do kinda make me wonder if I should be as lenient as I am.

      I pretty much approve every request because what else am I going to do? Scour the source code of every app for inefficiencies? I did do that once: someone who wanted 200GB of RAM just so they could load a huge CSV file instead of streaming it from disk.
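
      For the CSV case, streaming from disk could look roughly like this. It is a minimal sketch that assumes the job only needs an aggregate over one numeric column; the function and column names are hypothetical.

```python
import csv
from typing import Dict, Iterator


def iter_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time; only the current row is held in memory."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def column_total(path: str, column: str) -> float:
    """Sum a numeric column without loading the whole file."""
    return sum(float(row[column]) for row in iter_rows(path))
```

      Memory use stays roughly constant regardless of file size, at the cost of not being able to index into the data at random.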

      Maybe it’s just a thing where trust can be built up or torn down over time.

      1. 3

        Asking from ignorance here: I’ve never worked somewhere where you had to request cpu/ram/storage. Instances or VMs, yes, but not asking to have some more RAM and having to say how much up front. How is that managed? You have processes killing containers that use more RAM than the developer asked for? Or more CPU? And… why? Is it a fixed-hardware environment (eg not in cloud) where it’s really hard to scale up or down?

        1. 3

          Yes it’s fixed-hardware (grant funded), shared amongst several different teams. My role is mainly to prevent the tragedy of commons, and a little bureaucratic speed bump is the best I could think of.

      2. 1

        Figure out how much the hardware will cost, figure out how much developer time will equal that cost, and force them to spend at least that much time profiling and optimizing their app before the request is approved?
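
        As a rough sketch of that rule, with purely illustrative prices and rates (neither number comes from the thread):

```python
def break_even_hours(hardware_cost: float, hourly_rate: float) -> float:
    """Developer hours whose cost equals the requested hardware."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return hardware_cost / hourly_rate


if __name__ == "__main__":
    # Assumed: $2,000 of extra RAM and a fully loaded developer cost of $100/hour.
    print(break_even_hours(2000, 100), "hours of profiling before approval")
```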

      3. 1

        Problem in this case is that redis is (essentially) single threaded, so give it as many cpus as you might, if something is eating it at a high rate, you’ll need to solve the root cause.

      4. 1

        I know several people who worked at Yahoo in the 00s. To get new hardware you’d have to go to a hardware board, that included David Filo.

        He would grill you and then literally log in to your servers to check utilization, and if things didn’t look good enough he’d deny your request. In one case I was told he logged in and found misconfigured RAID controllers that wasted a bunch of resources. Request denied.

        I’m not suggesting you do this. But thought it was interesting.

        1. 5

          What an utterly bananas way for the cofounder to spend their time: hire people they don’t trust then micromanage their decision making.

      5. 1

        If you don’t look at the reason behind the request, then the process seems weird… If you don’t know what the app is doing, how can you decide if it should be approved?

        1. 2

          I mean, only investigating when something unusual is requested seems like a pretty reasonable heuristic.

    3. 6

      I wonder if they’ll be using Oxide machines… sounds like they’re looking for what Oxide is selling

      1. 4

        Is Oxide actually shipping any racks yet?

        1. 8

          There’s a physical rack that visitors have taken photos of! This means they’re very close to the end of verification testing and moving into the Pilot stage

        2. 3

          They say by the end of the year, so in 2 months at most if all goes well.

    4. 2

      What about renting physical servers? Does anyone of this size do that anymore? Is it any cheaper than renting servers in AWS?

      The thing I don’t like is thinking about disk hardware and drivers and that sort of thing. I kind of like the VM abstraction, but without VMs maybe :)

      Of course some people want control over the disk, especially for databases …

      1. 8

        You can certainly rent bare metal. This is most of OVH’s and Hetzner’s business. Both of them do it really cheaply. Though Hetzner only does cloud servers in the US; their bare metal stuff is in Europe only.

        I find OVH’s stuff to be pretty janky, so I’d hesitate to do much at scale there. Equinix Metal is a pretty expensive, but probably quality, option (I haven’t tried it).

        There still exist businesses everywhere on the spectrum from “an empty room to put your servers in” to “here is a server with an OS on it.”

        And even with your VM abstraction, someone has to worry about the disks and drivers and stuff. The decision is if you want to pay someone to do that for you, or do it yourself.

        For the last couple of years I rented a single bare metal server from OVH, installed FreeBSD on it, and just used it for hobby stuff. I used ZFS for the filesystem and volume management and bhyve for VMs. I ran several VMs on it, various Linux distributions and OpenBSD. It worked great, very few issues. The server cost me about $60/month.

        But I eventually gave it up and just moved all that stuff to a combination of Hetzner Cloud and Vultr because I didn’t want to deal with the maintenance.

      2. 4

        In my experience it is cheaper than renting from AWS once you need more than a certain threshold. That threshold has stuck, rather consistently, at “about half a rack” from where I’ve observed it over the past 10 years or so.

        1. 1

          OK interesting …

          So isn’t the problem that, for whatever reason, the market doesn’t have consistent pricing for physical servers? It might be LESS than AWS, but I think it’s also sometimes “call us and we’ll negotiate a price”?

          I can see how that would turn off customers

          Looking at a provider that kelp mentioned, they do have pricing for on demand:

          And then it switches to “call us”:

          I’m probably not their customer, but that sorta annoys me …

          1. 3

            Yeah basically a lot of stuff in the datacenter and datacenter server/networking equipment space is “call us” for pricing.

            The next step past Equinix metal is to buy colocation space and put your own servers in it.

            And the discounts get stupid steep as your spend gets larger. Like I’ve seen 60-70% off list price for Juniper networking equipment. But this is when you’re spending millions of dollars a year.

            When you’re buying a rack at a time from HPE or Dell at $250K - $500K/rack (numbers from when I was doing this 5+ years ago) you can get them to knock off 20-40% or something.

            It can be pretty annoying because you have to go through this whole negotiation and you’ll greatly overpay unless you have experience or know people with experience to know what an actual fair price is.

            At enough scale you have a whole procurement team (I’ve hired and built one of these teams before) whose whole job is to negotiate with your vendors to get the best prices over the long term.

            But if you’re doing much smaller scale, you can often get pretty good deals on one off servers from ProVantage or TigerDirect, but the prices jump around a LOT. It’s kind of like buying a Thinkpad direct from Lenovo where they are constantly having sales, but if you don’t hit the right sale, you’ll greatly overpay.

            Overall price transparency is not there.

            But this whole enterprise discount thing also exists with all the big cloud providers. Though you get into that at the $10s of millions per year in spend. With AWS you can negotiate a discount across almost all their products, and deeper discounts on some products. I’ve seen up to 30% off OnDemand EC2 instances. Other things like EBS they really won’t discount at all. I think they operate EBS basically at cost. And to get discounts on S3 you have to be storing many many PB.

            1. 3

              But this whole enterprise discount thing also exists with all the big cloud providers. Though you get into that at the $10s of millions per year in spend.

              AWS definitely has an interesting model that I’ve observed from both sides. In the small, they seem to like to funnel you into their EDP program that gives you a flat percentage off in exchange for an annual spend commitment. IME they like larger multi-year commitments as well, so you’ll get a better discount if you spend $6m/3 years than if you do three individual EDPs for $2m. But even then, they’ll start talking about enterprise discounts when you are willing to commit around $500k of spend, just don’t expect a big percentage ;)

              When I once worked for a company with a very large AWS cloud spend - think “enough to buy time during Andy Jassy’s big keynote” - EDPs stopped being flat and became much more customized. I remember deep discounts to bandwidth, which makes sense because that’s so high margin for them.

            2. 1

              It can be pretty annoying because you have to go through this whole negotiation and you’ll greatly overpay unless you have experience or know people with experience to know what an actual fair price is.

              This is a key bit that people don’t realize. When I worked for a large ISP and was helping spec a brand new deployment to run OpenStack/Kubernetes I was negotiating list price from $$MM down to $MM. Mostly by putting out requests for bids to the various entities that sell the gear (knowing they won’t all provide the same exact specs/CPUs/hard drives), then comparing and contrasting, taking the cheapest 4 and making them compete for business.

              But it’s a lot of time and effort up front, and there has to be a ton of money handed over up front. With the cloud that money is spent over time, rather than front-loading it.

          2. 2

            I share your annoyance.

            I think the threshold has been pretty consistent if you think of it in terms of what percentage of a rack they need to sell… people who rent servers out by the “unit” drop below the AWS prices once you occupy around half a rack. And yes, I’ve had to call them to get that pricing.

            It’s a little annoying to have to call.

            And furthermore, things can be cheaper at different points in a hardware cycle, so it’s a moving target.

            I think some of it is down to people who peddle VMs being able to charge per “compute unit” but people who peddle servers (or fractions of servers) not being able to go quite that granular.

            If you rent in “server” units, you need to be prepared to constantly renegotiate.

      3. 2

        There are certainly businesses which are built on the idea that they are value-added datacenters, where the value is typically hardware leasing and maintenance, fast replacement of same, networking, and perhaps some managed services (NTP, DHCP, DNS, VLANs, firewalls, load balancers, proxies…)

      4. 1

        “Single VM on a physical host” is a thing I’ve seen (for similar reasons you mention: a common/standard abstraction), not sure how often it’s used at this sort of scale though.

      5. 1

        The thing I don’t like is thinking about disk hardware and drivers and that sort of thing. I kind of like the VM abstraction, but without VMs maybe :)

        I think you trade one thing for another. You have to think about other things. If you want you could also run your own VMs of course, but honestly, you just have an OS there and that’s it, and if you don’t want to think about storage you can just run MinIO or SeaweedFS, etc. on some big storage server and add as you go. And if you rent a dedicated server and it really happens to have a disk failure (the disks are usually in a RAID anyway) you just have the disk replaced and use your failover machine, like you’d use your failover if your compute instance starts to have issues.

        It’s not like AWS and others don’t have errors; it’s just that you see them differently, and sometimes Amazon notices before you do and just starts that VM up somewhere else (that works automatically as well if you run your own VMs; it’s widely implemented, “old” technology). I see all of this as server errors, whether on a rented physical machine or a virtual machine. In both situations I’d open some form of ticket and have the failover handle it.

        It’s not like cloud instances are magically immune, and that stuff often breaks the abstraction as well. The provider might detect it, but there is still a certain percentage of VMs/instances becoming unavailable or, worse, half-unavailable: clearly still doing something despite having been replaced. In my opinion that is a lot more annoying than knowing about a hardware fault, because with hardware issues you know how to react, while with instances doing strange things it can be very hard to verify anything at that layer of abstraction, and eventually you will be passed along through AWS support. If you are lucky, of course, it’s enough to just replace the instance, but that is pretty much an option with hardware as well, and dedicated hosting providers are certainly automating all these things and are pretty much there.

        There are hardware/bare metal clouds, but to be fair, they are close yet still lag behind. Here I’m thinking in terms of infrastructure as code. I think that’s slowly coming to physical machines as well with the “bare metal cloud” topic; it just wasn’t the focus so much. I really hope that hosters keep pushing that and customers use and demand it. AWS datacenters becoming pretty much equivalent to “the internet” is scary.

        But if you use compute instances that you created yourself (rather than something at a much higher level, in the realm of Heroku, etc.), it certainly doesn’t make a huge difference. Same problem, different level of abstraction. It’s probably noticed slightly more on physical machines because of the automated VM failover mentioned above, though that only works in specific cases. In either case you will need someone who knows how the server works, be it virtual or physical.

    5. 21

      The second is when your load is highly irregular.

      That is something that maybe should be explained better, because I frequently see wrong assumptions here. It doesn’t mean “you intend to grow, potentially by a lot”, it also doesn’t mean “we have more/fewer users during weekdays/weekends/nighttime/…”. It means that you have such huge differences that you cannot afford simply owning or leasing the resources (which also might be handy in case of DDoS or sudden load spikes) on a month-to-month basis. Furthermore you need to have the infrastructure (know-how, automation, applications actually supporting this) in place and in shape in a way that doesn’t outweigh what you save scaling up and down. It also means that you cannot benefit as much from reserving an instance for some time.

      And then when you calculate through this and compare prices of owning (or renting with a pay per month subscriber) the resources the cloud infrastructure should be cheaper. This usually means you very frequently scale down by more than a factor of 10 and this cannot be planned by months or something. Even then classical/cheaper providers might have better suited short term options.

      I point that out, because somehow people seem to think it’s like that. Many people seem to be really out of touch on the expense side on raw (cloud/dedicated/hosting/vserver) hosting costs, and then they realize that cloud doesn’t mean you don’t need Ops people, and both devs and ops need to actually make these things work and so on. It even goes so far that people are in denial, when someone takes a closer look, sometimes even about things like that they have DevOps/SRE people that do nothing all day but making sure things work on the cloud. Whole expensive teams are ignored.

      Of course it all depends on usage, per-case situations, etc., but when people use AWS like a really expensive 2000s vServer hoster, with their WordPress or off-the-shelf PHP webshop installed on a compute instance, and act like they are modern and benefiting from using the cloud, it gets bizarre. And when it gets more complex, it seems like companies are just using “the cloud” because they are told to, be it because it’s modern, best practice, or claims to make things easier (it might, but sometimes it also makes things more complex). It feels like people using an electric can opener to slice bread, because they consider it more modern and convenient. And I say that as someone who earns his living with people using the cloud.

      It feels like the marketing worked too well.

      Like I said it all really depends on the use case, but I think it would be wise to get away from “you have to use the cloud, because it’s the future” or similar hand-wavy arguments. Of course the same is true for any other technology that is being hyped right now. But that would go off-topic. In other words: Take a step back from time to time and reflect on whether what you are doing really makes sense in practice.

      1. 15

        I was running a webpage for school snow days. Mostly zero traffic, except for snow days, when it goes up a lot. I found autoscaling worked poorly because the scaling takes ~10 minutes, during which time the existing servers would get overloaded and die. If I had it to do again, I would just make it a static site on S3 or something, so it didn’t need autoscaling or (expensive) overprovisioning because S3 is cheap overprovisioning.

        1. 1

          I had a similar experience with a charity website. For one week of the year it would see several million visitors but barely see 100k a month for the rest of the year. In this case we provisioned multiple high core count servers behind a load balancer and rode out the storm without any degradation to service.

          The majority of that website was static content and nowadays I would do the same as you, build it as a static website and chuck it on S3; the dynamic content could then be loaded via a very lightweight backend with extensive use of caching.

          1. 2

            When flopped, the rescue team basically had to resort to a static page with a Go app that acted as a gatekeeper to let traffic trickle in as resources allowed: Not a universally applicable solution (most customers won’t wait in a queue to come back) but the principles of why the original version flopped (brittle, overly-complex, hard to reason about capacity) are pretty applicable to other circumstances.
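
            The gatekeeper idea can be sketched as a small admission gate. This is a Python illustration under stated assumptions, not the Go app mentioned above: the class, its methods, and the fixed-capacity FIFO policy are all hypothetical.

```python
from collections import deque
from typing import Optional


class AdmissionGate:
    """Admit at most `capacity` visitors at once; queue the rest in FIFO order."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.active = set()     # visitors currently allowed through
        self.waiting = deque()  # visitors held on the static waiting page

    def arrive(self, visitor: str) -> bool:
        """Return True if admitted immediately, False if placed in the queue."""
        if len(self.active) < self.capacity:
            self.active.add(visitor)
            return True
        self.waiting.append(visitor)
        return False

    def leave(self, visitor: str) -> Optional[str]:
        """Free a slot and return the next visitor admitted from the queue, if any."""
        self.active.discard(visitor)
        if self.waiting and len(self.active) < self.capacity:
            admitted = self.waiting.popleft()
            self.active.add(admitted)
            return admitted
        return None
```

            The point of the design is that capacity becomes one explicit number that is easy to reason about, instead of an emergent property of the whole stack.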

      2. 2

        the same is true for any other technology that is being hyped right now

        I think the key thing is to remember that “right now” just means currently, at the time of discussion/evaluation. It’s not like the things being hyped in October 2022 are specifically overhyped way more than the things being hyped in 2015 were.

        1. 1

          Exactly. :)

      3. 1

        sometimes even about things like that they have DevOps/SRE people that do nothing all day but making sure things work on the cloud. Whole expensive teams are ignored.

        I mean, as “an SRE person”, “the cloud” means I’m building higher level abstractions and automating things rather than maintaining ansible scripts and minding hosts. It doesn’t suffice to look at whether or not cloud shops employ SREs, you have to look at the value they deliver (I posit that I’m working on more valuable things because some cloud provider takes care of much of the tedium). Another way to look at it would be to compare how many SREs we would need to employ to provide the same foundational services (with comparable SLAs, etc) and how does that hypothetical expense compare with our cloud bill.

      4. 1

        And it’s even harder to get this all right on something like AWS where to get the best pricing you need to commit to some level of reservations with SavingsPlan. But if you overbuy, you’re throwing away money, and if you underbuy, you’re also wasting money paying for OnDemand instances.

        It kind of works out if you have highly variable load, and can do SavingsPlan for your baseline, and then do OnDemand for the peaks. Or if your team is competent enough, do Spot to really save some money.

        And then of course, there is always the risk that AWS just doesn’t have the instances you need when you need them. You can reserve capacity too, but then you’re locked into paying for it. Again, if you’re competent enough you can set up your app to use a different instance type depending on cost or availability… Then you just added even more complexity.
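
        The overbuy/underbuy trade-off described above can be sketched with a toy cost model. It simplifies heavily: the commitment is modelled in whole instances rather than dollars per hour, and both rates and the load profile are invented for illustration, not AWS prices.

```python
from typing import Sequence


def hourly_cost(load: int, committed: int,
                committed_rate: float, on_demand_rate: float) -> float:
    """Cost of one hour: the commitment is paid whether used or not,
    and anything above it is billed at the on-demand rate."""
    overflow = max(0, load - committed)
    return committed * committed_rate + overflow * on_demand_rate


def total_cost(loads: Sequence[int], committed: int,
               committed_rate: float, on_demand_rate: float) -> float:
    """Total cost over a series of hourly instance counts."""
    return sum(hourly_cost(load, committed, committed_rate, on_demand_rate)
               for load in loads)


def best_commitment(loads: Sequence[int],
                    committed_rate: float, on_demand_rate: float) -> int:
    """Try every commitment from 0 up to peak load and return the cheapest."""
    return min(range(max(loads) + 1),
               key=lambda c: total_cost(loads, c, committed_rate, on_demand_rate))


if __name__ == "__main__":
    # Assumed day: 20 quiet hours at 10 instances, 4 peak hours at 50.
    day = [10] * 20 + [50] * 4
    c = best_commitment(day, committed_rate=0.6, on_demand_rate=1.0)
    print("commit to", c, "instances; cost", total_cost(day, c, 0.6, 1.0))
```

        In this toy profile the cheapest commitment sits at the baseline, and committing either above or below it raises the total, which is the point being made about getting reservations right.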

    6. 3

      This advice is accidentally good. AWS, in particular, is not a good public cloud; so, avoiding building upon AWS happens to be a good choice. For example, the diagram halfway down the page only makes sense because AWS’s Kubernetes offering is not very good; on any other public cloud with a Kubernetes offering, the choice to just use Kubernetes would be easy.

      I want to tilt your perspective a little. Imagine that you already have an existing footprint with N containers working in concert. After a planned deployment of a new feature, the footprint will have N+1 containers. What is the marginal cost of adding a new container to your existing infrastructure? I believe that your point is that N is small, even N=1, for startups and small businesses. In my experience, though, N has never been smaller than maybe N=5 (from my failed small business.)

      1. 3

        This really depends on how you define “good public cloud”.

        In 2020/2021 I was leading about 40 engineers responsible for a fairly well known cloud product that offers its products on AWS, GCP and Azure.

        We had roughly 1000 k8s clusters to look after, all using the cloud providers’ managed k8s services. While AWS is probably the most complex to set up, doing the least for you, for our use it probably worked the best.

        And they certainly have the best support. Azure had the worst support; they would give us actively harmful advice about operating k8s.

        1. 1

          Ignoring ethics for a moment, it’s worth remembering that AWS (2006) is one of the oldest public clouds with its API-driven design. In general, AWS is lacking because it has primacy. Many AWS products are clearly old, and we might imagine that the burden of supporting massive legacy customers is sufficient to delay the introduction of superior APIs. This is also visible in older cloud-like products like Google App Engine (2008) compared to Google Cloud Platform (2013); while both are API-driven, there is a clear ossification of App Engine APIs. Or, another good example is SoftLayer (2005) which was acquired under IBM’s banner and remixed into Bluemix (2013). In both cases, Kubernetes support would have not been possible on the older, less flexible stacks.

          Keep in mind that Kubernetes (2014) itself was released after many of these clouds were designed and in beta releases; it wasn’t obvious back then that we could permanently avoid cloud-vendor lock-in by working with a high-level vendor-neutral cluster orchestration API. Some public clouds joyously embraced Kubernetes (like GCP and Bluemix), while others took their time (like DigitalOcean and AWS).

          If we reconsider ethics, then Amazon profits from AWS, and Amazon directly commits human-rights and labor-rights abuses. Is this “good”?

    7. 19

      My thinking is:

      • at the low end, services are given away for free or below cost as a loss leader to gain customers
      • at the highest end, prices have to be close to competitive because big businesses will eat an N year $X million dollar migration if it saves $Y million annually
      • therefore, avoid being medium sized!
      1. 8

        Hmm. I wonder if all cloud sceptics think the cloud is obviously unsuitable for their scale, but just imagine it must be cost-effective at others?

        I’ve done a bunch of time at companies in the build-your-own-datacenter regime, and it doesn’t seem to make financial sense for them to use the cloud for anything, even if you assume you’ll pay a small fraction of list prices. (Sometimes they still did, but this was pretty transparently motivated by middle managers working on their resumés). I’ve always assumed that AWS works by attracting customers near the start of their life, and retaining them by being so deeply embedded into their architecture that it never seems worth the immediate pain of switching. After all, this is a world where no tech company can write software fast enough to satisfy their product department—they’re not going to want to stop making customer commitments for a year while they rewrite all their stuff to not depend on Amazon.

        So if, as OP is arguing, it doesn’t make much sense at small scales either, I wonder if the reasons for it are a bit subtler. One thought that’s crossed my mind before is that by paying Amazon for things one can sneak a bunch of stuff into the budget that’d never get past the bean counters otherwise. You could build an in-house network as fast as AWS’s, and then your developers wouldn’t ever have to worry about network topology, but you won’t be allowed to add a zero to your networking infra costs just so your developers can be lazy. Maybe paying Amazon is just a way to pamper programmers.

        1. 7

          To be clear, I am at the low end, so cloud is cost effective for me because I only pay a couple hundred per year for hosting. There’s very little money to be saved vs. what I’m paying now.

          One thought that’s crossed my mind before is that by paying Amazon for things one can sneak a bunch of stuff into the budget that’d never get past the bean counters otherwise.

          That was the primary motivation for me to use AWS at my last job. It was already billing, so I could do whatever I wanted without asking anyone.

        2. 4

          You basically hit on the reason. If you have huge demand from product for new features, then it makes sense to pay a cloud provider premium so you can dedicate engineering to your own product. In this situation, engineering time is probably more scarce than money.

          That said, the other nuance is if your workload is highly predictable, then building your own data center is gonna save you a lot of money. At the end of the day, AWS, etc. are not getting that much better of a deal on Intel CPUs, RAM and other physical bits that make the cloud. Plus they need margin on top of that. So at some scale it can be done cheaper, if you can predict demand.

          If your workload is not very predictable then your data center is either going to be too big for utilization and you spend too much. Or you’re holding back product or new customer onboarding due to capacity. In those cases the cloud premium also makes sense for the flexibility.

        3. 2

          One thought that’s crossed my mind before is that by paying Amazon for things one can sneak a bunch of stuff into the budget that’d never get past the bean counters otherwise.

          Been there :-)

    8. 3

      This is super cool, thanks for making this. I could very much see this being used for an AWS Lambda-like service, or an alternative container orchestration system. Though I haven’t thought through the implications of the networking limitations outlined in the libkrun README.

    9. 11

      Which is why Mozilla Firefox is such a breath of fresh air, as it uses much less of your CPU while still delivering a fast browsing experience. So feel free to have as many tabs open as you want in Firefox, your device will barely feel the effects.

      I use Firefox everyday but let’s be real here.

      I am rather tab-phobic but a few contemporary websites and one modern app like Figma puts my Firefox into a swapping tailspin after a couple hours of use. This may be better than Chrome, but it feels like the bad old days of thrashing your hard drive cache.

      To remedy this, it seems the developers decided to unload tabs from memory. It has made Firefox more stable, but page refreshes come surprisingly frequently.

      1. 16

        I am rather tab-phobic but a few contemporary websites and one modern app like Figma puts my Firefox into a swapping tailspin after a couple hours of use.

        I don’t entirely disagree but maybe some of the fault here lies with the engineers who decided that a vector graphics editor should be written as a web page – it would use several orders of magnitude fewer resources if it were simply a native executable using native graphics APIs.

        There’s only so much that browsers can do to save engineers from their own bad decisions.

        1. 6

          Figma being in browser is a product decision much more than an engineering decision. And being in browser is their competitive advantage.

          It means people can be up and running without installing a client. Seamless collaboration and sharing. These are huge differentiators compared to most things out there that require a native client.

          Yeah, I hate resource waste as much as the next person. But using web tech gives tools like Figma a huge advantage and head start for collaboration vs native apps. But yes, at some cost in client (browser) resources and waste.

          1. 1

            Figma being available in the browser is a competitive advantage, yes, in that it facilitates easy initial on-boarding and reduces friction for getting a stakeholder to look at a diagram.

            But there’s zero competitive advantage for Figma only being available as a browser app – once customers are in the ecosystem there’s every reason for the heavy Figma users to want a better performing native client, even while the crappier web app remains available for the new user or occasional-use stakeholder.

            Figma sort-of recognizes this – they do make a desktop version available for the heavy user, but it’s just the same old repackaged webapp garbage. And limiting themselves to just repackaging the web app is not a “competitive advantage” decision so much as an engineering decision to favour never having to learn anything new ever (once you’ve got the JavaScript hammer, everything’s a nail) over maybe having to learn some new languages and acquire some new skills, which used to be considered a norm in this industry instead of something for engineers to fear and avoid at all costs.

            1. 3

              I’m friends with an early engineer at Figma who architected a lot of the system and knows all the history.

              They say that if they had done a cross-platform native app, it would have been nearly impossible to get the rendering engine to produce exactly the same result across platforms. Even for the web, they had to write their own font renderer.

              Yes, a native app could be faster, but it’s a major tradeoff: collaboration, networking and distribution features, the security sandbox, much of that is just given to you by the browser. With a native app you have to build it all yourself.

              They started with a native app and ended up switching to web only. They also write Ruby, Go and Rust.

              And the Figma app is written in C++, compiled to wasm.

    10. 4

      I feel ashamed asking this, but can someone point me in a good direction for knowing why I would pick (or not pick) BSD over a standard Linux like, say, Debian? My Google searches have miserably failed in providing a decent unbiased/not marketing/GPT-3-generated recap.

      1. 11

        Keep in mind, BSD is not one thing. They all have a common lineage, but FreeBSD and NetBSD both split from 386BSD around 1993. OpenBSD split from NetBSD in 1995, and DragonFlyBSD split from FreeBSD in 2003.

        The first BSD release was in 1978. It was 15 years later (1993) that NetBSD and FreeBSD diverged, and it’s been 29 years since then. So the BSDs have spent almost twice as long going their separate ways as they spent as a single lineage.

        They each have their own philosophies and priorities, so it’s about which one aligns with yours.

        But I think there are a few things that tie them together.

        1. A base system that is a full operating system, user land and kernel, all developed in a single code base. This is the biggest difference with Linux, IMO. And it can mean a much more cohesive feeling system. And IMO BSD man pages are of higher quality than on Linux. You can have the whole source of the base system sitting on that system.

        2. A ports system that is fundamentally based on Makefiles and compiling from source. However they all now have pre-built binary packages that can be installed. That was not always the case, it used to be you always had to build ports from source.

        My own take on their differences from each other:

        FreeBSD still cares the most about being a good server and does have some big high scale use at places like Netflix, and was used heavily at Yahoo when they were still relevant. FreeBSD tends to maybe be the more pragmatic of the group, but that also can mean it’s a bit more messy. There is sometimes more than one way to do the same thing, even in the base system. They have advanced features like ZFS, and Bhyve for VMs. This can make for a pretty powerful hypervisor. This is where I use FreeBSD. FreeBSD probably has the most users of them all.

        OpenBSD tends to be my favorite. Some of their development practices can seem esoteric. To get a patch included, you mail a diff to their mailing lists, and they still use CVS. They care a lot about security and do a lot of innovation in that area. They care less about things like backwards compatibility, often breaking their ABI between releases. Their developers use OpenBSD as daily drivers, so if you run OpenBSD on a laptop that is used by the right developers, pretty much everything will just work. Their manpages are excellent, and if you take the time to read them you can often figure out how to do most things you need. There is typically only one way to do a thing, and they tend to aggressively remove code that isn’t well maintained. Like OpenBSD doesn’t support bluetooth because that code didn’t work well enough, and no one wanted to fix it. So they just removed it. By modern standards OpenBSD has pretty old filesystems, you’ll need to fsck on a crash, and their multi-processor support and performance still lags far behind FreeBSD or Linux. I generally find that OpenBSD feels substantially slower than Linux when run on the same laptop.

        NetBSD I haven’t used in a LONG time. But for ages their primary goal was portability. So they tended to run on many different types of hardware. I’m not sure if they have enough developers these days to keep a huge list of supported hardware though. They currently list 9 tier 1 architectures, where OpenBSD has 13. I think NetBSD still tends to be more used by academics.

        DragonFlyBSD I’ve never actually installed, but I remember the drama when Matt Dillon split from FreeBSD in 2003. Their main claim to fame is the HAMMER2 filesystem and a different approach to SMP from what FreeBSD was trying to do in the move from FreeBSD 4.0 to FreeBSD 5.0 (~2003)

        With all of the BSDs you’re going to have a little bit less software that works on it, though most things will be found in their ports collection. You’ll probably have a more cohesive system, but all the BSDs combined have a small fraction of the developers that work on just the Linux kernel.

        1. 1

          The last time I ran FreeBSD there were at least 2 different ways of keeping ports up to date, both of which were confusing and under-documented. Maybe the situation is better now.

      2. 7

        I think it’s mostly a matter of personal preference. Here’s a list of reasons openbsd rocks: but for me, I prefer the consistency over time of OpenBSD, the fact that my personal workflows haven’t significantly changed in 15 years, and that the system seems to get faster with age (up to a point). Also, installing and upgrading are super easy.

      3. 4

        Long time OpenBSD developer here. I think @kelp’s reply is mostly accurate and as objective as possible for such an informal discussion.

        I will add a short personal anecdote to it. As he says, all my machines were running OpenBSD before the pandemic. In the past I kept my online meetings on my phone because OpenBSD is not yet equipped for that.

        Being a professor at the university, this new context meant that I had to also hold my courses online. This is more complicated than a plain online meeting so I had to switch back to Linux after more than 15 years.

        The experience on Linux, production wise, has been so good that I switched all my machines over. Except my home server. I don’t mean just online video meetings and teaching, but also doing paperwork, system administration (not professionally, just my set of machines and part of the faculty infrastructure), and most importantly running my numerical simulations for research.

        Now that we are back to normal over here, I could switch back to my old setup but I am finding it really hard to convince my new self.

        This is just a personal experience that I tried to report as objectively as possible.

        On a more opinionated note, I think the trouble with BSDs is that there is no new blood coming, no new direction. Most of them are just catching up on Linux which is a hard effort involving a lot of people from the projects. It is very rare to find something truly innovative coming from here (think about something that the other projects would be rushing to pull over and integrate, just like the BSDs are doing with Linux).

        If nothing happens the gap will just widen.

        From my porting experience I can tell you that most open source userland programs are not even considering the BSDs. They assume, with no malevolence, that Linux will be the only target. There are Linuxisms everywhere that we have to patch around or adapt our libc to.

        To conclude, in my opinion, if you want to study and understand operating systems go with the BSDs, read their source, contribute to their projects. Everything is very well written and documented unlike Linux which is a mess and a very poor learning material. If you just want to use it for your day to day activities and you want an open source environment, then go with mainstream.

    11. 15

      AWS’ basic model is to charge very, very high sticker prices, and then make deals to discount them aggressively for customers who can negotiate (or for startups, or for spot instances, etc). GCP mostly charges sticker prices. I’m sure they would like to get to an AWS-like model, but they’re still pretty small and don’t have that much market power yet.

      1. 18

        This is one of my least favorite qualities of AWS, but it is a really important discussion point for cloud pricing. No customer of significant volume is paying sticker price for AWS services. GCP is looking for names to give discounts to so they can put you in their marketing. AWS will give discounts to just about anybody with more than $100k in annual cloud spend (and also put you in their marketing). Not sure where Azure falls on the pricing negotiation spectrum.

        1. 30

          It’s frustrating since one of the original promises of cloud was simple, transparent pricing. It hasn’t been that way for at least 5 years though.

          1. 14

            It’s actually been quite funny to see everything come full circle. A la carte pricing was a huge original selling point for cloud. Pay for what you use was seen as much more transparent, but that’s proven not to be the case since most orgs have no clue how much they use. Seeing more and more services pop up with flat monthly charges and how that’s now being claimed as more transparent than pay-as-you-go pricing has been an amusing 180.

          2. 4

            It’s better than the status quo before where everything was about getting on the phone with a sales rep and then a sales engineer and you had no idea what other companies/netops were getting unless you talked to them. But that’s not saying it’s a good situation. I wonder if there’s room for a cloud provider that is actually upfront about their costs with none of the behind-the-scenes negotiation silliness, but I’m hard-pressed to find how that would earn them money unless they either charge absurd prices for egress bandwidth or they end up hosting a unicorn which brings in some serious revenue.

      2. 3

        If you’re spending millions a year with GCP you can get discounts on various things. Especially if you’re spending millions per year with AWS and are willing to move millions of that to GCP and can show fast growth in spend.

        I’ve also seen 90% (yes 90%) discounts on GCP egress charges. But not sure if they are now backing away from that.

        As the article points out, AWS gouges you an insane amount on egress and are quite unwilling to discount it. I have seen some discounts on cross AZ traffic costs though.

      3. 3

        There are definitely some paying special price for GCP. You gotta be pretty big.

      4. 2

        And they’ll never get there, given that the entire company is built around the goal of never actually talking to customers. Goes against the grain of manual discounting.

    12. 14

      Is there any evidence at all that more efficient languages do anything other than induce additional demand, similar to adding more lanes to a highway? As much as I value Rust, I quickly became highly skeptical of the claims that started bouncing around the community pretty early on around efficiency somehow translating to meaningful high-level sustainability metrics. Having been privy to a number of internal usage studies at various large companies, I haven’t encountered a single case of an otherwise healthy company translating increased efficiency into actually lower aggregate energy usage.

      If AWS actually started using fewer servers, and Rust’s CPU efficiency could be shown to meaningfully contribute to that, this would be interesting. If AWS continues to use more and more servers every year, this is just some greenwashing propaganda and they are ultimately contributing to the likelihood of us having a severe population collapse in the next century more like the BAU2 model than merely a massive but softer population decline along the lines of the CT model. We are exceedingly unlikely to sustain population levels. The main question is: do we keep accepting companies like Amazon’s growth that is making sudden, catastrophic population loss much more likely?

      1. 5

        We’ve always had Gates’ law offsetting Moore’s law. That’s why computers don’t boot in a millisecond, and keyboard to screen latency is often worse than it was in the ‘80s.

        But the silver lining is that with a more efficient language we can get more useful work done for the same energy. We will use all of the energy, maybe even more (Jevons paradox), but at least it will be spent on something other than garbage collection or dynamic type checks.

      2. 4

        I can tell you that I was part of an effort to rewrite a decent chunk of code from Python to C++, then to CUDA, to extract more performance when porting software from a high-power x86 device to a low-power ARM one. So the use case exists. This was definitely not in the server space though, I would love to hear the answer to this in a more general way.

        I’m not going to try to extrapolate Rust’s performance into population dynamics, but I agree with the starting point that AWS seems unlikely to encourage anything that results in them selling fewer products. But on the flip side if they keep the same number of physical servers but can sell more VM’s because those VM’s are more lightly loaded running Rust services than Python ones, then everyone wins.

      3. 3

        I’ve spent a big portion of the last 3+ years of my career working on cloud cost efficiency. Any time we cut cloud costs, we are increasing the business margins, and when we do that, the business wants to monitor and ensure we hold onto those savings and increased margins.

        If you make your application more energy efficient, by whatever means, it’s also probably going to be more cost efficient. And the finance team is really going to want to hold onto those savings. So that is the counterbalance against the induced demand that you’re worried about.

    13. 12

      One way to think about Kubernetes is that it is an attempt by AWS’ competitors to provide shared common higher-level services, so that AWS has less ability to lock-in its customers into its proprietary versions of these services.

      It’s not unlike how in the 90s, all the Unix vendors teamed up to share many components, so they could compete effectively against Windows.

      1. 6

        Yeah I agree. I just don’t think Kubernetes is actually very good. It’s operationally very complex and each of the big 3 providers have their own quirks and limits with their managed k8s service. At my previous employer we were responsible for many hundreds of k8s clusters, and had a team of 10 to keep k8s happy and add enough additional automation to keep it all running. The team probably needed to be twice that to really keep up.

        I keep wondering if there is an opportunity to make something better in the same area. HashiCorp is trying with Nomad. Though I don’t have any direct experience with Nomad to know if they are succeeding in making a better alternative. It integrates with the rest of their ecosystem, but separates concerns. Vault for secrets management and Consul for service discovery and stuff.

        1. 4

          This sounds like progress! OpenStack was bad, K8s is not very good, maybe a new contender will be acceptable, verging on decent. ;)

          1. 1

            I wish I could upvote this multiple times. (openstack flashback intensifies).

            More seriously, from what I heard k8s really seems to be a lot easier to handle than OpenStack and its contemporaries. We had a very small team and at times it felt like we needed 6 out of 12 people (in the whole tech department of the company) just to keep our infra running. I’ve not heard such horror stories with k8s.

          2. 1

            I want to know why nomad isn’t on that “acceptable verging on decent” list.

            1. 1

              I don’t know anything about nomad.

      2. 4

        That was sorta how I thought about OpenStack, but I get the impression that software wasn’t really good enough to run in production and fizzled out as a result.

        Not quite the same though because OpenStack was trying to be an open-source thing at the same level as EC2 + ELB + EBS, rather than at a higher level?

        1. 3

          Now, I never actually deployed openstack, so I may not know what I’m talking about. But I always got the impression that Openstack was what you got when you had a committee made up of a bunch of large hardware vendors looking after their own interests. The result being fairly low quality, and high complexity.

          1. 2

            I didn’t personally either but I saw someone try and just bail out after, like, a week or two.

            1. 2

              I actually saw someone put significant resources into getting an Openstack installation to work. It was months and months for a single person, and the end result wasn’t extremely stable. It could have been made good enough with more people, but unfortunately at the same time, AWS with all its offerings was much much easier.

              Kubernetes seems like a marginally better design and implementation of the same architectural pattern: the vendor-neutral cluster monster.

              1. 1

                The problem usually was that you wanted some of this compartmentalization (and VMs) on premise, which is why AWS was out. In our case we simply needed to be locally available because of special hardware in the DC. We thought about going into the cloud (and partly were), but in the end we still needed so much stuff locally that OpenStack was feasible (and even Docker wasn’t, because of multicast and a few other things iirc).

    14. 2

      I have the ThinkPad P1 Gen 3 with 4K screen, Intel i9-10885H, and Quadro T2000 Max-Q. It’s basically the same laptop as this review, but a Quadro instead of GeForce GPU. It’s basically maxed out across the board and I even added a 2nd SSD. Feels great to use, but battery life is not great, it requires a special charger with a special port to charge. Doing just about anything with it makes it warm / hot and the fans spool up quite loud. This happens in both Linux and Windows.

      I also have a MacBook Air with an M1. It doesn’t even have a fan, hardly ever gets even warm, beats the Thinkpad on all but the GPU portion of Geekbench. And feels subjectively faster at almost everything, the battery lasts all day, charges fast on standard USB-C (doesn’t have to be a huge wattage charger) and the laptop speakers sound better.

      I prefer the ThinkPad screen slightly, especially since it’s 2” larger. The ThinkPad keyboard is a bit nicer, but the MacBook Air keyboard is much improved over the abomination that Apple used to ship. My hatred for those keyboards was what got me on the ThinkPad train.

      I end up using the MacBook Air FAR FAR more, even though maybe I prefer Linux a little over macOS.

      When Apple ships a 14” or 16” MacBook Pro with >= 32GB of RAM it’s going to be really hard to keep me using a ThinkPad for anything other than a bit of tinkering with Linux or OpenBSD (I also have an X1 Carbon Gen7 for OpenBSD).

    15. 1

      If you’re on OpenBSD and this is biting you, I guess the fix would be to patch the port to push it up to at least 1.8.1 which has this fixed upstream. Or, you can just build it yourself and use that version instead.

      Has Rachel submitted a patch to the ports? Seems like she hasn’t, and I don’t blame her. The OpenBSD project sets a high bar for contribution, in terms of tolerance towards user-hostile tooling. Contributing to open source can be far more tiresome than fixing it for yourself, and I found OpenBSD even more taxing than other projects.

      It makes me sad as this phenomenon is one of the reasons classical open source made by the people for the people is dying, and large companies take over opensource with their PR-open sourced projects.

      1. 1

        Has Rachel submitted a patch to the ports? Seems like she hasn’t, and I don’t blame her.

        Don’t hold your breath. (That said, I’d happily contribute to most projects, and OpenBSD’s process would require significantly more effort on my part.)

        1. 1

          Asking you and the parent comment.

          What is it about the OpenBSD process that you feel makes it so hard?

          It’s a bit harder than sending a PR on GitHub. And the quality expectations are high, so you need to read the docs and get an understanding of the process.

          But when I contributed some things to OpenBSD ports (updating versions in an existing port and its deps) I found everyone I interacted with to be very helpful, even when I was making dumb mistakes.

          1. 2
            • no easily searchable bug database with publicly available status discussion to know if anybody is working on it, what work and maybe dead-ends were hit. No, a mailing list is not a proper substitute for this.
            • everything is done in email with arcane formatting requirements.
            • the whole tooling is arcane to contemporary users. (CVS, specifically formatted email, etc)

            I have done my BSD contributions in the past when I had more time and willingness to go the extra mile for the sake of others. I no longer wish to use painful tools and workflows for the sake of others’ resistance to change. It is an extra burden.

            Don’t get me wrong, this is not only about OpenBSD. The same goes for Fedora, for example: they have their own arcane tooling and processes, and so do lots of other projects. They have tools and documentation, but lots of the docs are outdated. For those who aren’t constantly on the treadmill, that means a lot of extra research and work, “helped” along by outdated docs, and it is a giant hill to climb just to publish a version-number update and re-run the build script.

            1. 1

              Thanks, this is a good answer.

              It was nice to see Debian move to a Gitlab instance, and nice to see FreeBSD is finally moving to Git.

              But I suspect not much is going to change with OpenBSD, though maybe Got will improve things at some point.

        2. 1

          Oh. Now this is a totally different reason from what I was thinking about. (I personally don’t agree with her on this one. Still I don’t blame her, even if her different “political”/cultural stance would be her sole reason. People must accept that this is also a freedom of open source users.)

          Recently I have made up my mind to once again contribute more than I did in the past few years, and while my PRs were accepted, some still didn’t make it to a release, and the project has no testing release branch (which I also understand for a tiny project), thus compiling your own fork makes sense even this way. And this way contributing stuff often gets left behind in the daily grind. On the other hand some other tiny contributions were accepted with such warmth and so quick response time that it felt really good.

    16. 52

      Over the past few years of my career, I was responsible for over $20M/year in physical infra spend. Colocation, network backbone, etc. And then 2 companies that were 100% cloud with over $20M/year in spend.

      When I was doing the physical infra, my team was managing roughly 75 racks of servers in 4 US datacenters, 2 on each coast, and an N+2 network backbone connecting them together. That roughly $20M/year counts both OpEx and CapEx, but not engineering costs. I haven’t done this in about 3 years, but for 6+ years in a row, I’d model out the physical infra costs vs AWS prices, at 3 year reserved pricing. Our infra always came out about 40% cheaper than buying from AWS for as apples to apples as I could get. Now I would model this with savings plan, and probably bake in some of what I know about the discounts you can get when you’re willing to sign a multi-year commit.
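
      A minimal sketch of that kind of comparison, with made-up numbers: only the "about 40% cheaper" relationship comes from the paragraph above, and the modelled cloud figure and the function name are hypothetical.

```python
def relative_savings(own_annual_cost: float, cloud_annual_cost: float) -> float:
    """Fraction by which owned infra is cheaper than the modelled cloud bill."""
    if cloud_annual_cost <= 0:
        raise ValueError("cloud cost must be positive")
    return 1.0 - own_annual_cost / cloud_annual_cost


# Hypothetical inputs: owned infra (OpEx + CapEx) vs. a modelled cloud bill
# for the same capacity at reserved pricing. Neither is a real figure.
own = 20_000_000
cloud = 33_000_000
savings = relative_savings(own, cloud)
print(f"owned infra is {savings:.0%} cheaper")
```

      The hard part in practice is making the two inputs apples to apples, not the arithmetic.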

      That said, cost is not the only factor. Now bear in mind, my perspective is not 1 server, or 1 instance. It’s single-digit thousands. But here are a few tradeoffs to consider:

      1. Do you have the staff / skillset to manage physical datacenters and a network? In my experience you don’t need a huge team to be successful at this. I think I could do the above $20M/year, 75 rack scale, with 4-8 of the right people. Maybe even less. But you do have to be able to hire and retain those people. We also ended up having 1-2 people who did nothing but vendor management and logistics.

      2. Is your workload predictable? This is a key consideration. If you have a steady or highly predictable workload, owning your own equipment is almost always more cost-effective, even when considering that 4-8 person team you need to operate it at the scale I’ve done it at. But if you need new servers in a hurry, well, you basically can’t get them. It takes 6-8 weeks to get a rack built and then you have to have it shipped, installed, bolted down etc. All this takes scheduling and logistics. So you have to do substantial planning. That said, these days I also regularly run into issues where the big 3 cloud providers don’t have the gear either, and we have to work directly with them for capacity planning. So this problem doesn’t go away completely; once your scale is substantial enough it gets worse again, even with Cloud.

      If your workload is NOT predictable, or you have crazy fast growth, deploying mostly or all cloud can make a lot of sense. Your tradeoff is you pay more, but you get a lot of agility for the privilege.

      3. Network costs are absolutely egregious on the cloud. Especially AWS. I’m not talking about a 2x, or 10x, markup. By my last estimate, AWS marks up egress by roughly 200-300x their own costs! This is based on my estimates of what it would take to buy the network transit and routers/switches you’d need to egress a handful of Gbps. I’m sure this is an intentional lock-in strategy on their part. That said, I have heard rumors of quite deep discounts on the network if you spend enough $$$. We’re talking three-digit-million multi-year commits to get the really good discounts.

      4. My final point, and a major downside of cloud deployments combined with a Service Ownership / DevOps model, is that you can see your cloud costs grow to insane levels due to simple waste. Many engineering teams just don’t think about the costs. The Cloud makes lots of things seem “free” from a friction standpoint. So it’s very very easy to have a ton of resources running, racking up the bill. And then a lot of work to claw that back. You either need a set of gatekeepers, which I don’t love, because that ends up looking like an Ops team, or you have to build a team to build cost visibility and attribution.
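
      As a rough sketch of what "cost visibility and attribution" can look like at its simplest: group billing line items by an owner tag and surface the untagged spend. The record layout (`tags`, `team`, `cost_usd`) and the sample figures are hypothetical, not any provider's real billing export format.

```python
from collections import defaultdict


def attribute_costs(line_items):
    """Sum cost per owning team; untagged spend lands in 'UNATTRIBUTED'."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team") or "UNATTRIBUTED"
        totals[team] += item["cost_usd"]
    # Largest spenders first, so waste is easy to spot.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample data.
items = [
    {"resource": "vm-1", "cost_usd": 1200.0, "tags": {"team": "payments"}},
    {"resource": "vm-2", "cost_usd": 300.0, "tags": {"team": "search"}},
    {"resource": "disk-9", "cost_usd": 450.0, "tags": {}},
    {"resource": "vm-3", "cost_usd": 800.0, "tags": {"team": "payments"}},
]

for team, cost in attribute_costs(items):
    print(f"{team:>14}  ${cost:,.2f}")
```

      The "UNATTRIBUTED" bucket is usually the first thing worth chasing, since nobody is on the hook for it.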

      On the physical infra side, people are forced to plan, forced to come ask for servers. And when the next set of racks aren’t arriving for 6 weeks, they have to get creative and find ways to squeeze more performance out of their existing applications. This can lead to more efficient use of infra. In the cloud world, just turn up more instances, and move on. The bill doesn’t come until next month.

      Lots of other thoughts in this area, but this got long already.

      As an aside, for my personal projects, I mostly do OVH dedicated servers. Cheap and they work well. Though their management console leaves much to be desired.

    17. 2

      The thing about no code of conduct being a benefit seems to come up somewhat regularly. I even see it show up on the OpenBSD lists. But this is really just a function of community size. A small enough community can be self governing with implicit social norms.

      But once it gets large enough, the possibility rises that you’ll have too many bad actors, so you need to start making the norms explicit. This is why you see a code of conduct in FreeBSD: the community is larger.

      1. 7

        Codes of Conduct aren’t exhaustive lists of what is allowed, and not even exhaustive lists of what is not allowed. They provide a bunch of guidelines but in the end, they need to be filled with life through enforcement action that usually goes beyond the scope of what is written down (and that’s where the bickering about CoCs starts: is any given activity part of one of the forbidden actions or not?) - which makes the actual social norms implicit again.

        The main signal a CoC provides is that the community is willing to enforce some kind of standard, which is a useful signal. There are communities that explicitly avoid any kind of enforcement, and there are communities that demonstrate that willingness through means other than CoCs.

        1. 5

          I don’t automatically assume that a community without a CoC is not willing to enforce a minimal standard of decency. If I were to insult a maintainer, a co-contributor or bug-reporter, I wouldn’t be surprised to experience repercussions. Do others assume that because there’s no formal document, that you can just say whatever you want?

          Either way, it’s off-topic.

        2. 4

          Not being exhaustive is actually what is great about Codes of Conduct. One of the interesting things about moderating online communities is that the more specific and defined your rules for participation are, the more room bad actors have to argue with you and cause trouble.

          If the rules for your website are extremely specific, bad actors will try to poke holes in that logic, find loopholes, and generally argue the details of the rules. However, if your rule for participation is simply “don’t be an asshole”, then you have a lot more room as a moderator to deal with bad actors without getting into the weeds about the specifics.

          The Tildes Code of Conduct is really great for moderating an online community, because it’s simple and vague enough for almost everyone to understand, but does not leave any footing for bad actors to try to argue that they didn’t technically break the rules.

          I think Codes of Conduct are great, and honestly, most of the people I encounter who are against them tend to be… not pleasant to collaborate with.

          Regarding bickering about forbidden actions:

          Shut it down. If you are a moderator or maintainer and someone breaks the rules, ban them. If someone causes a stink about it, warn them, and then ban them too if necessary.

          I think online communities, especially large online communities, seem to be afflicted with this idea that people on the Internet have a right to be heard and to participate. That isn’t true. Operators of these communities are not and should not be beholden to anyone. If someone continuously makes the experience worse for others and refuses to do better, ban them and be done with it.

        3. 2

          From a POSIWID perspective, the things I have observed lead me to conclude that the purpose of CoCs (in business, opensource, and other community organisations) is to install additional levers that may only be operated by politically-powerful people, and provide little-to-no protection for the people they claim to protect. I have seen people booted from projects despite admission by the admins that no CoC violation occurred, and I have seen people close ranks around politically-powerful people who remain protected despite violating organisation/project/event CoCs.

      2. 3

        You seem to be equating a code of conduct with a willingness to ban bad actors. I think that’s a false equivalence.

        1. 2

          That was not my point. My point was that the need for a code of conduct is often due to community size. Smaller communities can be more self policing based on implicit norms. They certainly can and do ban or drive off bad actors.

    18. 20

      TL;DR didn’t sanitize usernames which could contain “-” making them parse as options to the authentication program. Exploiting this, username “-schallenge:passwd” allowed silent auth bypass because the passwd backend doesn’t require a challenge.
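
      A minimal sketch of this class of bug, not OpenBSD's actual code: a front end puts the username straight into the argv of a helper program, so a leading "-" turns the name into an option. The `auth_helper` name, its `-s` option, and the validation rule are all assumptions for illustration; argparse stands in for the helper's own option parsing.

```python
import argparse


def build_helper_argv(username: str) -> list[str]:
    """Vulnerable: the username sits where options are still being parsed."""
    return ["auth_helper", username]


def build_helper_argv_safe(username: str) -> list[str]:
    """Reject option-like names, and end option parsing with '--' anyway."""
    if not username or username.startswith("-"):
        raise ValueError("invalid username")
    return ["auth_helper", "--", username]


def parse_helper_args(argv: list[str]) -> argparse.Namespace:
    """Stand-in for the (hypothetical) helper's command-line parsing."""
    parser = argparse.ArgumentParser(prog="auth_helper")
    parser.add_argument("-s", dest="service", default="login")
    parser.add_argument("user", nargs="?")
    return parser.parse_args(argv[1:])


# The "username" is consumed as the -s option; no user is left over.
hijacked = parse_helper_args(build_helper_argv("-schallenge:passwd"))
print(hijacked.service, hijacked.user)

# With validation plus '--', an ordinary name stays a positional argument.
ok = parse_helper_args(build_helper_argv_safe("alice"))
print(ok.service, ok.user)
```

      Validating the name and passing "--" are both cheap, so doing both is a reasonable belt-and-braces default whenever untrusted input ends up in another program's argv.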

      Awesome find, great turnaround from Theo.

      1. 2


        It’s a modern marvel that people end up using web frameworks with automatic user data parsing and escaping for their websites, because if not so many places would have these kinds of “game over” scenarios.

        1. 5

          Usernames in web applications are not easy, nor is there wide awareness of the problems or deployment of solutions.

          If you’re interested in learning more, I’ve gone on about this at some length.

      2. 1

        If memory serves right, there was an old login bug (circa ’99) that was the same sort of thing:


        Too slow I guess :)

      3. 1

        Is this specific to OpenWall users or is it applicable to OpenBSD in general?
        From the title it looks like an authentication vulnerability in the OpenBSD core OS.

        1. 1

          This is OpenBSD in general.

    19. 6

      I really liked this. Particularly the points about maintenance being just as important as building something new. It’s nice to see their philosophy articulated and how the Neovim team has put it in action. I loved the callout about fixing an issue being an O(1) cost while the impact is O(N*M) over all the users it reaches.

      Very much looking forward to seeing their roadmap realized.