Pro tip: this applies to you if you’re a business too. Kubernetes is a problem as much as it is a solution.
Uptime comes from having more understanding of, and control over, the deployment environment, but Kubernetes takes that away. It attracts middle managers and CTOs because it looks like a silver bullet that doesn’t require getting your hands dirty, but in reality it introduces so much chaos and indirection into your stack that you end up worse off than before, and all the while you’re emptying your pockets for the experience.
Just run your shit on a computer like normal, it’ll work fine.
This is true, but let’s not forget that Kubernetes also has some benefits.
Self-healing. That’s what I miss the most with a pure NixOS deployment. If the VM goes down, it requires manual intervention to be restored. I haven’t seen good solutions proposed for that yet. Maybe uptimerobot triggering the CI when the host goes down is enough. Then the CI can run terraform apply or some other provisioning script (see the sketch just after this list).
Zero-downtime deployment. This is not super necessary for personal infrastructures but is quite important for production environments.
Per pod IP. It’s quite nice not to have to worry about port clashes between services. I think this can be solved by using IPv6 as each host automatically gets a range of IPs to play with.
Auto-scaling. Again, not super necessary for personal infrastructure, but it’s nice to be able to scale beyond one host and not have to worry about which host a given service lives on.
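A minimal sketch of the self-healing idea from the first bullet, assuming a health endpoint and a Terraform config that can recreate the host; the URL and the -auto-approve choice are mine, not anything from the thread:

    #!/usr/bin/env sh
    # hypothetical watchdog, run from CI or cron on a second machine:
    # if the host stops answering, re-run provisioning to recreate it
    if ! curl -fsS --max-time 10 https://example.com/healthz >/dev/null; then
      terraform apply -auto-approve
    fi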
Has anyone tried using Nomad for personal projects? It has self-healing, and with the raw exec driver one can run executables directly on NixOS without needing any containers. I have not tried it myself (yet), but would be keen to hear about others’ experiences.
I am experimenting with the HashiCorp stack while off for the holidays. I just brought up a Vagrant box (1 GB RAM) with Consul, Docker and Nomad running (no jobs yet) and the overhead looks okay:
              total        used        free      shared  buff/cache   available
Mem:          981Mi       225Mi       132Mi       0.0Ki       622Mi       604Mi
Swap:         1.9Gi       7.0Mi       1.9Gi
That’s probably too high to also fit Postgres, Traefik or Fabio, and a Rails app into it, but 2 GB will probably be plenty (I am kind of cheap, so the fewer resources the better).
I have a side project running in ‘prod’ using Docker (for Postgres and my Rails app) along with Caddy running as a systemd service, but it’s kind of a one-off machine, so I’d like to move towards something like Terraform (next up on the list to get running) for bring-up, and Nomad for the reasons you want something like that.
But… the question that keeps running through the back of my head: do I even need Nomad/Docker? For a prod env? Yes, it’s probably worth the extra complexity and overhead, but for personal stuff? Probably not… Netlify, Heroku, etc. are pretty easy and offer free tiers.
I was thinking about doing this but I haven’t done due diligence on it yet. Mostly because I only have 2 droplets right now and nobody depends on what’s running on them.
If you’re willing to go the Amazon route, EC2 has offered most of that for years. Rather than using the container as an abstraction, treat the VM as a container: run one main process per VM. And you then get autoscaling, zero downtime deploys, self-healing, and per-VM IPs.
TBH I think K8s is a step backwards for most orgs compared to just using cloud VMs, assuming you’re also running K8s in a cloud environment.
That’s a good point. And if you don’t care about uptime too much, autoscaling + spot instances is a pretty good fit.
The main downside is that a load balancer alone is already around $15/month, if I remember correctly. And the cost can explode quite quickly on AWS. It takes quite a bit of planning and effort to keep the cost super low.
IMO, Kubernetes’ main advantage isn’t that it “manages services”. From that POV, everything you say is 100% spot-on. It simply moves complexity around, rather than reducing it.
The reason I like Kubernetes is something entirely different: It more or less forces a new, more robust application design.
Of course, many people try to shoe-horn their legacy applications into Kubernetes (the author running git in K8s appears to be one example), and this just adds more pain.
Use K8s for the right reasons, and for the right applications, and I think it’s appropriate. It gets a lot of negative press for people who try to use it for “everything”, and wonder why it’s not the panacea they were expecting.
I disagree that k8s forces more robust application design; fewer moving parts are usually a strong indicator of reliability.
Additionally, I think k8s removes some of the pain of microservices–in the same way that a local anaesthetic makes it easier to keep your hand in boiling water–that would normally help people reconsider their use.
And overhead. Those monster YAML files are absurd on so many levels.
Just run your shit on a computer like normal, it’ll work fine.
I think that’s an over-simplification. @zimbatm’s comment makes good points about self-healing and zero-downtime deployment. True, Kubernetes isn’t necessary for those things; an EC2 auto-scaling group would be another option. But one does need something more than just running a service on a single, fixed computer.
I respectfully disagree…worked at a place which made millions over a few years with a single comically overloaded DO droplet.
We eventually made it a little happier by moving to hosted services for Mongo and giving it a slightly beefier machine, but otherwise it was fine.
The single machine design made things a lot easier to reason about, fix, and made CI/CD simpler to implement as well.
Servers with the right provider can stay up pretty well.
I don’t see how your situation/solution negates the statement.
You’ve simply traded one “something” (Kubernetes) for another (“the right provider”, and all that entails–probably redundant power supplies, network connections, hot-swappable hard drives, etc, etc).
The complexity still exists, just at a different layer of abstraction. I’ll grant you that it does make reasoning about the application simpler, but it makes reasoning about the hardware platform, and peripheral concerns, much more complex. Of course that can be appropriate, but it isn’t always.
I’m also unsure how a company’s profit margin figures into a discussion about service architectures…
There is no engineering without dollar signs in the equation. The only reason we’re being paid to play with shiny computers is to deliver business value–and while I’m sure a lot of “engineers” are happy to ignore the profit-motive of their host, it is very unwise to do so.
I’ll grant you that it does make reasoning about the application simpler, but it makes reasoning about the hardware platform, and peripheral concerns, much more complex.
That engineering still has to be done, if you’re going to do it at all. If you decide to reason about it, do you want to be able to shell into a box and lay hands on it immediately, or hope that your k8s setup hasn’t lost its damn mind in addition to whatever could be wrong with the app?
You’ve simply traded one “something” (Kubernetes) for another (“the right provider”, and all that entails–probably redundant power supplies, network connections, hot-swappable hard drives, etc, etc).
The complexity of picking which hosting provider you want to use (ignoring colocation issues) is orders of magnitude less than learning and handling k8s. Hosting is basically a commodity at this point, and barring the occasional amazingly stupid thing among the common names, there’s a baseline of competency you can count on.
People have been sold this idea that hosting a simple server means racking it and all the craziness of datacenters and whatnot, and it’s just a ten spot and an ssh key and you’re like 50% of the way there. It isn’t rocket surgery.
I was one of the victims of the DDOS that hit Linode on Christmas day (edit: in 2015; didn’t mean to omit that). DO and Vultr haven’t had perfect uptime either. So I’d rather not rely on single, static server deployments any more than I have to.
can you share more details about this?
I’ve always been impressed by teams/companies maintaining a very small fleet of servers but I’ve never heard of any successful company running a single VM.
It was a boring little Ubuntu server if I recall correctly, I think like a 40USD general purpose instance. The second team had hacked together an impressive if somewhat janky system using the BEAM ecosystem, the first team had built the original platform in Meteor, both ran on the same box along with Mongo and supporting software. The system held under load (mostly, more about that in a second), and worked fine for its role in e-commerce stuff. S3 was used (as one does), and eventually as I said we moved to hosted options for database stuff…things that are worth paying for. Cloudflare for static assets, eventually.
What was the business environment?
Second CTO and fourth engineering team (when I was hired) had the mandate to ship some features and put out a bunch of fires. Third CTO and fifth engineering team (who were an amazing bunch and we’re still tight) shifted more to features and cleaning up technical debt. CEO (who grudgingly has my respect after other stupid things I’ve seen in other orgs) was very stingy about money, but also paid well. We were smart and well-compensated (well, basically) developers told to make do with little operational budget, and while the poor little server was pegged in the red for most of its brutish life, it wasn’t drowned in bullshit. CEO kept us super lean and focused on making the money funnel happy, and didn’t give a shit about technical features unless there was a dollar amount attached. This initially was vexing, but after a while the wisdom of the approach became apparent: we weathered changes in market conditions better without a bunch of outstanding bills, we had more independence from investors (for better or worse), and honestly the work was just a hell of a lot more interesting due in no small part to the limitations we worked under. This is key.
What problems did we have?
Support could be annoying, and I learned a lot about monitoring on that job during a week where the third CTO showed me how to set up Datadog and similar tooling to help figure out why we had intermittent outages–the eventual solution was a cronjob to kill off a bloated process before it became too poorly behaved and brought down the box. The thing is, though, we had a good enough customer success team that I don’t think we even lost that much revenue, possibly none. That week did literally have a day or two of us watching graphs and manually kicking over stuff just in time, which was a bit stressful, but I’d take a month of that over sitting in meetings and fighting matrix management to get something deployed with Jenkins onto a half-baked k8s platform and fighting with Prometheus and Grafana and all that other bullshit…as a purely random example, of course. >:|
The sore spots we had were basically just solved by moving particular resource-hungry things (the database, mainly) to hosted services–the real value of which was having nice tooling around backups and monitoring, and which moving to k8s or similar wouldn’t have helped with. And again, it was only after a few years of profitable growth that traffic hit a point where that migration even seemed reasonable.
I think we eventually moved off of the droplet and onto an Amazon EC2 instance to make storage tweaks easier, but we weren’t using them in any way different than we’d use any other barebones hosting provider.
Did that one instance ever go completely down (becoming unreachable due to a networking issue also counts), either due to an unforeseen problem or scheduled maintenance by the hosting provider? If so, did the company have a procedure for bringing a replacement online in a timely fashion? If not, then I’d say you all just got very lucky.
Yes, and yes–the restart procedure became a lot simpler once we’d switched over to EC2 and had a hot spare available…but again, nothing terribly complicated and we had runbooks for everything because of the team dynamics (notice the five generations of engineering teams over the course of about as many years?). As a bonus, in the final generation I was around for we were able to hire a bunch of juniors and actually teach them enough to level them up.
About this “got very lucky” part…
I’ve worked on systems that had to have all of the 9s (healthcare). I’ve worked on systems, like this one, that frankly had a pretty normal (9-5, M-F) operating window. Most developers I know are a little too precious about downtime: nobody’s gonna die if they can’t get to their stupid online app, and most customers–if you’re delivering value at a price point they need and you aren’t specifically competing on reliability–will put up with inconvenience if your customer success people treat them well.
Everybody is scared that their stupid Uber-for-birdwatching or whatever app might be down for a whole hour once a month. Who the fuck cares? Most of these apps aren’t even monetizing their users properly (notice I didn’t say customers), so the odd duck that gets left in the lurch gets a hug and a coupon and you know what–the world keeps turning!
Ours is meant to be a boring profession with simple tools and innovation tokens spent wisely on real business problems–and if there aren’t real business problems, they should be spent making developers’ lives easier and lowering business costs. I have yet to see k8s deliver on any of this for systems that don’t require lots of servers.
(Oh, and speaking of…is it cheaper to fuck around with k8s and all of that, or just to pay Heroku to do it all for you? People are positively baffling in what they decide to spend money on.)
eventual solution was a cronjob to kill off a bloated process before it became too poorly behaved and brought down the box … That week did literally have a day or two of us watching graphs and manually kicking over stuff just in time, which was a bit stressful,…
It sounds like you were acting like human OOM killers, or more generally speaking manual resource limiters of those badly-behaved processes. Would it be fair to say that sort of thing would be done today by systemd through its cgroups resource management functionality?
We probably could’ve solved it through systemd with Limit* settings–we had that available at the time. For us, we had some other things (features on fire, some other stuff) that took priority, so just leaving a dashboard open and checking it every hour or two wasn’t too bad until somebody had the spare cycles to do the full fix.
I ended up doing this myself too. Went the whole route of having a cluster of k8s nodes on spare hardware, deployed EVERYTHING to it but ended up spending way too much time just getting stuff working, most notably having to pin deployments to boxes so port-forwarding would work. Also having a PVC provisioner is not fun.
It was a huge pain, but a good learning experience. The one thing I sort of miss about it was having an nginx ingress, so one could just create an ingress resource and the ingress would route traffic to the pods. Caddy handles that now though.
I now use vultr for everything as well and am satisfied. It kind of goes to show that k8s really isn’t necessary even for container workloads. I’ve replaced my k8s deployments with podman run commands; from there you can generate a systemd unit file with podman generate systemd <containername>, and then you have containers starting like services! It works great!
Thanks for mentioning that, I didn’t know about it. That’s really good to know!
Absolutely! I recommend the --new and --name flags as well. That way it runs a new container every time and the service is the name you give it rather than the container SHA.
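Roughly, the flow being described looks like this; a sketch with placeholder names and image, run as root (or adapted to user units):

    podman run -d --name mysite -p 8080:8080 docker.io/library/caddy:latest
    podman generate systemd --new --name mysite > /etc/systemd/system/container-mysite.service
    systemctl daemon-reload
    systemctl enable --now container-mysite.service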
Have you looked at systemd-nspawn?
I have - but you have to jump through hoops to run OCI images: e.g. using skopeo to copy the image down, extracting it, then spawning the chroot, setting firewall rules to forward traffic from the host into the container, etc. Podman just does it all for you.
Maybe I’m just lazy but the UX is nicer.
Forgive my ignorance, but it seems like just a boring droplet with some nginx would solve most of the problems here. What’s the driver for all that power?
EDIT:
Like, I spend 20 USD/month at prgmr and have been quite happy with it for Gitea plus all kinds of other weird things.
I needed to learn Kubernetes for work and I decided to do so by moving a bunch of my own infrastructure to it. That’s the whole reason, really. It’s really expensive in retrospect, and I’m looking at going back to a simpler setup as a result.
I remember when you started learning and I asked why not just Ansible?
Now I see you regretting your choice ~1.5 years later. So I ask again: why not just Ansible? :D
I’ve been using Ansible to manage my personal infra for the same amount of time (I started learning Ansible when you started K8s) and love it.
TL;DR: Ansible describes how to get to your desired system state. NixOS describes your desired system state.
AFAIK Ansible also describes the desired system state. E.g.
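(The inline example didn’t survive formatting; as a stand-in, here’s a desired-state one-liner using Ansible’s service module against a hypothetical “web” inventory group:)

    ansible web -b -m service -a "name=nginx state=started enabled=yes"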
Something like that.
That describes the fact that you want the service started, not the fact that the service has a bunch of configuration for it. See here for an example: https://github.com/Xe/nixos-configs/blob/master/common/services/xesite.nix
My personal answer to “why not ansible” would have been: it is slow, very tedious in some aspects (running locally, obeying your ~/.ssh/config or .netrc) and you need to write yaml. Personally I have moved to pyinfra which fixes all of these and more.
I learned Kubernetes the same way. Although I hosted it on Google. I’m still paying for that cluster, far more than it’s worth, simply because I can’t be bothered to migrate to something else (a $5/mo droplet or equivalent would do the trick).
That said, using a hosted K8s solution I think makes a lot more sense for a small project–although it also potentially increases the cost significantly for a hobby project.
I guess IMO, K8s probably isn’t the right tool for most hobby projects (outside of learning K8s). For small teams, hosted K8s is often a good choice.
That totally makes sense! I just didn’t know if there was some other technological forcing function, given the relative expense of some of the options in your table. Also, what’s the hacks column about?
Hacks needed to install NixOS.
Ah cool.
Prgmr actually has a NixOS image you can start with, though I think it’s a couple versions back.
I just wanted to mention https://www.hetzner.com/cloud which is also quite cheap, and can boot on a NixOS ISO as well. <3 Hetzner
Is there a way to install NixOS on a Hetzner cloud box using the ISO fully automated with Terraform or Ansible? Everything I’ve read about it involves manual steps :/
Not with the ISO but the nixos-infect approach works quite nicely. Here is an example from my own Terraform:
Thanks!!!
Do you then do provisioning with Ansible, or is the next step Terraform, too?
Still Terraform, using https://github.com/tweag/terraform-nixos/tree/master/deploy_nixos that I wrote a while ago.
I generally try to keep everything in Terraform as much as possible.
thanks, that looks great!
Recently built my personal Kubernetes cluster but with some “tricks” to make it easier to reason about and maintain.
Using k3s eases things up because running a new node is a single command. The memory usage is also lower, and it comes with Helm and a Traefik ingress included.
Thanks to its simple installation process, there’s no Terraform code or similar, because I can just run the command myself.
Configuration is written as Jsonnet scripts, with a function “App” that creates the typical setup of deployment, service, ingress and certificate given an image, a domain and the needed environment variables. This is the configuration.
It’s only one Hetzner machine; it’s easier and more reasonable for my workloads to just use a bigger machine instead of multiple nodes. (And you could ask why in hell I’m using kube for a single node? Well, it’s simpler than a bunch of bash scripts glued together, and it has cool stuff like zero-downtime deployments included.)
The only other thing on the machine is a PostgreSQL database that serves as the only state storage for the apps, so I limit myself to apps that need nothing other than a Postgres. Maybe in the future I’ll add some disk storage in the cluster, or something like a self-hosted S3 API.
Managing the cluster (kubectl and so on) is done by running kubectl proxy on the server and opening an SSH tunnel to a local port. No need to deal with certificates, and it’s done with a simple script.
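That “simple script” presumably looks something like the following sketch; the port and hostname are assumptions on my part:

    # on the server: serve the API locally over the proxy (no client certs needed)
    kubectl proxy --port=8001 &
    # on the laptop: tunnel that port home and point kubectl at it
    ssh -N -L 8001:127.0.0.1:8001 me@my-hetzner-box &
    kubectl --server=http://127.0.0.1:8001 get pods --all-namespaces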
It’s honestly the simplest approach I could find to have something like my “personal cloud”.
Edit: Added a link
Thanks for showing your config and including all shell scripts in the repo!
It would be really nice if DigitalOcean let you upload arbitrary ISO files and go from there, but that is apparently not the world we live in.
My cloud VM is a NixOS box on DigitalOcean. I can dig up the details of how that works if you want @cadey. I build a NixOS VM image with some config stuff for DO, upload the image, and run that.
I have a NixOS server on DigitalOcean. nixos-infect worked wonderfully; I just copied my configuration files on and up it went.
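For reference, nixos-infect is normally just a one-liner run on the freshly created droplet (channel pinned to taste):

    curl https://raw.githubusercontent.com/elitak/nixos-infect/master/nixos-infect | NIX_CHANNEL=nixos-20.09 bash -x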
I went for yet another option and decided to colo my older Intel i3 NUC with 16 GB of memory and 128 GB + 1 TB of storage. I pay 70 euro per 6 months and get 2 IPv4 addresses and the standard /64 for IPv6. The included transfer volume is so high I don’t think I’ve even managed to use 5% of it.
This was, to be honest, the cheapest solution I could find, and I can always request my hardware back; I think the fee for that is 20 euro. I can also pay a one-time fee of 25 euro for an access card so I can visit my NUC myself and swap hardware or reboot it. The cost of the server is highly tied to the amount of electricity my hardware uses, so I could lower the cost (or get more for the same money) by upgrading to more powerful hardware that draws the same amount of energy.
You’d be surprised how many companies let you colo small devices like NUCs or even a Raspberry Pi for a very low cost. It’s a route not many people take, but it works great for me and might be worth a look.
How did you find out about the colocation?
I was talking to colleagues about 1.5 years ago and complained about the lack of good-quality VPS providers at a low price. One of them said something like “it’s a shame you can’t put your NUC in a datacenter”, and I remembered that some people do exactly that. I looked for colo providers, mailed 2 that offered NAS hosting because it matched most closely, and both replied they had no issue with putting a NUC in and gave me a price.
I could drop the NUC off personally or send it to them via registered & insured mail. I chose the latter, and 2 days later it was installed and up and running. It went extremely well, and I had 3 minutes of downtime during the 1.5 years, due to planned maintenance when they upgraded the DDoS protection service.
What’s the volume and where do you find them?
I think the included transfer volume is something like 5 TB a month. Like I said, that’s more than enough for me.
I just did a search for colo providers and picked a couple that also offered NAS hosting. Since my NUC and a Synology NAS are about the same size (physically), I mailed them and both said it was no problem and gave me a price. I actually picked the more expensive one, because they offered extras such as the option to go and look at my hardware myself.
How do you handle data backups?
I cross-transfer between my home NUC and my colo NUC via Syncthing. It’s probably not the best or most glorious solution, but it works like a charm for me. I don’t generate a lot of data, to be honest, so I wasn’t looking for an advanced solution.
Looking at your list of services… you could probably run all of those on a single 1 GB / $5/month instance and dockerize or even just systemd them. And still underutilize that host.
I’m gonna systemd them yeah.
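For anyone who hasn’t done it, “just systemd them” is roughly this much work per service; a sketch with placeholder paths and names:

    cat > /etc/systemd/system/mysite.service <<'EOF'
    [Unit]
    Description=mysite
    After=network.target

    [Service]
    User=mysite
    ExecStart=/srv/mysite/bin/server
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload
    systemctl enable --now mysite.service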
You could runit them on void linux musl and have room for a minecraft server!
This is exactly my experience. The state of the Kubernetes-management ecosystem is awful. At risk of spoiling my upcoming blog post on how to make it less bad… Instead of YAML, I write JSONnet whenever possible (like here). JSONnet is a pure superset of JSON with variables, functions, and other conveniences.
Pretty soon I want to remove all the YAML from that repository with some clever scripts, e.g. compiling values.jsonnet -> values.yaml for Helm.
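Since JSON is itself valid YAML, that compilation step can stay tiny; a sketch with assumed file and release names:

    jsonnet values.jsonnet > values.json
    helm upgrade --install myrelease ./chart -f values.json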
I’m sorry but I’ve been screwed by so many “better than yaml” tools that I just want to remove the entire yaml everything from the equation.
“YAML is a HORRIBLE format for configuration”
I completely agree! Although I’d go one step further and say simply “YAML is a HORRIBLE format”.
A colleague of mine made a good point about YAML: many of the projects that abuse it probably should never have used it. On the other hand, it’s way better for static data like test fixtures. But in those cases, you probably aren’t using nearly as much of the YAML spec as you would be for configuration.
I’ve been tossing up similar thinking in the last few months; my personal stuff is all on the last iteration of me learning infra stuff, which at the time was SmartOS-powered zones configured by Puppet. Spent a couple of days trying to get my head round the benefits of Kubernetes (partly due to exploring this space at work too), and decided it sounds like a lot of work to get running. (Especially if you’re not going “Hey $provider, just run it for me” for $reasons.)
Currently rebuilding the home stuff using Consul/Nomad/Traefik/Vault, which appears to bring most of the benefits I’d get from Kubernetes with a fraction of the complexity.
This post hits home hard. I set up a personal K8s cluster for exactly the same reasons, to learn by doing.
However, maintenance became a pain point when I had to deploy StatefulSet apps, and K3s on ARM didn’t really have Longhorn support, which was a deal-breaker for my cluster setup.
I gradually shifted it to a single-node $10 DO instance, managing all services in Docker containers, with configs/DNS handled via Terraform. The most beautiful part of the stack is the automatic SSL from Caddy; it just works out of the box.
I’m planning to revamp the docs and add a module to help me with deduplicating the configs, but if anyone’s interested: https://github.com/mr-karan/hydra/tree/master/floyd/terraform
Would managed Kubernetes from Digital Ocean or AWS be another option? I know it doesn’t solve the YAML problems but would probably save time and cost?
I am using managed kubernetes from Digital Ocean at the moment. It’s a money pit for what I need to do.
Ah, good to know. I am currently implementing a service using docker-compose and had thought about learning k8s and deploying it to a managed service, though it won’t see a huge amount of traffic to begin with (or maybe ever). That’s one more point to the “worry about it later” side. I will probably just deploy it with docker-compose on an EC2 instance.
Note that while DigitalOcean doesn’t have the ability to ingest an ISO, it does allow you to import raw disk images you’ve made yourself as a custom image – either uploaded through the web interface or passed to their API as a URL to fetch. I use this pretty regularly to import OmniOS images, which DO don’t provide themselves. You just need a metadata agent to configure the networking and SSH keys and such; I hear cloud-init works well.
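The URL-fetch variant is a single API call; something like this, with the token, name and image URL as placeholders:

    curl -X POST "https://api.digitalocean.com/v2/images" \
      -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"name":"omnios","url":"https://example.com/omnios.raw.bz2","region":"nyc3","distribution":"Unknown"}'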
I’ve also stood up my own disk images on SoYouStart by booting the FreeBSD rescue image on the machine and just using dd to drop an installed OS onto the SSD and then rebooting.
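The rescue-image trick is pleasingly low-tech; roughly this, with the image URL and target disk depending on the box:

    # from the rescue environment: overwrite the target disk with a prebuilt raw image
    curl -L https://example.com/my-installed-os.raw | dd of=/dev/sda bs=1M
    reboot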
Custom images don’t have IPv6 support. I’ve already gone this route :(
Ahh yes, fair enough. They also don’t support doing floating IPs on custom images even though it’d just be the metadata agent correctly configuring the anchor IP they use. If you hollow out an Ubuntu image by replacing the disk contents it totally all works, so it’s a pretty frustrating artificial limitation. I talked to their support people who either didn’t understand what I asked, or aren’t empowered to talk to their engineering group.
I’m very curious as to what put you in the mindset that you maybe somehow were?
NixOS, on the other hand, is a lot simpler overall and I would like to use it for running my services where I can.
I can attest to this as well (my “scale” being a few Pis), having recently had to deal with swapping from an x86 to an ARM device…which I thought would be more difficult?
but YAML is a HORRIBLE format for configuration.
ha, I just recently thought “wait, does this thing care about tabs?” does it? who knows.
Dhall, which you can think of as: JSON + functions + types + imports
…
yikes!
really cool!
Thanks for sharing! Learned lots from this post and really appreciate it.
Seems like a very sensible change!
I have been running NixOS on Hetzner and Vultr VPSes for the last two years. Both support booting from an ISO image. I manage the machines (as well as some home machines) with morph, which does the job admirably. It is a stateless (in contrast to NixOps) wrapper around nix-copy-closure with some perks such as health checks.
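For anyone who hasn’t seen morph, a deploy is a single command against a deployment expression; a minimal sketch with an assumed file name:

    morph build machines.nix
    morph deploy machines.nix switch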
NixOS works great for servers. The updates between major releases have been flawless so far, and having your full system configuration reproducibly in a git repo gives peace of mind.
Sure, it’s not a setup with auto-scaling, etc. But my personal and family use consists of: serving static sites that should be able to handle a few hits per second, S3-compatible blob storage using minio, and a couple of MoinMoin wikis. All of which can be done easily with 1-2 small VPSes.
I was (am?) on the Docker train, I did deploy two Docker Swarm clusters, but I never got around to Kubernetes. And at this point, I’m wondering (hoping?) whether I can just hold out until the next shiny thing comes along.
Docker is OK as a packaging format. I quite like the idea around layers. However, I can’t shake the feeling that as a runtime it’s a rather wasteful use of hardware. If you run a k8s cluster on Amazon, it’s virtualization upon virtualization (on top of whatever virtualization Amazon uses that we don’t see). This comes with a cost, both in managing the complexity and in use of hardware.
To top it off we have the hopelessly inefficient enterprise sector adding stuff like sidecar attachments for intrusion detection and deep packet inspection of these virtual networks.
I’m interested in trends that go the other way. Rust is cool, because with it comes a re-focus on efficient computing. Statically linked binaries would be a much simpler way of packaging than containers.
k8s/docker/etc. don’t need to be virtualized, that is one of their selling points. Dunno if that’s how AWS does it, though.
I did the same and landed on Hetzner Cloud + NixOS. I’m pretty happy with it!
I use krops to manage it. Here is a simple script for initial setup.
Have you tried Morph and NixOps? I’m currently planning on switching to Nix flakes for my systems, and am wondering what deployment tool will work best.