Excellent write-up! I’ve given a talk on a number of occasions about why Nomad is better than Kubernetes, as I too can’t think of any situations (other than the operator one you mention), where I think kubernetes is a better fit.
Hey, yes I’ve definitely seen your talk :D Thanks for the feedback!
Watched your talk and have some points of disagreement:
YAML is criticized extensively in the talk (with reason, it’s painful) as being an inherent part of Kubernetes, when in reality it’s optional, since you can use JSON too. And, most importantly, because you can use JSON in k8s definitions, anything that outputs JSON can work as a configuration language. You’re not tied to YAML in k8s, and the results you can get with something like Jsonnet are way superior to plain YAML here.
I don’t think that comparing k8s to Nomad is entirely fair, as they are tools designed for different purposes. Kubernetes is oriented towards fixing all the quirks of having networked cooperative systems. Nomad is far more generic and only solves the workload orchestration part of the equation. As you explained well in the talk, you have to provide the missing pieces yourself to make it work for your specific use case. In a similar (and intended) example, there are many minimalistic init systems for Linux that give you total freedom and recombination… but Systemd has its specific use cases in which it makes sense and just works. The UNIX philosophy isn’t a silver bullet; sometimes having multiple functionalities tied together in a single package makes sense for solving specific, common problems efficiently.
About the complexity of running a Kubernetes cluster: true, k8s as it is is a PITA to administer and it’s WAY better to externalize it to a cloud provider, but there are projects like the one mentioned in the article, k3s.io, that simplify the management a lot.
One thing we can agree on 100% is that neither Kubernetes nor Nomad should be the default tool for solving any problem, and that we should prefer solutions that are simpler and easier to reason about.
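As a minimal sketch of that JSON point: any small program that prints a valid manifest can stand in for YAML. The deployment name and image below are made-up placeholders, and the output can be piped straight to `kubectl apply -f -`.

```python
# Sketch: emit a k8s Deployment as JSON instead of hand-writing YAML.
# "demo-api" and the image tag are placeholder values.
import json

def deployment(name, image, replicas=2):
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }

if __name__ == "__main__":
    # e.g. python gen_deploy.py | kubectl apply -f -
    print(json.dumps(deployment("demo-api", "nginx:1.25"), indent=2))
```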
I think you accidentally dropped the link to the talk.
Fixed, thanks :)
Nomad is cool because it can work with technologies other than Docker containers. For example, Nomad can be used to orchestrate FreeBSD jails: https://papers.freebsd.org/2020/fosdem/pizzamig-orchestrating_jails_with_nomad_and_pot/
And its `exec` command is isolated with a chroot, which makes it super useful when migrating non-containerised workloads too.
Anyone got any experience with that? Seems like a nice way to run plain binaries without having to use Docker images (for example when the binaries are compiled with Go).
I used the java one and the exec one. They worked great, especially if you don’t require any special libraries already on the system.
We’ve been using the `java` driver in production for over 2 years now; we also use the `exec` driver for smaller tools, basically shell scripts to back up the Consul and Nomad databases.
I’ve used the exec driver in presentation demos, where I am running a cluster of Nomad VMs and I have a directory mounted to the host with the apps to run.
I could of course host a docker registry in the host, but it’s not worth the hassle; I’d rather have simpler demos with less to go wrong!
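For anyone curious what an exec-driver job looks like in practice, here is a rough sketch of submitting one to a local agent through Nomad’s HTTP API using its JSON job format. The job name, script path, and resource numbers are made up, it assumes a local dev agent without ACLs, and the field names are written from memory, so check them against the Nomad API docs before relying on them.

```python
# Sketch: submit a batch job that runs a plain script via the exec driver.
import json
import urllib.request

job = {
    "Job": {
        "ID": "backup-script",          # placeholder job name
        "Name": "backup-script",
        "Type": "batch",
        "Datacenters": ["dc1"],
        "TaskGroups": [
            {
                "Name": "backup",
                "Count": 1,
                "Tasks": [
                    {
                        "Name": "run-backup",
                        "Driver": "exec",  # chroot-isolated plain binary, no image needed
                        "Config": {
                            "command": "/usr/local/bin/backup.sh",  # placeholder path
                            "args": ["--target", "/srv/backups"],
                        },
                        "Resources": {"CPU": 100, "MemoryMB": 128},
                    }
                ],
            }
        ],
    }
}

# Register the job with the default local Nomad HTTP address.
req = urllib.request.Request(
    "http://localhost:4646/v1/jobs",
    data=json.dumps(job).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```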
Ah good, other people are blazing this trail too. Thanks for the writeup, it validates some thoughts in my head. I’m also rebuilding my homelab setup with Hashistack (mostly because Nomad) and Tailscale. 😃
When I was starting to look into orchestrators at work, @pondidum’s talk on Nomad really resonated with my initial thoughts on Kubernetes and led me to look at Nomad more seriously as an alternative. Massively useful, huge thanks for writing the talk.
Nice writeup, thanks! I’m also thinking of taking the Nomad route, as I would like to stay away from k8s as far as possible.
One question related to the VPN: is that for connecting to internal services when traveling, or do you use the VPN even at home? If the latter, what is the benefit? I’m trying to understand whether I need one if all my machines are at home.
Actually, Tailscale is a mesh network, so only the traffic to IP ranges under the Tailscale CIDR flows via Tailscale.
And yes, I use it at home too. My server is a DO droplet, so there’s practically no difference whether I’m at home or travelling. But even at home I have 2 RPi nodes, and I just prefer to host anything internal on Tailscale ranges (so that any new device gets access automatically, and I don’t have to fiddle with local IPs or local DNS resolvers). Makes the setup pretty clean :)
[Comment removed by author]
There are quite a few big enterprise businesses, which you and most everyone here likely rely on regularly, that are based entirely on Nomad.
9/10 times the error won’t be helpful. In my experience that’s not true. Something that bothers me about Kubernetes is that it’s rather easy to create silent problems. In fact, hunting those down in Kubernetes setups is something I do a lot, and it’s becoming a skill, but it certainly speaks against Kubernetes. Looking at a clearly non-functional system where everything lights up green can be frustrating. Don’t get me wrong, there were clear mistakes behind those incidents and they weren’t bugs in Kubernetes. The reality is still that this is a typical problem in Kubernetes setups and not as common in Nomad setups. There are various reasons for that.
I’ve had the chance to work with both Kubernetes and Nomad in big production setups. Both work. But due to the complexity you are far more likely to be the first one to hit a given issue on Kubernetes. Well, maybe other than Amazon or Google, but that doesn’t matter, because again, due to the complexities of not just Kubernetes itself but also Operators, once things fail they tend to fail badly.
Another reason for not recommending Kubernetes for big enterprise setups is stability. Kubernetes, for better or worse, develops rather quickly and has quite a large number of breaking changes. While in a small start-up that’s something you can deal with after a bit of work (and don’t underestimate that), in a bigger setting I have seen more than one company fall significantly behind on Kubernetes versions.
With the rise of operators I see that problem only growing.
Most of the mentioned problems don’t exist with Nomad. Sure, there are other things that Kubernetes does better, but based on the last few years of experience, if someone asked me for a recommendation given the current state of both Kubernetes and Nomad, I’d strongly suggest Nomad. Outside of cloud providers I also know of bigger Nomad setups than Kubernetes setups.
This is something that has changed, though, and wasn’t always so. While I have been interested in both for quite some time, Nomad used to have problems, which have since been fixed.
Also, please don’t get me wrong. I’m not arguing against the operator pattern as a whole. I think, however, that the current implementation has a severe problem: people add very custom forms of complexity, which at the very least makes onboarding and long-term maintenance a topic one should not underestimate.
Since Nomad is very thin compared to Kubernetes, I’m really curious about those errors that one will be the first to encounter. Did you have any such experience? I ran into Nomad bugs before, but I clearly wasn’t the first to encounter them, because they were usually already fixed in the upcoming release and migration was trivial.
I also ran into multiple Kubernetes bugs which were likewise fixed. However, there were breaking changes along the way.
These are, however, individual day-to-day problems, so they certainly don’t say anything about either piece of software.
[Comment removed by author]
Would love to hear some horror stories. Infrastructure in our company has gotten complex enough that I’m now facing the decision of either moving most of the services that can be easily containerized into a Kubernetes cluster or using Nomad to bring some consistency to the deployment and orchestration stories. I’ve started a migration to Kubernetes once before and the tools blew chunks, so Nomad’s simple concepts and seemingly straightforward conversion path seems very tempting.
I’d love to hear your take on running the Nomad controllers themselves and any general pitfalls that might help me make a decision (in general, Kubernetes’ most enticing feature is that I don’t have to run the cluster myself).
[Comment removed by author]
Reading that article ^^, would you have guessed GPU passthrough was a part of the Nomad (host) agent configuration?
Yes, and it sounds like a pretty reasonable guess, given that the document presents both the agent configuration and the example job spec making use of it. k8s GPU support also requires a driver deployed to all GPU nodes in addition to the pod spec using it, so your assumption seems like it might be based on someone else having already configured this in your k8s experience.
Given the nature of the horror stories (as such), other comments on this post, and the labeling of apparently any Nomad deployment as “bespoke” (almost satire given an environment without CRDs, operators, and dozens of vendor-supplied distributions), I’m struggling to see the technical side of your axe to grind (popularity/sales issues, difficulty with documentation, trouble hiring qualified ops).
[Comment removed by author]
Sorry, I don’t get how the second story is an inherent Nomad issue. Could you please elaborate? To me it seems that the customer clearly didn’t know what they wanted. The same problem would exist with any system, even if they had started with k8s and migrated to Nomad in the end.
Without knowing the details, #2 sounds more like a dumbass customer than a Nomad horror story.
Then again, if k8s can shield us from dumbass customers, it has much value indeed.
[Comment removed by author]
Can I hire folks with shared experience in my tech stack?
If my tech stack needs to be replicated (think Gitlab, ELK, PaaS private cloud offerings), could others do so easily?
This one hits home. For one, if I decide to go with Nomad I will end up doing most of the DevOps myself because nobody else will be familiar with it (at least someone else on the team is familiar with Kubernetes). Then there’s the requirement of having to launch regional clusters if/when we start working with European customers, or one of our customers decides to pay us a ton of money to do an on-premises installation.
Damn, this alone might be enough to tilt the scales in favor of Kubernetes, despite my complete hatred for managing state using YAML files.
But do keep going, I think it’ll be good for other people as well!
if I decide to go with Nomad I will end up doing most of the DevOps myself because nobody else will be familiar with it
K8s is big because everyone is jumping on the hype train without ever questioning whether they need a full-blown K8s cluster or not. Nomad has a minimal learning curve and is pretty easy to get started with. To put it differently: yes, you can find engineers who know how to deploy on K8s, but can you also find engineers skilled enough to debug an obscure issue in K8s under the hood and patch it? Because those engineers would be awesome anyway, regardless of Nomad/K8s.
It’s a red flag for me when any hiring company limits its choice of people to employ based on frameworks/tools. What’s cool today may not even exist tomorrow.
[Comment removed by author]
[Comment removed by author]
Did you also use k8s in production, and could you comment on that? I’m just wondering if both are horrible (everything is horrible), or just one.
[Comment removed by author]
[Comment removed by author]
Have you opened issues and discussed them with the maintainers? Because I’ve tried memory limits and they do work (on the OSS edition). The others aren’t descriptive enough for me to comment on. Which version of Nomad were you running, and were any of the issues you faced reported on their issue tracker?
[Comment removed by author]
[Comment removed by author]
The docs look outdated. Quoting from https://www.nomadproject.io/docs/commands/namespace:
Shall open an issue to fix the tutorial website.
Regarding Quotas on Namespace, yes that seems to be Enterprise only https://www.nomadproject.io/docs/commands/quota
But I believe the task will get OOM Killed if the memory usage exceeds the one defined in https://www.nomadproject.io/docs/job-specification/resources#memory-1
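If you want to test that claim quickly, a rough sketch: run something like the allocation loop below under a task whose resources stanza sets a small memory value (the 512 MB target and 16 MB chunk size are arbitrary). If the limit is enforced via cgroups, the task should be OOM-killed long before the loop finishes.

```python
# Rough test for memory-limit enforcement: run under a task with a small
# memory limit (e.g. 64 MB) and watch whether it gets OOM-killed.
import sys

TARGET_MB = 512   # deliberately far above the task's configured limit
CHUNK_MB = 16

chunks = []
for i in range(TARGET_MB // CHUNK_MB):
    # keep references so the memory is not freed between iterations
    chunks.append(bytearray(CHUNK_MB * 1024 * 1024))
    print(f"allocated ~{(i + 1) * CHUNK_MB} MB", flush=True)

print("reached the target without being killed", file=sys.stderr)
```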
Shall try this, thanks!
[Comment removed by author]
I have the exact opposite experience: memory limits are hard limits, but the CPU limit is the minimum you want to allocate to the task, and it can burst over that if there are spare resources.
Nomad in production since 2018 and still using it..
Yep. Same from my experience.
It won’t. OSS Nomad will not enforce cgroup memory limits.
Do you have a source for any of this? Were you maybe using the `raw_exec` driver rather than `exec`, or booting without `cgroup_enable=memory swapaccount=1` (also a thing for k8s)? There’s no task-driver resource-limit feature tied to Enterprise, and this is contrary to other folks’ experience, so the insistence is beginning to sound like FUD.
I have over two years’ worth of OOM-killed services in both staging and production. I don’t quite follow how my experience could be seen as FUD.
I was referring to the comment that limits don’t work. I very much agree with you and, thankfully rarely, see OOMs too w/ 5y+ in production
Oh, sorry. I thought it was the OP (delux), my bad. I was on my phone
I think they are 2 different things. Quotas apply to the entire Namespace. The individual memory limits of a task are still in the `resources` section. Shall confirm it anyway.