it might be useful to tap into the wider Kubernetes ecosystem, e.g. operators - if you want to run PostgreSQL, Redis, Cassandra, ElasticSearch, Kafka with limited human resources, it might be easier to do so via Kubernetes Operators (whether or not such operational complexity, even abstracted, is worth it with a limited team, is an entirely different discussion)
From personal experience, if you think this is why you want to use Kubernetes, think again. You still have to deal with issues in the actual software you want to run, should they arise, and you have to deal with the problems Kubernetes itself might throw at you. And now you add a whole new thing that touches both and is its own beast. The only people able to really deal with that are people with deep knowledge of all three.
Also, the idea of operators very much feels like workarounds for workarounds. Building an abstraction for an abstraction that abstracts the management of that abstraction hardly feels like good design, even if we say they are just better abstractions.
What I’ve seen at multiple companies now is that you eventually end up with a sort of operator stack that is one big customized setup for that one specific company. Talk about snowflakes and pets…
In other words, you should be very sure this is the right approach before you build your production services on top of it.
Not to say Nomad is without flaws, but it’s easier to decide what you want or need.
And since Nomad is simpler but has similar concepts, even if you end up switching over to Kubernetes later, the “lost” work time will in most situations be lower than the other way round.
This is all subjective, personal experience, and of course situations change. Both projects are developing rather quickly, so the points mentioned here may change as well.
In short: don’t choose Kubernetes just because there are operators.
The Kubernetes ecosystem is massive. There are entire companies, tools and whole niches being built around it (ArgoCD, Rook, Istio, etc.). In some cases tools exist only because Kubernetes is itself so complex: Helm, Kustomize, a bunch of web UIs and IDEs (Octant, Kubevious, Lens, etc.), and specialised tooling to get an overview of the state and security of your Kubernetes cluster (Sonobuoy, kube-hunter, kube-bench, armosec, pixie). Furthermore, there are literally hundreds of operators that allow abstracting the running of complex software within Kubernetes.
While this is true, I really wonder whether I am the only one who thinks that a lot of these are simply not great pieces of software. I don’t mean to pick on them, and having used some I really appreciate the effort, but honestly a lot of them are not nice to use in a productive manner and have very annoying rough edges. I don’t want to go into individual ones, but to give some examples: with operators you might get silent errors, which can be very creepy, especially when the configuration deviates slightly from the underlying software, or automation that fights you. With the UIs and IDEs you get the typical “smaller project” issues: logs that are hard to search, interfaces that are hard to adapt, information shown out of date, things named badly or confusingly, and so on. It’s the kind of problem you get when an IDE first adds support for something new, like it was with Git and other things a decade or so ago. When things are not polished they can at times be worse than not using them at all, and I switched back and forth a lot while using them.
Nomad and Consul have come a long way there over the last year as well. Their web interfaces used to be like that, but now they are starting to be quite nice to use. Certainly not perfect either, but they have actually made certain third-party tools obsolete.
In the end you still should know how to do stuff on the command line, no matter what you choose. It will come in handy.
I heartily agree with you about abstractions-on-abstractions. Most Operators are just ways to combine several Kubernetes-native components into a single, proprietary package. It’s like a Helm chart, but different, so that only the developers of the Operator really know what’s going on. In my experience, using those kinds of Operators is more or less a waste of time, since you either have to learn the Operator and all of its constructs, or you could just learn the Kubernetes constructs and how they interact with one another. I am firmly in the latter camp; I am also in the camp that self-hosts and does not use Kubernetes at home, because it’s not a good tool.
That being said, though, there is one place I can point to and give a two-thumbs-up recommendation for an Operator: where the Operator actually provides new functionality in the Kubernetes API, not just an alternative abstraction over a Helm chart. The Operator in question is cert-manager. It provides functionality that Kubernetes does not provide natively and that cannot reasonably be shoehorned into what it already provides. The new constructs map readily to a usable pattern that is easy to grok.
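To make that concrete, here is a minimal sketch of the two kinds of objects cert-manager adds, an issuer plus a certificate request. The issuer name, email, domain and secret names are placeholders, and the exact fields depend on the cert-manager version in use:

# A cluster-wide ACME issuer (e.g. Let's Encrypt); email and ingress class are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: ops@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
---
# A Certificate; cert-manager obtains and renews the certificate and keeps it in the named Secret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: default
spec:
  secretName: example-com-tls
  dnsNames:
    - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

The point being: ClusterIssuer and Certificate are genuinely new API objects with their own controller behind them, not a repackaging of Deployments and Services you could have written yourself.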
On the other hand, there is the RabbitMQ operator, which takes all of the functionality of a Helm chart and hides it in things that can’t be viewed without a lot of kubectl magic… There is a place for everything, and everything in its place. Use cert-manager. Avoid all other Operators unless there is a firm understanding of the additional abstraction layer each one necessitates.
IIUC, the promise of operators is that with just a simple API call, I can have, say, a database cluster that then maintains itself, replicates itself, backs itself up, recovers itself on a new node if something happens to the old master, etc. I already have that with AWS managed services like RDS. If a disaster happens while I’m asleep or on a plane (though the latter doesn’t happen much these days), I can be confident that the service will recover itself. Yet I doubt there are sysadmins at Amazon babysitting my specific AWS instance. That’s why it seems plausible, at least to me with my lack of expertise in this area, that a Kubernetes operator should be able to do the same thing.
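For illustration, the “simple API call” here is usually just creating a custom resource that the operator watches. The following is a hypothetical sketch; the API group, kind and field names are made up, though real Postgres operators accept something of roughly this shape:

# Hypothetical custom resource: the operator would watch objects of this kind and
# create/repair the StatefulSets, Services, backups and failover wiring behind it.
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3
  version: "15"
  storage: 100Gi
  backups:
    schedule: "0 3 * * *"    # daily base backup, retention handled by the operator
  failover:
    automatic: true

Whether the reconciliation loop behind that single object is as battle-tested as RDS is exactly what the replies below question.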
Yes. The difference is that if something ends up not working (which I guess is the reason DevOps, SREs, etc. exist), with Amazon you call support, whereas with an operator you hopefully have enough of an overview of its insides.
You also might end up fighting the automation. So you should still really know what you are doing and not assume it will just do everything for you.
Or, coming from a different angle: if everything worked as intended, none of that would be required. So I always wonder what happens when stuff breaks, because the operator is another thing that can break, and some of the bigger ones are pretty complex and have their own bugs. And since the situation you are starting from is a “disaster” you want to recover from, defaulting to the assumption that everything will go fine from then on might not be the best approach.
Of course there are different operators out there. This is not to say you cannot have a simple operator for a piece of software that will make your life easier. There are, however, also giant ones, and if you just install one, rely on it, and something stops working, it can make your life a lot harder and outages a lot bigger. So what I really mean is that you should know what it implies to download some operator with all these nice features, built by a team at some big corporation for an integral piece of software. If they have a problem in the operator, they will surely have someone capable of fixing the issue. The question is whether your team can do much more than file a bug report and hope it’s fixed soon. Don’t start building an understanding of it when the disaster is already happening.
Strongly second this. At work we use an in-house operator to maintain thousands of database clusters - but the operator is like a force multiplier or a bag of safe automations. It allows a small team to focus on the outlier cases, while the operator deals with the known hiccups.
The whole thing relies on the team understanding k8s, the databases and the operator. The second feature we built was a way to tell the operator to leave a cluster alone so a human could un-wedge it.
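An escape hatch like that is typically just a field or annotation the operator’s reconcile loop checks before acting. A hypothetical sketch (the annotation key is invented; every operator, in-house or public, spells its own):

# Hypothetical "hands off" marker: while it is set, the operator skips reconciliation
# for this cluster so a human can un-wedge it without the automation fighting back.
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
  annotations:
    databases.example.com/paused: "true"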
In my very limited experience with operators, they tend to be big old state machines that are hard to debug. It takes a very long time to get them working reliably and handle all the corner-cases.
Beware of wishful thinking. It’s plausible, but in reality we’re not there yet. Operators might help, but they’ll also have crazy bugs where they amplify problems, because some natural barrier has been erased and is now “simply an API call”. One classic example is such operators breaking pod colocation constraints, because it’s so easy to mess that up and miss the problem until an incident happens.
Sysadmins at Amazon are not babysitting your specific AWS instance, but how did they build something super reliable out of RDS? I can only imagine, but I think they started with the widest possible range of known failures so they could put failover mechanisms in place to fight them. Then they added monitoring on the health of each Postgres instance, on the service provided, and on the failover mechanisms themselves. Then they sat and waited for things to go red, fixed them, and refined this for years, at scale. The scale helps here because it shows problems faster. I’m 100% sure most Kubernetes operators are not built that way.
I am one of several k8s admins at work and I really hate k8s. In the past I’ve been at another shop as a developer where we used DC/OS (marathon/mesos) which I found a lot easier from a developer perspective, but my own experiments with it made me want to stab that terrible Java scheduler that ate resources for no damn reason. (K8S is written in Go and is considerably leaner as far as resources, but a much bigger beast when it comes to config/deployment).
I’ve dabbled with Nomad before and I do know some advertising startups that actively use it for all their apps/jobs. If I was getting into the startup space again, I’d probably look at using it.
K8S is a hot mess of insane garbage. When it’s configured and running smoothly, a good scheduler helps a lot when doing deployments and rolling/zero-downtime updates. But it tends to consume a lot of nodes, and it’s very difficult to go from 1 to 100 (having your simple proof of concept running on just one system and then scaling up to n while adding redundancy and masters). Some people talk about minikube or k3s, but they’re not true 0-to-scale systems.
I did a whole post on what I think about Docker and scheduling systems a few years back: https://battlepenguin.com/tech/my-love-hate-relationship-with-docker-and-container-orchestration-systems/
You should look at juju. It uses LXC/LXD clustering to avoid a lot of the shortcomings of k8s (which are many and varied). Maybe Nomad is better, but it’s all expressed in a language named after the founding company. This in and of itself is enough reason to squint really hard and ask “why?”.
Also: https://github.com/rollcat/judo It’s like ansible, but written in Go and only for the most basic of all basic kinds of provisioning.
I look at it this way. HCL is (from the README) “a toolkit for building config languages… inspired by libucl, nginx configuration, and others.” YAML is a pain to hand-edit when files get large (e.g. k8s). JSON is a pain too (no comments, for example; as an aside, why are we still using serialization formats for config files!?). TOML is… okay… but a bit strange when it comes to getting the structure right. HCL brings consistency (mostly) across HashiCorp’s own products, and being open source means others can adopt it as well.
My understanding is that you can use JSON anywhere HCL is accepted by the tools, so if you’re generating config out of some other system you can emit JSON and not have to emit HCL.
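As a rough sketch of what that looks like in practice, here is a minimal Nomad-style job in HCL with inline comments; the job name, image and resource numbers are placeholders rather than a verified job file, and the same structure could be emitted as JSON by a generator and submitted through the API:

# Comments like this are the main thing the JSON form cannot carry.
job "web" {
  datacenters = ["dc1"]

  group "app" {
    count = 2               # run two instances of the task group

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"
      }

      resources {
        cpu    = 200        # MHz
        memory = 128        # MB
      }
    }
  }
}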
I much prefer writing HCL[2] for configuring things; it’s a little clearer than YAML (certainly fewer footguns, no?) and it supports comments, unlike JSON.
It’s not the language itself that bothers me (it’s a little weird, and I would rather use a more universally accepted solution, but that’s my personal preference and I don’t impose it on anyone else); it’s that it is owned by a company known for taking products and making them closed and expensive. That is precisely what companies do, though, so it’s not too surprising. You can get an “enterprise” version of any product HashiCorp builds. The question remains: will HCL ever be forced into an “enterprise” category? Will it ever force users to accept a license they don’t agree with, or to pay to use it? YAML and JSON have the advantage of being community-built, so I doubt that will ever happen to them.
I realize now that I’m grandstanding here and proclaiming a requirement to use FOSS, which I don’t wholeheartedly agree with. I have no problem using proprietary software (I use several pieces every day, in fact). I’m just remaining a little squinty-eyed at HCL specifically. I don’t know that I could bring myself to choose HCL at my day job for things that do not inherently require it.
That brings me full circle back to my point: be careful, HCL is born from a commercial entity that may not always play nice. HashiCorp generally has in the past, but there are examples of companies with the best intentions not keeping their principles.
I would love to use Nomad, but the aggressive feature gating is a problem for me. I have to go through the sales pipeline and commit to an enterprise licence to use basic k8s features such as resource quotas and audit logging.

Which of the following is more painful to you:

It’s hard to split them out, they’re part of the same thing. Probably the sales pipeline is the worst bit.
I’m talking out loud here: Is “feature gating” a common word choice for charging for extra features? When I hear feature gating, I think of https://en.wikipedia.org/wiki/Feature_toggle

Ah, I see, here is an example of “feature gating” in such a context: https://growthhackers.com/questions/ask-gh-feature-gating-work
While I definitely agree with everything this article mentions, I’d go a bit further and say that if you can avoid adopting an orchestrator at all, you should. I’d rather use something like Heroku for as long as reasonably possible, for example.
Even if people still pick Kubernetes, I think articles like this are still very useful, as they help show that there is more than one way to deploy your applications, and maybe they will help people pick the best tool for the job (for varying definitions of best).
When we started working on a new project 2.5 years ago, I was certain we’d end up with GKE or another K8S provider. Instead, we’re still running on Heroku, since for our small team its benefits (simple deployment model, not too much ops work) outweigh the downsides (pricey compute resources, noisy neighbours).
I’ve noticed that every time I get the itch to try K8S or Nomad, I end up with a prototype that’s technically brilliant, but makes many things more complex.
Looking forward to trying GKE Autopilot, though!
I tried out Nomad but getting a shell on a job is an Enterprise feature? Seems over aggressive in terms of pricing.
No?
nomad alloc exec <allocation-id> bash
gives you a shell in one allocation of a job.
Just tried again, works! My bad. Think I was a victim of https://github.com/hashicorp/nomad/issues/4567
Still only works sometimes…
oh, well I stand corrected. good to know :)
I tried “exec” from the UI - is that not the same?
Ah, yes, product tiering!
You really don’t want to use Nomad in production.
Aggressive feature gating was mentioned. I also just found it bafflingly flaky. An experience we never had with any of our K8S clusters.
Would you please share your experience that leads you to say this?
Yet I do. I find it delightfully easy to operate. I know others who run it in prod at larger scale too.