As soon as I saw consul-templaterb I knew this was gonna be good. :)
Big shout-out to Pierre and the Consul folks at Criteo. Their tools are absolutely top-notch, and consul-templaterb has me using ERBs for everything even though I’m not a huge fan of Ruby.
I’m wondering if they’ve split Consul into two clusters, one for services+health, and one for KV. We’ve had to do that internally due to a few teams being really write-heavy on the KV endpoints and write scaling being the ‘hardest’ part of Consul. I’m totally fine with running a billion clusters because operationally Consul is pretty darn lightweight compared to some other stuff. It gets hairy when things go really sideways but it’s troubleshoot-able even in a terrible state thankfully.
> consul experts: avert your eyes
I am looking away, respectfully.
> For any endpoint in the Consul HTTP API, if something changes, we get a refresh of all the data for that endpoint, which means incremental changes, which happen every few seconds, are expensive.
Fly people, I hope you read this! Use streaming; it’s been greatly improved starting with 1.11 (backported to 1.10.6 or .7?). Also pay attention to the upcoming rate limiting deep within the system, which will allow you to run a single massive cluster and keep Raft writes flowing even in the face of huge problems.
Love the usage of “attache.” I think this is a solution to this problem that is implemented a lot.
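To make the "refresh of all the data" cost concrete, here is a minimal sketch of the usual workaround: keep the previous full snapshot around and turn each refresh into a small set of changes. This is not how attache is actually implemented; the snapshot shape (service ID mapped to a record) and the name diff_snapshots are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical shape: service ID -> service record, as decoded from JSON.
Snapshot = Dict[str, Any]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Tuple[Snapshot, Snapshot, List[str]]:
    """Return (added, changed, removed_ids) between two full snapshots."""
    added = {k: v for k, v in new.items() if k not in old}
    changed = {k: v for k, v in new.items() if k in old and old[k] != v}
    removed = sorted(k for k in old if k not in new)
    return added, changed, removed
```

Downstream consumers then only see the three small collections instead of re-processing the whole payload on every wakeup.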
> There’s no one endpoint that we can long-poll for all the nodes.
Isn’t this just v1/catalog/nodes? Something I’ve switched us to is using the agent API wherever possible. Unfortunately it doesn’t allow long polls, but you can absolutely hammer the ever-living crap out of these endpoints with zero effect because, to look at ‘attache’ again, they only ever consider local state. Not just that, but it’s “perfectly consistent,” as there’s no network or distribution happening there. :)
Also thinking about this, I wonder if the v1/health/state/any endpoint would’ve worked. That’s a biiiig endpoint, but you can long-poll on it and cache it up!
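For anyone following along, a long-poll ("blocking query") against an endpoint like that looks roughly like the sketch below. It uses the index and wait query parameters and the X-Consul-Index response header from the Consul HTTP API; the agent address, the helper names, and the 330-second client timeout are my assumptions, and there is no retry or backoff, so treat it as an illustration rather than something to deploy.

```python
import json
import urllib.request
from typing import Callable, Optional

CONSUL = "http://127.0.0.1:8500"  # assumption: local agent on the default port


def blocking_url(path: str, index: int, wait: str = "5m") -> str:
    """Build a blocking-query URL; index=0 returns the current data immediately."""
    return f"{CONSUL}{path}?index={index}&wait={wait}"


def next_index(previous: int, header_value: Optional[str]) -> int:
    """Choose the index for the next request.

    Falls back to 0 (a full, non-blocking read) when the header is missing,
    not a number, not positive, or has gone backwards.
    """
    try:
        new = int(header_value) if header_value is not None else 0
    except ValueError:
        return 0
    return new if new > 0 and new >= previous else 0


def watch(path: str, on_change: Callable[[object], None]) -> None:
    """Long-poll `path` forever, calling on_change only when the index moves."""
    index = 0
    while True:
        # The client timeout has to exceed the wait, since the server may add jitter.
        with urllib.request.urlopen(blocking_url(path, index), timeout=330) as resp:
            new = next_index(index, resp.headers.get("X-Consul-Index"))
            body = json.load(resp)
        if new != index:
            on_change(body)  # still the full payload for this endpoint
        index = new
```

Calling watch("/v1/health/state/any", print) would print the whole check list each time the index changes, which is exactly the "biiiig endpoint" tradeoff: one connection, but a full payload per wakeup.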
The last time I checked, streaming wasn’t supported for the catalog endpoints. Has that changed?
v1/catalog/nodes doesn’t give us the per-service metadata, which is the reason we’re individually long-polling all the node endpoints.
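As a rough illustration of the difference, and assuming "the node endpoints" means v1/catalog/node/<node>: that response carries a Services map whose entries include a Meta field, which the node list does not. The helper name service_meta and the sample payload are made up; only the Node, Services, and Meta keys are taken from the Consul catalog API.

```python
from typing import Any, Dict, Optional


def service_meta(node_response: Optional[Dict[str, Any]]) -> Dict[str, Dict[str, str]]:
    """Map service ID -> Meta from a v1/catalog/node/<node> style response.

    Tolerates a null response (unknown node) and services without Meta.
    """
    services = (node_response or {}).get("Services") or {}
    return {sid: (svc.get("Meta") or {}) for sid, svc in services.items()}


# Illustrative payload only, not real data.
sample = {
    "Node": {"Node": "worker-1", "Address": "10.0.0.1"},
    "Services": {
        "web-1": {"ID": "web-1", "Service": "web", "Port": 8080, "Meta": {"region": "ord"}},
        "db-1": {"ID": "db-1", "Service": "db", "Port": 5432, "Meta": None},
    },
}
print(service_meta(sample))
```

Getting this for every node means one long-poll per node, which is where the cost comes from.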
I’ve watched the HashiCorp stuff from afar but never actually used it. The reference to NATS made me wonder if they considered using Serf, and if so, why they went with NATS instead. Serf does not seem to get much airtime for some reason.
We used to use Serf to gossip load; I wrote a blog post about Serf, because I like it too:
https://fly.io/blog/building-clusters-with-serf/
Ultimately NATS was just easier to get working across Go and Rust for us, and it’s more flexible. But they’re pretty similar systems.
Thanks for sharing that with me!
Thanks for asking! :)
Since you like questions… I assume you are migrating away from Nomad because it no longer fits your abstractions. While I can guess, what are those reasons? From the sound of it, you are migrating to a homegrown solution.
@tptacek answered this on the orange site: https://news.ycombinator.com/item?id=30863610
Bump. :) I was interested to hear what the plan was there also.
https://lobste.rs/s/spvdwx/consul_at_fly_io#c_adpa5o