Greenspun’s Tenth Rule famously states: “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”
Maybe Hébert’s Tenth Rule could be: “Any sufficiently complicated Kubernetes deploy contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of OTP.”
Perhaps Kubernetes would never have happened if runtime hot code reloading were available in Java, Python, etc.?
I quit my last job because we switched from shipping turnkey hardware to running in the cloud with Kubernetes. I was suddenly on call one week out of four, and I got to sleep through maybe three of the seven nights I was on call. I was also specifically forbidden from working on the code that kept me from sleeping because “we’re just about to replace it”. I finally went to the VP and said the next time I was put on call, I would put in my two weeks’ notice. Three months later I showed up in the on-call rotation (they said it was an accident) and I put in my two weeks’ notice.
An everyday horror story. Even without dwelling on it, being “specifically forbidden from working on the code that kept me from sleeping” also sounds like a bad thing.
That’s awful. Glad you got out. Hope you found a better place.
I both love and hate this.
I love this because it describes an excellent production and deployment cycle.
I hate this because, dang, it appears we keep posting these essays and nobody is listening. The tech community has solved a slew of problems that others are still suffering through, yet sadly the message never seems to land.
Anything I can do to help amplify the message, let me know.
Software development, as a field, changes slowly; it often seems to take a generation for a real change to happen. Sometimes people are listening, and just can’t get past a certain hang-up associated with a piece of tech (Erlang’s syntax comes to mind as a complaint, as does the fact that, until recently, it evolved well outside of the dominant traditions of programming).
For many, having a backstop at all is an improvement.
I wonder two things.
Is it possible to radically change the way we think of software one incremental step at a time? Language designers and many others seem to think so: they continuously “steal” ideas from elsewhere and cram them into whatever they’re currently doing.
Is the fact that Erlang developed so far outside of “normal” development the reason it seems to work so well? In other words, does the community itself, regardless of which cool kids are ascendant currently, have so many rocks and obstacles built into how it thinks about software that it’ll never progress significantly if left to its own druthers?

To the first question: I believe it is possible. TypeScript would be my primary example, as its gradual type system is the sort of thing that, 15-20 years ago, was only available in Common Lisp.
2.a) I don’t think it’s Erlang being niche that makes it work so well. There are a lot of other niche technologies that don’t work as well as Erlang; some of them you just don’t hear about, others have had their time come and go. ClipperBasic or Evelop Basic come to mind as being completely outclassed by now. AutoHotkey is another example of a niche (though less so) tool that does what it needs to do, but is reported to be very messy under the hood.
2.b) I don’t think it is helpful to think of programming as a singular “community”. There are a bunch of communities in programming, from BSD to Unity to Godot to Vue.js to WordPress. But, in spite of that, communities don’t exist in a vacuum. There is cross-pollination of ideas, and forks, and so on. So, most of those communities aren’t left completely to their own druthers, and even if they were, after a given amount of time, people in those communities would be likely to start looking for ways to improve things.
Granted, there are some sorts of ideas that are harder to communicate than others, and some ideas are very widespread across many communities.
Nice article, ferd. One thing that resonated with me was:
“To me, abandoning all these live upgrades to have only k8s is like someone is asking me to just get rid of all error and exceptions handling and reboot the computer each time a small thing goes wrong.”
I think this stems from how complex software has gotten and the myriad of dependencies that an average developer/team has to deal with. More often than not it’s just simpler to pull the rug out and start afresh. Throw in some redundancy and another layer of abstraction via a LB, and your clients are mostly unaware of the shit that you’re dealing with. Where all of this breaks down is when the pattern is applied to workloads where it makes no sense and you need fine-grained control. At this point, I feel k8s has become like a hammer, and developers/evangelists are hoping to treat every problem as a nail to use it on.
PS: Was the hot reload thing at Heroku? :D I’m at Heroku now and I wish I knew more Erlang / had gotten a chance to work with you.
The first story was on Logplex, which I think finally got taken down / replaced earlier this year after the 3rd or 4th attempt by people to write its Go replacement (though I do not have first-hand reports on its demise). The routing layer never had live code updates during my time there: although it could have benefited from it, the imperative to preserve transient state was much lower, and instead we did rolling restarts with a disk cache and without node replacements on there (full node replacement came with CVEs, more major security updates, or unexplained bad performance where rotating to a new instance made sense as a resolution).
I miss classic Logplex; that thing could take a surprising amount of punishment and keep on smiling.
I learned so much by working on Logplex with Ferd; in particular by observing how well the early design decisions allowed it to grow and gracefully respond to whatever we threw at it. That’s probably the most I’ve ever learned at any job in the shortest time.
Oh cool, you’re ex-Herokai too?
Yeah, 2011-2014; worked on codon/git, buildpacks, and then logplex at the end.
https://github.com/heroku/logplex/commits?author=technomancy
This is what social coding is: network effects overpowering everything.
It feels similar to some of what drove me crazy about webdev: the misguided belief that tools will somehow replace experience and skills, the constant reinvention and re-complecting of things, and the complete lack of interest in the historical perspective of computing.
I’ve never worked with Erlang. The author alludes to the benefits of hot reloading over, say, blue/green deployments, but doesn’t spell them out (other than caches and keeping in-flight connections, the former of which can be a service like Redis/Memcache and the latter of which is handled elegantly by any half-decent reverse proxy load balancer).
Could someone please shed some light on why the scenario they describe is so much better than blue/green deployments?
This isn’t a rhetorical question; I’m genuinely curious, because they seem to be getting at something here of which I’m ignorant.
Erlang can have two versions of a module loaded at the same time: the current one and the old one.
If you have a long-running connection that’s using a module, and a new version of that module is loaded, the in-use version stays in memory until the connection ends.
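To make that concrete, here’s a minimal sketch (the counter module and its message protocol are made up for illustration): a process keeps running the code it started with until it makes a fully-qualified call into its own module, at which point it continues in the newest loaded version, with its in-memory state intact.

```erlang
-module(counter).
-export([start/0, loop/1]).

%% Hypothetical long-lived process holding its state (a counter)
%% entirely in memory.
start() ->
    spawn(fun() -> loop(0) end).

loop(N) ->
    receive
        {bump, From} ->
            From ! {count, N + 1},
            %% Fully-qualified call: if a new version of this module
            %% has been loaded (e.g. with c(counter). in the shell),
            %% execution continues in that new version.
            ?MODULE:loop(N + 1);
        stop ->
            ok
    end.
```

A local call (loop(N + 1) without the ?MODULE: prefix) would instead pin the process to the old version until it terminates.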
Outside of Erlang I’ve seen that done with a load balancer and enforcing short-term connections (as you describe). In my last job we were using a protocol that could not support that (SIP/SDP/RTP), so we had to wait until traffic was low and shut everything down to upgrade. That often went poorly; I wish we’d been using Erlang instead!
Even though development “best practice” has moved toward horizontal, stateless server pools, sometimes there is no decent substitute for having state in server memory, zero network hops away, latency in nanoseconds rather than milliseconds, and unconnected to any shared/cross-server state. And in the case of maintaining connections, half-decent load balancers can perhaps keep connections alive from the client point of view, but they cannot synchronize connection state across the servers that are handling the same client.
Hot reloading on the Erlang VM means the engineer has a choice in how they want to perform any given deployment. We can (and many of us do) use blue/green deployment in the common case, but our application designs are not constrained by blue/green being the only viable rollout strategy.
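In code terms, OTP behaviours expose a code_change/3 callback that runs during a hot upgrade, letting a gen_server migrate its in-memory state in place; a minimal sketch, with a hypothetical state migration:

```erlang
%% gen_server callback fragment invoked during a hot code upgrade
%% (orchestrated by the release handler, or manually via
%% sys:suspend/1, sys:change_code/4 and sys:resume/1).
code_change(_OldVsn, {state, Connections}, _Extra) ->
    %% Hypothetical migration: the old version only tracked open
    %% connections; the new version also keeps a cache map.
    {ok, {state, Connections, #{}}}.
```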
“hours of rollouts and draining and reconnection storms with state losses.”
I work with a platform that’s mostly built from containers running services (no k8s here though, if that’s important), but the above isn’t familiar to me.
State doesn’t get lost: load balancers drain connections and new tasks are spun up and requests go to the new tasks.
When there’s a failure: Retries happen.
When we absolutely have to have something work (eventually) or know everything about the failure: Persistent queues.
The author doesn’t specify what’s behind the time necessary for rollouts. I’ve seen some problematic services, but mostly a rollout takes minutes - and a whole code/test/security scan/deploy to preprod/test/deploy to prod/test/… cycle can be done in under an hour, with the longest part being the build and security scanning.
The author also talks about required - and scheduled - downtime. Again I don’t know why the platform(s) being described would necessarily force such a requirement.
Here’s one example: the service may require gigabytes of state to be downloaded to work with acceptable latency on local decisions, and that state is replicated in a way that is constantly updated. These could include ML models, large routing tables, or anything of the kind. Connections could be expected to be active for many minutes at a time (not everything is a web server serving short HTTP requests that are easy to resume), and so on.
Rolling the instances means having to re-transfer all of the data and re-establish all of that state, on all of the nodes. If you have a fleet of 300 instances that require 10 minutes to shut down from connection drains, and that take 10 minutes to come back up, re-sync their state, and return to top performance, rolling them in batches of 10 (because you want it somewhat gradual and to leave time for things to rebalance) will take roughly 10 hours (30 batches at 20 minutes each is 600 minutes), longer than a working day.
I do have some services which work in a similar way, come to think of it - loading some mathematical models and warming up before they’re performing adequately - along similar timescales.
I think we’ve been lucky enough not to be working at the number of instances you’ve described there, or to have fast enough iteration on the models for us to need to release them as often as daily.
For Erlang to be taken out of the picture where it was working nicely, doing what OTP does best, does sound painful.
We have something similar, at much larger numbers than described. We cut over new traffic to the new service versions, and keep the old service versions around for 2-4 weeks as the long tail of work drains.
It sucks. I really wish we had live reloading.
On the other hand, what the article mentions - something like “log in to the REPL and shoot the upgrade” - sounds like manual work. I would think that 1 hour of manual work and 10 hours of automated rollout come with different tradeoffs.
As for the fleet shutdown calculation, you can also deal with that differently. You can at the very least halve the time by first bringing the new instances up, then shutting down the old ones, so your batch doesn’t take 20 minutes, but 10. If you want to “leave time for things to rebalance”, you still have to do that in the system you described in the article.
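(For reference, this bring-up-first rollout is what Kubernetes expresses with maxSurge/maxUnavailable; a hypothetical Deployment fragment:)

```yaml
# Sketch of a Deployment rollout strategy: bring up to 10 extra pods
# up first, and never take capacity below the desired replica count.
spec:
  replicas: 300
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10
      maxUnavailable: 0
```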
Now, I’m not saying that I don’t agree with a lot of what you wrote there. But I did get a vibe of talking down on containers or k8s rather than comparing the tradeoffs. Still, I do agree with most of what you’ve said.
“You can at the very least halve the time by first bringing the new instances up, then shutting down the old ones, so your batch doesn’t take 20 minutes, but 10.”
That doesn’t sound right. It takes 10 minutes to bring up new instances and 10 minutes to drain the old ones, at least that’s my understanding. Changing the order of these steps has the advantage of over-provisioning such that availability can be guaranteed, but the trade-off is (slightly‽) higher cost short-term (10h in that example). Doing these 2 steps in parallel is of course an option and probably what you suggest.
What is the business explanation behind such moves? How can anyone trade a working and reliable system for something that is way more complex to set up and keep functional? Are people really that afraid of Erlang’s syntax, and of not finding developers? Does one Erlang engineer cost as much as a team of people managing a k8s cluster?
My guess is that it’s about moving a bunch of independently managed and sufficiently different systems into a generic runtime environment, so that people can use common tools to manage it. Nice in some cases, but it seems to discount the runtime environment of the languages themselves, which is a particular advantage of Erlang.
Another thing is talent acquisition. You’re far more likely to run into a decent Java dev than a decent Erlang dev. (Come to think of it, can you meet an indecent Erlang dev?)
One of the best hiring decisions I was part of was when we were doing IoT development on BSD devices with Erlang. We assumed nobody would know our stack, and set the whole thing up from the ground up on the assumption that we needed to teach everyone how things work.
Never been part of a team where we had such an easy time on-boarding people and making them productive in a short time. Turns out it’s much easier to teach the local product and business logic when training each other is part of the team culture, and the benefits extend beyond the language.
I’m sure it’s a bit self-selecting too: you get people who are interested in learning about these things, so they tend to be more motivated. Get a bunch of motivated people together and you have something special.
Sounds like you have the topic of another blog post if you so desire :)
I feel like this piece makes sense as an argument against monolithic applications shoe-horned into Kubernetes, but not so much for applications that fit under the hand-wavy category of “cloud native”.