1. 15
  1. 28

    I just can’t shake the feeling that Kubernetes is Google externalizing their training costs to the industry as a whole (and I feel the same applies to Go).

    1. 9

      Golang is great for general application development, IME. I like the culture of explicit error handling with thoughtful error messages, the culture of debugging with fast unit tests, when possible, and the culture of straightforward design. And interfaces are great for breaking development into components which can be developed in parallel. What don’t you like about it?

      1. 12

        It was initially the patronizing quote from Rob Pike that turned me off Go. I’m also not a fan of gofmt [1] (and I’m not a fan of opinionated software in general, unless I’m the one controlling the opinions [2]). I’m also unconvinced about the whole “unit testing” thing [5]. Also, it’s from Google [3]. I rarely mention it, because it goes against the current zeitgeist (especially at the Orange Site), and really, what can I do about it?

        [1] I’m sorry, but opening braces go on their own line. We aren’t developing on 24 line terminals anymore, so stop shoving your outdated opinions in my face.

        [2] And yes, I realize I’m being hypocritical here.

        [3] Google is (in my opinion, in case that’s not apparent) shoving what they want on the entire industry to a degree that Microsoft could only dream of. [4]

        [4] No, I’m not bitter. Really!

        [5] As an aside, but up through late 2020, my department had a method of development that worked (and it did not involve anything resembling a “unit test”)—in 10 years we only had two bugs get to production. In the past few months there’s been a management change and a drastic change in how we do development (Agile! Scrum! Unit tests über alles! We want push button testing!) and so far, we’ve had four bugs in production.

        Way to go!

        I should also note that my current manager retired, the other developer left for another job, and the QA engineer assigned to our team also left for another job (but has since come back because the job he moved to was worse, and we could really use him back in our office). So nearly the entire team was replaced back around December of 2020.

        1. 11

          I can’t even tell if this is a troll post or not.

          1. 1

            I can assure you that I’m not intentionally trolling, and those are my current feelings.

          2. 2

            I’m sorry, but opening braces go on their own line. We aren’t developing on 24 line terminals anymore, so stop shoving your outdated opinions in my face.

            I use a portrait monitor with a full-screen Emacs window for my programming, and I still find myself wishing for more vertical space when programming in curly-brace languages such as Go. And when I am stuck on a laptop screen I am delighted when working on a codebase which does not waste vertical space.

            Are you perhaps younger than I am, with very small fonts configured? I have found that as I age I need larger and larger fonts. Nothing grotesque yet, but I went from 9 to 12 to 14 and, in a few places, 16 points. All real 1/72” points, because I have my display settings configured that way. 18-year-old me would have thought I was ridiculous! Granted, you’ve been at your current employer at least 10 years, so I doubt you are 18🙂

            I’m also unconvinced about the whole “unit testing” thing … my department had a method of development that worked (and it did not involve anything resembling a “unit test”)—in 10 years we only had two bugs get to production. In the past few months there’s been a management change and a drastic change in how we do development (Agile! Scrum! Unit tests über alles! We want push button testing!) and so far, we’ve had four bugs in production.

            I suspect that the increase in bugs has to do with the change in process rather than the testing regime. Adding more tests on its own can only lead to more bugs if either incorrect tests flag correct behaviour as bugs (leading to buggy ‘bugfixes,’ or rework to fix the tests), or if correct tests for unimportant bugs lead to investing resources inefficiently, or if the increased emphasis leads to worse code architecture or rework rewriting old code to conform to the new architecture (I think I covered all the bases here). OTOH, changing development processes almost inevitably leads to poor outcomes in the short term: there is a learning curve; people and secondary processes must adapt &c.

            That is worth it if the long-term outcomes are sufficiently better. In the specific case of unit testing, I think it is worth it, especially in the long run and especially as team size increases. The trickiest thing about it in my experience has been getting the units right. I feel pretty confident about the right approach now, but … ask me in a decade!

            1. 2

              Are you perhaps younger than I am, with very small fonts configured?

              I don’t know, you didn’t give your age. I’m currently 52, and my coworkers (back when I was in the office) often complained about the small font size I use (and have used).

              I suspect that the increase in bugs has to do with the change in process rather than the testing regime.

              The code (and it’s several different programs that comprise the whole thing) was not written with unit testing in mind (even though it was initially written in 2010, it’s in C89/C++98, and the developer who wrote it didn’t believe in unit tests). We do have a regression test that tests end-to-end [1] but there are a few cases that as of right now require manual testing [2], which I (as a dev) can do, but generally QA does a more in-depth testing. And I (or rather, we devs did, before the major change) work closely with the QA engineer to coordinate testing.

              And that’s just the testing regime. The development regime is also being forced changed.

              [1] One program to generate the data required, and another program that runs the eight programs required (five of which aren’t being tested but need to be endpoints our stuff talks to) and runs through 15,800+ tests we have (it takes around two minutes). It’s gotten harder to add tests to it (the regression test is over five years old) due to the nature of how the cases are generated (automatically, and not all cases generated are technically “valid” in the sense we’ll see it in production).

              [2] Our business logic module queries two databases at the same time (via UDP—they’re DNS queries), so how does one automate the testing of result A returns before result B, result B returns before result A, A returns but B times out, B returns and A times out? The new manager wants “push button testing”.

              1. 1

                [2] Our business logic module queries two databases at the same time (via UDP—they’re DNS queries), so how does one automate the testing of result A returns before result B, result B returns before result A, A returns but B times out, B returns and A times out? The new manager wants “push button testing”

                Here are three options, but there are many others:

                1. Separate the networking code from the business logic, test the business logic
                2. Have the business logic send to a test server running on localhost, have it send back results ordered as needed
                3. Change the routing configuration or use netfilter to rewrite the requests to a test server, have it send back results ordered as needed.

                Re-ordering results from databases is a major part of what Jepsen does; you could take ideas from there too.
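                To make option 1 concrete, here is a minimal Go sketch (all names — `Resolver`, `Decide`, `fakeResolver` — are illustrative, not the actual codebase): the two parallel lookups are hidden behind an interface, so the four cases (A then B, B then A, A times out, B times out) become scripted inputs to a plain function instead of real UDP races.

                ```go
                // Sketch of option 1: separate the networking code from the
                // business logic so orderings and timeouts can be scripted.
                package main

                import (
                	"errors"
                	"fmt"
                )

                // Resolver abstracts the two parallel database queries.
                // A real implementation would do the DNS-over-UDP lookups;
                // tests substitute a fake with canned results.
                type Resolver interface {
                	LookupA(number string) (string, error)
                	LookupB(number string) (string, error)
                }

                // Decide is the business logic under test: it must tolerate
                // either result arriving, and either one timing out.
                func Decide(r Resolver, number string) string {
                	a, errA := r.LookupA(number)
                	b, errB := r.LookupB(number)
                	switch {
                	case errA == nil && errB == nil:
                		return a + "/" + b
                	case errA == nil:
                		return a
                	case errB == nil:
                		return b
                	default:
                		return "default-route"
                	}
                }

                // fakeResolver scripts each case: both answer, one times out, etc.
                type fakeResolver struct {
                	a, b       string
                	aErr, bErr error
                }

                func (f fakeResolver) LookupA(string) (string, error) { return f.a, f.aErr }
                func (f fakeResolver) LookupB(string) (string, error) { return f.b, f.bErr }

                func main() {
                	timeout := errors.New("timeout")
                	fmt.Println(Decide(fakeResolver{a: "name", b: "rep"}, "5551234"))      // both answer
                	fmt.Println(Decide(fakeResolver{a: "name", bErr: timeout}, "5551234")) // B times out
                	fmt.Println(Decide(fakeResolver{aErr: timeout, bErr: timeout}, "5551234"))
                }
                ```

                For strict arrival-order cases (A before B vs. B before A) the same idea extends to a fake that delivers results on channels in a chosen order; the point is that the race becomes a test parameter rather than a network accident.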

                1. 1
                  1. Even if that was possible (and I wish it was), I would still have to test the networking code to ensure it’s working, per the new regime.
                  2. That’s what I’m doing
                  3. I’m not sure I understand what you mean by “routing configuration”, but I do understand what “netfilter” is, and my response to that is—the new regime wants “push button testing,” and if there’s a way to automate that, then that is an option.
                  1. 2
                    1. Yes, of course the networking code would still need to be tested.

                      Ideally, the networking code would have its own unit tests. And, of course, unit tests don’t replace integration tests. Test pyramid and such.

                    2. 🚀

                    3. netfilter can be automated. It’s an API.

                    What’s push button testing?

                    1. 1

                      You want to test the program. You push a button. All the tests run. That’s it. Fully automated testing.

                      1. 1


                        Everything I’ve worked on since ~2005 has been fully and automatically tested via continuous integration. IMHO it’s a game changer.

            2. 1

              Would love to hear about your prior development method. Did adopting the new practices have any upsides?

              1. 4

                First off, our stuff is a collection of components that work together. There are two front-end pieces (one for SS7 traffic, one for SIP traffic) that then talk to the back-end (that implements the business logic). The back-end makes parallel DNS queries [1] to get the required information, mucks with the data according to the business logic, then returns data to the front-ends to ultimately return the information back to the Oligarchic Cell Phone Companies. Since this process happens as a call is being placed, we are on the Oligarchic Cell Phone Companies’ network, and we have some pretty short time constraints. And due to this, not only do we have some pretty severe SLAs, but any updates have to be approved 10 business days before deployment by said Oligarchic Cell Phone Companies. As a result, we might get four deployments per year [2].

                And the components are written in a combination of C89, C++98 [3], C99, and Lua [4].

                So, now that you have some background, our development process. We do trunk based development (all work done on one branch, for the most part). We do NOT have continuous deployment (as noted above). When working, we developers (which never numbered more than three) would do local testing, either with the regression test, or another tool that allows us to target a particular data configuration (based off the regression test, which starts eight programs, five of which are just needed for the components being tested). Why not test just the business logic? Said logic is spread throughout the back-end process, intermixed with all the I/O it does (it needs data from multiple sources, queried at the same time).

                Anyway, code is written, committed (main line), tested, fixed, committed (main line), repeat, until we feel it’s good. And the “tested” part not only includes us developers, but also QA at the same time. Once it’s deemed working (using both regression testing and manual testing), we then officially pass it over to QA, who walks it down the line from the QA servers, staging servers and finally (once we get permission from the Oligarchic Cell Phone Companies) into production, where not only devops is involved, but QA and the developer whose code is being installed (at 2:00 am Eastern, Tuesday, Wednesday or Thursday, never Monday or Friday).

                Due to the nature of what we are dealing with, testing at all is damn near impossible (or rather, hideously expensive, because getting actual cell phone traffic through the lab environment involves, well, being a phone company (which we aren’t), very expensive and hard to get equipment, and a very expensive and hard to get laboratory setup (that will meet FCC regulations, blah blah yada yada)) so we do the best we can. We can inject messages as if they were coming from cell phones, but it’s still not a real cell phone, so there is testing done during deployment into production.

                It’s been a 10 year process, and it has gotten better until this past December.

                Now it’s all Agile, scrum, stories, milestones, sprints, and unit testing über alles! As I told my new manager, why bother with a two week sprint when the Oligarchic Cell Phone Companies have a two year sprint? It’s not like we ever did continuous deployment. Could more testing be done automatically? I’m sure, but there are aspects that are very difficult to test automatically [5]. Also, more branch development. I wouldn’t mind this so much, except we’re using SVN (for reasons that are mostly historical at this point) and branching is … um … not as easy as in git. [6] And the new developer sent me diffs to ensure his work passes the tests. When I asked him why he didn’t check the new code in, he said he was told by the new manager not to, as it could “break the build.” But we’ve broken the build before this—all we do is just fix the code and check it in [8]. But no, no “breaking the build”, even though we don’t do continuous integration, nor continuous deployment, and what deployment process we do have locks in the Jenkins build number of what does get pushed (or considered “gold”).

                Is there any upside to the new regime? Well, I have rewritten the regression test (for the third time now) to include such features as “delay this response” and “did we not send a notification to this process”. I should note that this is code for us, not for our customer, which, need I remind people, is the Oligarchic Cell Phone Companies. If anyone is interested, I have spent June and July blogging about this (among other things).

                [1] Looking up NAPTR records to convert phone numbers to names, and another set to return the “reputation” of the phone number.

                [2] It took us five years to get one SIP header changed slightly by the Oligarchic Cell Phone Companies to add a bit more context to the call. Five years. Continuous deployment? What’s that?

                [3] The original development happened in 2010, and the only developer at the time was a) very conservative, b) didn’t believe in unit tests. The code is not written in a way to make it easy to unit test, at least, as how I understand unit testing.

                [4] A prototype I wrote to get my head around parsing SIP messages that got deployed to production without my knowing it by a previous manager who was convinced the company would go out of business if it wasn’t. This was six years ago. We’re still in business, and I don’t think we’re going out of business any time soon.

                [5] As I mentioned, we have multiple outstanding requests to various data sources, and other components that are notified via a “fire and forget” mechanism (UDP, but it’s all on the same segment) that the new regime wants to ensure get notified correctly. Think about that for a second: how do you prove a negative? That is, how do you prove that something that wasn’t supposed to happen (like a component not getting notified) didn’t happen?
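                One hedged approach to that negative (a sketch, not the actual test harness; the helper name and addresses are made up): stand in for the component with a UDP listener and assert that nothing arrives before a deadline. A clean read timeout is the “proof” that no notification was sent.

                ```go
                // Sketch: assert that a component was NOT notified by listening
                // on its UDP port and treating a read timeout as success.
                package main

                import (
                	"fmt"
                	"net"
                	"time"
                )

                // expectNoDatagram returns nil if nothing arrives on conn before
                // the deadline, and an error describing the unexpected packet
                // otherwise.
                func expectNoDatagram(conn *net.UDPConn, wait time.Duration) error {
                	conn.SetReadDeadline(time.Now().Add(wait))
                	buf := make([]byte, 1500)
                	n, addr, err := conn.ReadFromUDP(buf)
                	if err != nil {
                		if ne, ok := err.(net.Error); ok && ne.Timeout() {
                			return nil // timed out waiting: the negative holds
                		}
                		return err
                	}
                	return fmt.Errorf("unexpected %d-byte notification from %v", n, addr)
                }

                func main() {
                	addr, _ := net.ResolveUDPAddr("udp", "127.0.0.1:0")
                	conn, err := net.ListenUDP("udp", addr)
                	if err != nil {
                		panic(err)
                	}
                	defer conn.Close()

                	// The system under test would be exercised here; nothing
                	// sends in this sketch, so the check passes.
                	if err := expectNoDatagram(conn, 200*time.Millisecond); err != nil {
                		fmt.Println("FAIL:", err)
                	} else {
                		fmt.Println("ok: component was not notified")
                	}
                }
                ```

                It only bounds “didn’t happen” by the chosen deadline, of course, but that is usually the best an automated negative check can do.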

                [6] I think we’re the only department left using SVN—the rest of the company has switched to git. Why are we still on SVN? 1) Because the Solaris [7] build servers aren’t configured to pull from git yet and 2) the only redeeming feature of SVN is the ability to checkout a subdirectory, which given the layout of our repository, and how devops want the build servers configured, is used extensively. I did look into using git submodules, but man, what a mess. It totally doesn’t work for us.

                [7] Oh, did I neglect to mention we’re still using Solaris because of SLAs? Because we are.

                [8] Usually, it’s Jenkins that breaks the build, not the code we checked in. Sometimes, the Jenkins checkout fails. Devops has to fix the build server [7] and try the call again.

                1. 2

                  As a result, we might get four deployments per year [2]

                  AIUI most agile practices are to decrease cycle time and get faster feedback. If you can’t, though, then you can’t! Wrong practices for the wrong context.

                  I feel for you.

                  1. 1

                    Thank you! More grist for my “unit testing is fine in its place” mill.

                    Also: hiring new management is super risky.

          3. 43

            Tell me your job is operating kubernetes and you want job security without telling me that your job is operating kubernetes and you want job security.

            1. 8

              I find it disappointing that the top comment, with ~30 upvotes and unchallenged for several hours, is cynically questioning the author’s motives. Isn’t there already enough cynicism in the world? We should be better than that.

              1. 6

                It’s meant to be taken humorously. The author’s main argument is an appeal to expertise with the statement that he mixed the kool-aid. The rest of the article is based on personal opinion so there isn’t much else to say. If you have a similar experience to the author then you will agree, otherwise not.

                1. 2

                  I don’t know, every article about Kubernetes is followed by comments about conspiracies, and about how anyone pro-Kubernetes must be some shill or insecure software engineer looking to hype the industry so they can have jobs. To me this sounds more like a low-quality troll comment than humor. There’s nothing technical or insightful in @cadey’s comment.

                  1. 10

                    My comments were meant to be taken humorously. This exact comment is in the vein of a rich kind of twitter shitposting of the model “tell me x without telling me x” as a way to point out ironic or otherwise veiled points under the surface of the original poster’s arguments. I am not trying to say anything about the author as a person (English is bad at encoding this kind of intent tersely) or anything about their skill level. I guess the true insight here is something along the lines of this Upton Sinclair quote:

                    It is difficult to get a man to understand something, when his salary depends on his not understanding it.

                    I also burned out on kubernetes clusterfucks so hard I almost quit tech, so there is a level of “oh god please do not do this to yourself” chorded into my sarcastic take.

              2. 2

                fwiw - I am currently unemployed and working on a startup in an unrelated space. I haven’t worked on Kubernetes in 2 years.

              3. 9

                I built Kubernetes tooling at Airbnb (previous employer) and am very happy Notion (current employer) is all ECS.

                1. 1

                  Do Notion have an engineering blog of some kind? I love learning more about the products I use.

                  1. 2

                    So far we only have one post: https://www.notion.so/blog/topic/tech

                2. 7

                  Yes, you need a job orchestrator to abstract away your machines.

                  You should be running Hashicorp’s Nomad unless you are a Really Massive Shop With People To Spare On Operating Kubernetes.

                  In nomad I can run arbitrary jobs as well as run and orchestrate docker containers. This is something Kubernetes can’t do.

                  In nomad I can upgrade without many gymnastics. That feels good.

                  1. 13

                    Operating a simple kubernetes cluster isn’t that bad, especially with distributions such as k3s and microk8s.

                    You should be running Hashicorp’s Nomad unless you are a Really Massive Shop With People To Spare On Operating Kubernetes.

                    You should do what works for your team/company/whatever. There’s more than just Nomad and Kubernetes and people should be equipped to make decisions based on their unique situation better than someone on the internet saying they should be using their particular favorite orchestration tool.

                    1. 6

                      There’s more than just Nomad and Kubernetes

                      For serious deployment of containerised software, is there really? I did quite a bit of digging and the landscape is pretty much Nomad/Kubernetes, or small attempts at some abstraction like Rover, or Dokku for a single node. Or distros like OpenShift which are just another layer on Kubernetes.

                      I’m still sad about rkt going away…

                      1. 1

                        LXD is still around and developed!

                        1. 3

                          LXD is for system containers, not application containers. Pets, not cattle.

                          I really enjoy using LXD for that purpose, though. Feels like Solaris Zones on Linux (but to be honest, way less integrated than it should be because Linux is grown and illumosen are designed).

                      2. 2

                        Well said ngp. Whether you want to talk about Kubernetes or the class of orchestrators like it, it’s clear that there are enough tangible benefits to the developer and devops workflow that at least some of the paradigms or lessons are here to stay.

                      3. 2

                        No, I do not.

                        1. 2

                          I’ve only heard “you should run Nomad instead” in online comments, every time I hear about Nomad in person (eg from infra at Lob) it’s “we’re migrating from Nomad to X (usually Kubernetes) because Y”

                          1. 1

                            I haven’t tried Nomad yet, even though I heard nice things about it, and it seems they like Nix. What would be Ys that others list, what are the corner cases where one should avoid Nomad?

                            1. 1

                              I think when a lot of work would be saved by using existing helm charts.

                              Or when you need to integrate with specific tooling.

                              Or when you need to go to a managed service and it’s easier to find k8s.

                              And finally I think when you need to hire experienced admins and they all have k8s and not nomad.

                        2. 5

                          My goal with infrastructure is to forget its existence, both in maintenance and on my DO invoice. Until it’s as cheap as a systemd unit on a $5 VPS, I don’t care about it for small-scale projects.

                          I think k8s is far more interesting for its API-based approaches (we’re bringing DCOM back), but I’d rather see the concepts implemented in something much simpler for again, small-scale things.

                          1. 4

                            imo, what’s really nice about using Kubernetes is

                            • everything is an object
                            • since you edit objects via an API, everything has an API
                            • everything is in its own netns, cgroup

                            We could absolutely build an infrastructure platform that does all of the above minus the boilerplate (and maybe even minus containers), but I don’t think that exists yet. These days I’m happy enough running Kubernetes everywhere (even single node) just so I don’t have to deal with netns myself.

                            1. 2

                              There is systemd which offers all of that minus boilerplate and minus containers.

                              1. 2

                                minus containers

                                plus containers, of course https://man7.org/linux/man-pages/man5/systemd.nspawn.5.html

                                1. 3

                                  Well, if you want, you can use them, but these aren’t required. That is why I said “minus containers”.

                                  1. 2

                                    Ah, I misunderstood you then.

                                2. 1

                                  systemd is great but you still need something to deploy those unit files. And you have to control what goes where, which is usually the job of a scheduler.

                                  For a single machine, systemd over k8s any day. Anything more than a handful, it’s debatable.

                                  1. 1

                                    I was planning on writing multi-node scheduler for systemd in Erlang/Elixir. Maybe one day.

                                    However even in basic form you can manage that - ship everything everywhere and then use socket activation and LB in front of everything to start services as needed.

                                  2. 1

                                    Well, not quite. I don’t want to write an essay here, but here’s one example of why it really doesn’t:

                                    Let’s say we decide we want to create network namespaces to isolate all our services, and then selectively poke holes between them for services to communicate. Let’s look at how we would solve this in Kubernetes vs systemd.

                                    In Kubernetes, we could create an object type (CustomResourceDefinition) called FirewallRule. We’d then hook into an HTTP API, which notifies our program of any changes to these objects. On change, some code runs to reconcile reality with the state of the object. Of course, Kubernetes transparently handles the creation of network namespaces, and provides a builtin object type to poke holes between them (including a layer of abstraction on top where programs running on separate machines look like they’re in the same namespace), so in reality we would just use that.

                                    In systemd, we cannot create custom object types. Instead, to create a network namespace, we would wrap our .service with a shell script to start up the network namespace. To poke holes, we might create another .service unit, which spawns a program with some arguments that specify properties (source, destination, etc). We have to be careful to specify that the second unit depends on the first unit starting (otherwise the netns doesn’t exist).

                                    Let’s say opening and closing a hole in the network is an expensive operation, but modifying the port number is cheap. All we have as input in systemd is a Start and Stop, so we’d have to open and close the hole when we modify the unit file (expensive). In Kubernetes we get a whole diff of {current state, desired state} to work with, so we can choose to just edit the port (cheap). In this way, systemd isn’t really a true declarative abstraction over running your infrastructure, and is more like a shell script where you can specify a list of steps and then run those steps in the background.
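                                    The cheap-vs-expensive distinction can be sketched in a few lines of Go (the type and action names here are hypothetical, standing in for a real controller’s reconcile loop): with both current and desired state available, the reconciler picks the cheap action when only the port changed, something systemd’s Start/Stop model cannot express.

                                    ```go
                                    // Sketch of the reconcile idea: diff current vs. desired
                                    // state and choose the cheapest action that closes the gap.
                                    package main

                                    import "fmt"

                                    // FirewallRule stands in for the custom object described above.
                                    type FirewallRule struct {
                                    	Src, Dst string
                                    	Port     int
                                    }

                                    // reconcile returns the action needed to move current to desired.
                                    func reconcile(current, desired FirewallRule) string {
                                    	switch {
                                    	case current == desired:
                                    		return "noop"
                                    	case current.Src == desired.Src && current.Dst == desired.Dst:
                                    		return "update-port" // cheap: endpoints unchanged
                                    	default:
                                    		return "recreate" // expensive: close old hole, open a new one
                                    	}
                                    }

                                    func main() {
                                    	cur := FirewallRule{Src: "web", Dst: "db", Port: 5432}
                                    	fmt.Println(reconcile(cur, FirewallRule{Src: "web", Dst: "db", Port: 5433}))
                                    	fmt.Println(reconcile(cur, FirewallRule{Src: "web", Dst: "cache", Port: 5432}))
                                    }
                                    ```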

                                    That said, I don’t think containers are the cleanest way of having object-based declarative infrastructure. Maybe something like NixOS is a better way forward long-term (the tooling and adoption are currently… ehhh). But for now, if I have to pick between writing imperative infrastructure and running containers, I’m gonna run containers.

                                    1. 2

                                      In systemd, we cannot create custom object types.

                                      Depends on your definition of “creation of custom types”. You can create “instantiated units” from unit templates, and later depend on them—for example, a template whose instances each create a named netns. Then all you need to do in your service is add a dependency on the given netns like:
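
                                      (A minimal illustration; the unit names are hypothetical and this assumes a `netns@.service` template that creates the named namespace.)

                                      ```ini
                                      # myapp.service — depends on the instantiated netns unit
                                      [Unit]
                                      Requires=netns@myapp.service
                                      After=netns@myapp.service

                                      [Service]
                                      # Join the namespace created by netns@myapp.service
                                      NetworkNamespacePath=/run/netns/myapp
                                      ExecStart=/usr/local/bin/myapp
                                      ```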


                                      And your service will be started after network namespace is created.

                                      This also solves the second problem you specified:

                                      All we have as input in systemd is a Start and Stop, so we’d have to open and close the hole when we modify the unit file (expensive).

                                      As we need to modify only our application service, not the netns service, we can restart our application without bringing the network down at all. The same goes for service that would be used for opening and closing ports in firewall.

                                      So the same approach is fully possible in systemd without many hiccups. Actually I think that it can be made even clearer for the Ops, as there is no “magical custom thing” like your FirewallRule, but everything is using “regular” systemd facilities.

                                      1. 2

                                        I really like instantiated types; they make dealing with, e.g., lots of site-to-site VPN links easy. I could see it used for other things; imagine describing a web service that slots into an application server as a unit like that, and have it automatically enroll in the web server’s routes/reverse proxy/whatever.

                                  3. 2

                                    Yeah, I feel the exciting part isn’t being able to build Google-scale container clusterfucks, but something like cPanel on top of clean APIs with a modern approach.

                                3. 1

                                  I’m a sad kubernetes maximalist. I think Matt is basically right, I just don’t enjoy it.

                                  1. 1

                                    I agree. The terminology is overloaded and the developer tools use different abstractions than the production systems. I wrote an internal doc at Google called “Towards Pod-native Tooling” before podman came around, because I saw that most of our developer tools were container-centric rather than pod-centric. It never got anywhere, but I incorporated some of the ideas into the tools I was working on at the time.