1. 16

  2. 10

    I don’t really see a lot of smaller open source projects having their own LTS releases.

    What I see is them suffering from trying to support their ecosystem’s LTS releases. When CentOS ships with a Python that the Python core team has dropped, I’m stuck supporting it in my packages, because there are users there (and they blame me, not CentOS, for their troubles if I drop support).

    1. 2

      I don’t understand CentOS. Is enterprise really so inflexible that a shorter release cycle wouldn’t work?

      1. 12

        Yes. Change is bad. When you have scary SLAs (that is, when downtime on your end costs your company thousands of dollars per minute), you tend to be very careful about what you upgrade and when, especially if things are working (if it ain’t broke, don’t fix it).

        1. 1

          I wonder why we don’t build better software and devops practices to handle change. Maybe the pain of changing once in a while is less than the pain of rolling with tighter LTS windows?

          1. 7

            Because starting to use a new methodology to handle change is a change in its own right. And so a new technology can only climb the scale relatively slowly (“so many projects half our size have used this technology that we might as well run a small trial”). This means that some important kinds of feedback are received on a timescale of years, not weeks…

            1. 4

              Exactly. It’s not that enterprises don’t want to change; it’s that change in and of itself is hard. It is also expensive in time, which means money. Which basically means: keep things as static as possible to minimize when things break. If you have N changes amongst M things, debugging what truly broke is non-trivial. Reducing the scope of a regression is probably the number one motivator for never upgrading things.

              An example: at work we modify the Linux kernel for $REASONS, and needless to say, testing this is just plain hard. Random fixes to one part of the kernel can drastically alter how often, say, the OOM killer triggers. Sometimes you don’t see issues until several weeks of beating the crap out of things. When the feedback cycle is literally a month, I am not sure one could argue that we want more change to be possible.

              I don’t see much of a way to improve the situation beyond just sucking it up and accepting that certain changes cannot be rushed without triggering unknown unknowns. Even with multi-week testing you still might miss regressions.

              This is a different case entirely than making an update to a web application and restarting.

              1. 2

                First of all, thanks for a nice presentation of this kind of issue.

                This is a different case entirely than making an update to a web application and restarting.

                I am not sure what you mean here, because a lot of web applications have a lot of state, and a lot of inner structure, and a lot of rare events affecting the behaviour. I don’t want to deny that your case is more complicated than many; I just want to say that your post doesn’t convey that it is qualitatively different as opposed to quantitatively.

                I am not sure one could argue that we want more change to be possible.

                What you might have wanted is to compare throwing more hardware at the problem (so that you can run more code in a less brittle part of the system) with continuing with the current situation. And then there would be questions of managing deployments and their reproducibility, the possibility or impossibility of redundancy, fault isolation, etc. I guess in your specific case the current situation is optimal by some parameters.

                Then of course, the author of the original post might effectively have an interest, opposed to yours, in making your current situation more expensive to maintain (this could be linked to a change that might make something they want less expensive to get). Or maybe not.

                How much do the things discussed in the original post apply to your situation by the way? Do you try to cherry-pick fixes or to stabilize an environment with minimal necessary minor-version upgrades?

                1. 3

                  I am not sure what you mean here, because a lot of web applications have a lot of state, and a lot of inner structure, and a lot of rare events affecting the behaviour. I don’t want to deny that your case is more complicated than many; I just want to say that your post doesn’t convey that it is qualitatively different as opposed to quantitatively.

                  I’m not entirely sure I can reasonably argue that kernel hacking is qualitatively different from, say, web application development, but here goes. Mind that some of this is going to be specific to the use cases I encounter and thus can be considered an edge case; however, edge cases are always great for challenging assumptions you may not realize you had.

                  Let’s take the case of doing 175 deployments in one day that another commenter linked. For a web application, there are relatively easy ways of doing updates with minimal impact on end users. This is mostly possible because the overall stack is so far removed from the hardware that it’s relatively trivial to do. Mind you, I’m not trying to discount the difficulty, but overall it amounts to some sort of HA or load-balancing setup, say via DNS, HAProxy, etc., to handle flipping a switch from the old version to the new.

                  One might also have an in-application way to do A/B version flips in place; whatever the case, the ability to update is in lots of ways a feature of the application space.

                  A con of this very feature is that restarting the application and deploying a new version inherently destroys the state the application is in. Let’s say you have a memory bug: restarting magically fixes it, but you upgrade so often you never notice it. This is a case where I am almost 99% sure that any user-space developer would catch bugs if they were to run their application for longer than a month. Now, I doubt that will happen, but it’s something to contemplate. The ability to do rapid updates is a double-edged sword.
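
                  To illustrate, here is a contrived user-space C sketch (entirely hypothetical, not from any real codebase) of the kind of bug that frequent redeploys mask: a small leak confined to a rare error path.

                      #include <stdlib.h>

                      /* Leaks `scratch` on a rare error path. A service redeployed
                       * daily resets its heap before anyone notices; one left running
                       * for a month slowly bloats until the OOM killer steps in. */
                      static int handle_request(const char *payload, int fault)
                      {
                          char *scratch = malloc(4096);
                          if (scratch == NULL)
                              return -1;
                          if (fault)      /* rare error path... */
                              return -1;  /* ...returns without free(scratch): 4 KB leaked */
                          (void)payload;  /* ... normal request processing using scratch ... */
                          free(scratch);
                          return 0;
                      }

                      int main(void)
                      {
                          /* stand-in for a server's request loop: one request in 100000
                           * takes the leaky path, so memory grows invisibly for days */
                          for (long i = 0; ; i++)
                              handle_request("payload", i % 100000 == 0);
                      }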

                  Now let’s compare to the kernel. Take a trivial idea like adding a 64-bit pointer to the sk_buff struct. Easy, right? Shouldn’t impact a thing; it’s just 64 bits, and what’s 64 bits of memory amongst friends? A lot, it turns out. Let’s say you’re running network traffic at 10Gb/s while you have a user-space application using up as much memory as it can, probably overcommitting memory as well just to be annoying. Debugging why this application triggers the OOM killer after a simple change like I described is definitely non-trivial. The other problem is that you need to trigger the exact circumstances to hit the bug. And worst of all, it can often be a confluence of bugs that triggers it: say some network driver leaks a byte every so often once some queue is over a certain size, meaning you have to run things a long time to get back to that state.

                  I’m using a single example, but I could give others where the filesystem can similarly come into play.

                  Note, since I’m talking about Linux, let’s review what a kernel update cannot reasonably do: namely, update in place. This severely limits how a user-space application can be run and for how long. Let’s say this user-space application can’t be shut down without some effect on the user’s end goal. Unreasonable? Sure, but note that a lot of runtime processes are not designed with rapid updates and things like checkpointing in mind, so they cannot be re-run from a point-in-time snapshot. And despite things like ksplice for updating the kernel in place, it has… limitations. Limitations relating to struct layout tend to cause things to go boom.

                  In my aforementioned case, struct layout and its impact on memory can also severely change how well user-space code runs. Say you add another byte to a struct that was at 32 bytes of memory already. Now you’re requiring 40 bytes per struct (the extra byte plus 7 bytes of padding for 8-byte alignment), and since a slab-style allocator rounds that up to the next size class, 64 bytes, it’s likely you’re wasting 24 bytes of memory per object and hurting caching of data in the processor in ways you might not know. Say you decide to make it a pointer instead: now you’re hitting memory differently and also changing the overall behavior of how everything runs.
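
                  To make that arithmetic concrete, here is a minimal user-space illustration (the struct is a made-up stand-in, not the real sk_buff, and it assumes an LP64 platform where long is 8 bytes):

                      #include <stdio.h>

                      struct packed_nicely {   /* 4 * 8 = 32 bytes: fills a 32-byte slab object exactly */
                          long a, b, c, d;
                      };

                      struct one_byte_bigger { /* one extra byte... */
                          long a, b, c, d;
                          char flag;
                      };                       /* ...padded to 40 bytes for 8-byte alignment */

                      int main(void)
                      {
                          printf("before: %zu bytes\n", sizeof(struct packed_nicely));   /* 32 */
                          /* 40 bytes: a kmalloc-style allocator rounds this up to its
                           * 64-byte size class, wasting 24 bytes per object, and two
                           * objects no longer share a single 64-byte cache line */
                          printf("after:  %zu bytes\n", sizeof(struct one_byte_bigger)); /* 40 */
                          return 0;
                      }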

                  I’m only scratching the surface here, but I’m not sure how one can conclude that kernel development isn’t qualitatively different, state-wise, from a web application. I’m not denigrating web app developers either, but I don’t know of many of them worrying that adding a single byte to a struct causes more cache invalidation, making things ever so slightly slower for what a user-space process sees. Both involve managing state, but making changes in the kernel can be frustrating when a simple 3-line change can cause odd space leaks in how user applications run. If you’re wondering why Linus is such a stickler about breaking user space, it’s because it’s really easy to do.

                  I also wish I could magically trip every heisenbug related to long-running processes abusing the scheduler, VM subsystem, filesystem, and network, but much like any programmer, I find that bugs at the boundary are hard to replicate. It’s also hard to debug when all you’ve got is a memory image of the state of the kernel when things broke. What happened leading up to that is normally the important part, and it’s entirely gone.

                  What you might have wanted is to compare throwing more hardware at the problem (so that you can run more code in a less brittle part of the system) with continuing with the current situation. And then there would be questions of managing deployments and their reproducibility, the possibility or impossibility of redundancy, fault isolation, etc. I guess in your specific case the current situation is optimal by some parameters.

                  Not sure it’s optimal, but it’s a bit of a casus belli: if you have to run a suite of user-land programs that have been known to trigger bad behavior, and run them for a month straight to be overly certain things aren’t broken, throwing more hardware at it doesn’t help. Just as throwing nine women at the baby-making problem won’t produce a baby any faster, sometimes the time it takes to know things work just can’t be reduced. You can test more in parallel, sure, but even then you run into hardware cost issues.

                  How much do the things discussed in the original post apply to your situation by the way? Do you try to cherry-pick fixes or to stabilize an environment with minimal necessary minor-version upgrades?

                  Pretty much that: cherry-pick changes as needed and stick to a single kernel revision. Testing is mostly done on major version changes, i.e., upgrading from version N to version M, reapplying the changes, and letting things loose to see what the tree shaking finds on the ground. Then debugging whatever introduced a bug, fixing it, and testing some more.

                  Generally, though, the month-long runs tend to turn up freak-of-nature bugs. But god are they horrible to debug.

                  Hopefully that helps explain my vantage point a bit. If it’s unconvincing, feel free to ask for more clarification. It’s hard to get too specific for legal reasons, but I’ll do as well as I can. Let’s just say I envy every user-space application for the debugging tools it has. I wish to god the kernel had something like rr, to debug back in time and watch a space leak happen, for example.

                  1. 1

                    Thanks a lot.

                    Sorry for the poor word choice: I meant that the end-goal problems you solve are on a continuum with no bright cutoffs, one that passes through the tasks currently solved by the most complicated web systems, by other user-space systems, by embedded development (let’s say small enough to have no use for a filesystem), by other kinds of kernel development, etc. There are no clear borders, and there are large overlaps and crazy outliers. I guess if you had said «orders of magnitude», I would just agree.

                    On the other hand, poor word choice is the most efficient way to make people tell interesting things…

                    I think a large subset of the examples you gave actually confirms the point I have failed to express.

                    Deploying web applications doesn’t have to reset the process; it is just that many large systems now throw enough hardware at it to reset the entire OS instance. Reloading parts of the code inside the web application works fine, unless a library leaks an fd on some rare operation and the server process fails a week later. Restarting helps, that’s true. Redeploying a new set of instances takes more resources and needs to be separately maintained, but it allows you to shrug off some other problems (many of which you have enumerated).

                    And persistent state management still requires effort for web apps, but less than it did before more resources were thrown at it.

                    I do want to hope that at some point kernel debugging (yes, device drivers excluded, that’s true) by running a Bochs-style CPU emulator under rr becomes feasible. After all, this is a question of throwing resources at the problem…

                    1. 1

                      Deploying web applications doesn’t have to reset the process; it is just that many large systems now throw enough hardware at it to reset the entire OS instance. Reloading parts of the code inside the web application works fine, unless a library leaks an fd on some rare operation and the server process fails a week later. Restarting helps, that’s true. Redeploying a new set of instances takes more resources and needs to be separately maintained, but it allows you to shrug off some other problems (many of which you have enumerated).

                      Correct, but this all depends on the application. A binary, for example, would necessarily have to be restarted somehow, even if that means re-exec()’ing the process to get at the new code. Unless you’re going to dynamically load in symbols on something like a HUP, it seems a bit simpler to just do a load-balanced setup: bleed off connections, restart, and let connections trickle back in. But I don’t know, I’m not really a web guy. :)
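
                      As a rough sketch of what I mean by re-exec()’ing (hypothetical; it assumes argv[0] holds a resolvable path to the binary, and a real daemon would drain in-flight work and carry listening sockets across the exec):

                          #include <stdio.h>
                          #include <unistd.h>

                          int main(int argc, char **argv)
                          {
                              (void)argc;
                              /* ... finish outstanding requests, serialize any state,
                               * mark FDs that should survive the exec ... */
                              execv(argv[0], argv); /* replace this image with the binary on disk */
                              perror("execv");      /* reached only if the exec failed */
                              return 1;
                          }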

                      I do want to hope that at some point kernel debugging (yes, device drivers excluded, that’s true) by running a Bochs-style CPU emulator under rr becomes feasible. After all, this is a question of throwing resources at the problem…

                      I highly doubt that will ever happen, but I wish it would. Sadly, qemu, bochs, etc. all have issues with perfect emulation of CPUs.

            2. 2

              It’s not like we don’t have the software. GitHub deployed to production 175 times in one day back in 2012. Tech product companies often do continuous deployment, with gradual rollout of both app versions across servers and features across user accounts, all that cool stuff.

              The “enterprise” world is just not designed for change, and no one seems to be changing that yet.

            3. 1

              if it ain’t broke, don’t fix it

              And if it isn’t seriously affecting profitability yet, it ain’t broke.

              Even if there are known unpatched vulnerabilities that expose the people you have a duty to protect to increased risk.

          2. 1

            The article’s recommendation seems to be to create a separate branch for LTS backports (so that new development can initially happen more easily) and to (maybe gradually) hand over most of the control of the backports, unless those users are already a significant share of the project’s contributors (regardless of the form of contribution).

            Whether this recommendation is aligned with your motivation for the project is another question.

          3. 7

            Who is this aimed at?

            Is the author suggesting Node.JS shouldn’t provide an LTS? Linux?

            I don’t understand.

            1. 6

              I interpreted this as being directed at smaller projects within the Node.js community, like Gulp, which requires 0.10 compatibility for changes. Gulp is a task runner primarily used for build pipelines in the frontend community.

              Despite the project having a very small core team, Gulp versions 3 and 4 continue to support the Node.js 0.10 lineage. This means the project cares deeply about how its dependencies are written and what features they use, and the maintainers feel the burn when dependencies change to use the latest and greatest. This, naturally, makes it more difficult for contributors to develop new features and provide new contributions.

              Why Node.js 0.10? This is a lineage that long pre-dates the Node Foundation and io.js. What makes it still relevant? It’s the version still supported by the Debian LTS team in Wheezy, and soon by the LTS team in Jessie. It will presumably be the lineage shipped until April 2020, when that LTS expires.

              It was important to the developers of Gulp that they supported the versions a user trying to replace ad-hoc shell scripts would have available to them.

              1. 4

                Debian will use an old version of Gulp because this is a contract they have with their users. If Debian randomly and regularly upgraded programs to new versions of things that behave differently (different/incompatible command-line arguments, etc), then many people would probably not use Debian.

                If the Gulp developers don’t backport security and bug fixes, then either the Debian package maintainers will do it and the fixes will live in Debian, making the Gulp developers look stupid, or Debian won’t ship Gulp.

                So I get why Gulp developers will backport as long as it’s easy enough, but I don’t get why the author cares what Gulp does.

                1. 3

                  In general I agree, but consider that the release notes for the most recent couple of Debian versions have had a great big “NO SECURITY FIXES FOR NODE/V8” on them. I don’t know that it helps much for Gulp to do the right thing when Node itself is such a mess that the Debian team gave up and labeled it a lost cause due to the high-volume torrent of CVEs it produces.

                  1. 1

                    Interesting. I’d not noticed that small paragraph in Chapter 5^1 of the release notes before. I downloaded the referenced debian-security-support package, but I didn’t see anything inside it mentioning the lack of security or LTS support for the nodejs, libv8, or node-* packages.

                    It’s possible I’ve just not found the relevant bits.

                    1. 2

                      I re-read them, and they are definitely not as emphatic as I remember. They are, however, extremely sarcastic:

                      Unfortunately, this means that libv8-3.14, nodejs, and the associated node-* package ecosystem should not currently be used with untrusted content, such as unsanitized data from the Internet.

                2. 2

                  I never thought Gulp would be the shining beacon of how to do things right, but here we are. What a great example to follow!

                  1. 1

                    Yea, I agree to some extent. Long-term support is important, and it’s actually not that difficult so long as you have a good set of automated unit and/or integration tests that you run from your CI system.

                    The last company I worked at had over 100 unit tests per microservice and that made it really easy to quickly update dependencies or move to entirely new platforms. If we do a big update and something breaks, we can just add a new test to prevent it from happening in the future. Is something not relevant anymore? Make sure it’s covered in the integration tests and discard the old unit tests.

                    There’s nothing wrong with long-term support, so long as you’re not supporting legacy stuff that isn’t maintained anymore, or carrying dependencies you haven’t updated in forever that are rotting. (That being said, you shouldn’t update jars unless you need to for features or security, but it’s good to keep things as up to date as possible, because if package A depends on X 0.12 and B depends on X 1.13, something like sbt will pull in the later one, which could break everything. We had this problem with json4s… also, never use json4s for anything, ever.)

                3. 4

                  I think the author explicitly says that it is aimed at relatively small projects. The recommendation is to discontinue previous releases as long as no contributor is actually using them (or being paid by someone else to maintain them).

                  The author mentions that Node.JS long-term releases are maintained separately from the main development, by people on the payroll of enterprise Node users.

                  1. 2

                    … okay. What small projects are they thinking about?

                    1. 2

                      No idea.

                      Maybe I am wrong, and the sibling comment is right that this is more about dropping support for older releases of dependencies than about older releases of the project per se.

                4. 4

                  I like the idea of (expensive) extended paid support. Most users will upgrade in a reasonable time frame; the slowest (and richest) enterprises will pay the price.

                  1. 3

                    Sometimes software just works and isn’t worth upgrading. New versions often mean new pain for users. This is why I hate webapps which update whether you like it or not.

                    This is also why I love projects like Linux, and the C++ programming language. Messing with users is unacceptable. User needs over developer needs any day. Does it make writing software harder? Hell yes.

                    1. 1

                      Seemed to me to be aimed at some sort of power struggle going on within Node.js…

                      1. 1

                        Projects have their own goals, and I don’t see why those should be dictated by distros.

                        I’m very much in favour of projects setting out their approach to support in a way that works for them. Ultimately, if $distro wants to maintain an ancient version of your work indefinitely, then good luck to them.

                        One project I’m involved in has a take on this which boils down to “we work on all versions of Ruby and Rails still in security support by upstream”. It felt like a reasonable trade-off to make, considering the finite amount of time we have to work on it.

                        1. 1

                          It seems to be aimed at a specific project, but which one?

                          I don’t think what he describes is a general attitude in open source. In my experience, the biggest issue with small projects is that many, even very popular ones, are pretty much unmaintained or abandoned, with nobody who has permission to merge PRs.

                          I’ve never seen a project that refuses PRs due to an LTS policy. Some, however, don’t want to merge because a change breaks backward compatibility, which maybe is what this article is talking about. But it’s generally a good thing to have someone who can say “no” to a change if it’s going to break the build for dozens of users.