1. 7
  1.  

  2. 6

    Man, I would not want to admin a service where walking the dependency graph was in danger of exhausting stack.

    1. 3

      Well - that is a point. I’d expect that the number of services required to exhaust the stack would be pretty high, much higher than what you’d see on a normal system. Maybe reducing stack usage is an unnecessary concern, in this case. However, it’s always bothered me that the precise amount of stack available is something of a magic number, and that we rely so heavily on not exceeding that number even without knowing what it is, how much particular functions use, and in particular without having a reliable way to cleanly back out when we do overflow the stack.

      (the post was intended largely as an illustration of a general principle - that of playing it safe, and not assuming resources will be available - rather than as a solution to a problem that was actually being observed. This extends to many other things, but it’s hard to cover them in a concise blog post, and it’s hard to find time to write a lengthier blog post, unfortunately…)

      1. 6

        The author of https://bearssl.org/ did something super interesting. He implemented parts of his code in a Forth variant, and also wrote a static checker to guarantee a maximum stack depth, so he can allocate his coroutine stacks in a fixed-size buffer with assurance that they will not go out of bounds, and without dynamic allocation.

        Super awesome stuff.

        1. 5

          Knowing how to avoid recursion, and converting to an explicit stack is definitely a useful skill. No argument.

          In this scenario, I might also simply consider a depth counter. Adding a limit is simpler, and one could argue it should be the default for any recursive function. It’s still a guessing game to right-size the limit, but one can usually make some assumptions, multiply by a fudge factor, and still come out with something that avoids catastrophic failure.
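
          To make the idea concrete, here’s a minimal sketch of a depth-limited recursive dependency walk; the names (service_rec, start_service) and the limit are invented for illustration and aren’t taken from Dinit or the post.

          ```cpp
          #include <vector>

          // Hypothetical sketch: a recursive dependency walk with a depth limit.
          struct service_rec {
              std::vector<service_rec *> dependencies;
              bool started = false;
          };

          constexpr int max_depth = 256;  // guessed limit, padded with a fudge factor

          // Returns false (instead of overflowing the stack) if the dependency
          // chain is deeper than we are willing to recurse.
          bool start_service(service_rec &svc, int depth = 0)
          {
              if (depth > max_depth) return false;   // back out cleanly
              if (svc.started) return true;
              for (service_rec *dep : svc.dependencies) {
                  if (!start_service(*dep, depth + 1)) return false;
              }
              svc.started = true;                    // real code would launch the process here
              return true;
          }
          ```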

          1. 2

            Kind of a side note: some languages include an independent recursion depth counter which would be exceeded before the real stack is exhausted. For example, Python raises a catchable RuntimeError, allowing you to potentially recover. Not sure if C++ has such a feature.

            (No, I’m not advocating you should write an init system in Python :P )

            1. 1

              Not sure if C++ has such a feature.

              Nothing built-in. You could probably wrap a depth counter in a class which auto-increments in its copy constructor, and pass it as a parameter to each function (but of course the function signature has to change accordingly).
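
              For what it’s worth, a rough sketch of that idea (purely illustrative, nothing standard) might look like this:

              ```cpp
              #include <iostream>
              #include <stdexcept>

              // A counter that increments itself in its copy constructor and throws once
              // a chosen limit is exceeded; passing it by value into each recursive call
              // is what tracks the depth. Hypothetical helper, not a standard C++ facility.
              class depth_guard {
                  int depth;
                  int limit;
              public:
                  explicit depth_guard(int l) : depth(0), limit(l) {}
                  depth_guard(const depth_guard &other) : depth(other.depth + 1), limit(other.limit)
                  {
                      if (depth > limit) throw std::runtime_error("recursion limit exceeded");
                  }
              };

              void recurse(int n, depth_guard g)  // g is copied (and checked) on every call
              {
                  if (n == 0) return;
                  recurse(n - 1, g);
              }

              int main()
              {
                  try {
                      recurse(1000000, depth_guard(1000));
                  } catch (const std::runtime_error &e) {
                      std::cout << "caught: " << e.what() << '\n';  // recovered with the stack intact
                  }
              }
              ```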

      2. 4

        Something I kept thinking about while reading this is a talk about OpenBSD’s rc.d. It’s very different and goes all-in on simplicity, deciding on limitations early on, while working around “rough edges” with decoupled tools.

        1. 2

          This kind of scheme is good when your service set is small and services don’t have any complicated dependency hierarchy - which is probably true on most systems. However, some frameworks do also manage service dependencies; it adds a significant amount of complexity, but it is a useful (and sometimes necessary) feature. Trade-offs do have to be made. The problem I’m facing with Dinit is that I want this more powerful feature set without sacrificing robustness. (I’m aiming for a middle ground in terms of complexity).

          But yes, for a lot of systems the rc.d scheme, or a management system without first-class dependency management like s6 / runit / daemontools, is fine. You can even fudge dependencies to some degree, usually, e.g. by having the start script for one service also run the command to start a dependency.

          1. 2

            Let’s not forget OpenRC and FreeBSD’s rc.d, both of which have had dependency management for ages.

            However, I have to say that while I have written quite a few init files myself (configs, scripts, units, call them what you please) and designed systems around them, I don’t actually feel all that convinced of the benefits. Sure, it feels lighter, but when I think back to the times I built complex systems with both dependency-based and “numerically” ordered schemes, I don’t think either gave me a big advantage when setting them up, working with them, or debugging them. Whenever they failed in interesting ways, each had things that helped me and things that held me back.

            A system where services start in linear order sometimes gave me the benefit of a good overview, which made it easier to go over the whole thing one service at a time. However, this also means that one really has to go over each of them.

            If I am able to have more complex dependency graphs, it’s easier to look at just one part and (hopefully) not have to care about the other parts. That makes changes easier - provided the graph really is completely bug-free, which is hard to verify, because unit files are far too often written in a fairly stand-alone way: “I need network” is frequently all that is checked for, and it’s hard to blame an author when the service is a flexible one. What pops up much more frequently is that your setup doesn’t fit what the unit file’s author had in mind. Another thing I frequently see, in the systemd world too, is different names for the same dependency/service, or the same name for dependencies expecting different states of the system. To stick with a basic example: does “network up” mean that the interface is up, that DHCP worked, that OpenVPN is also up, that the local DNS cache is up, etc.?

            So while I prefer full-featured support for dependencies, I think that is mostly because I am more used to these systems. I am not so convinced that they are objectively better, even in big setups. I have certainly seen them fail in ways that would have been harder to hit with a stupid list of targets that get worked through one by one.

            I see this as independent of parallel startup of services, which again can be tricky and fail in its own ways. However, while this is an argument I frequently hear against certain systems, I have to say that problems in this area usually only surfaced underlying problems (buggy unit files, for example). So in practice I cannot right now think of an instance where this was a problem for me - only on the desktop, where certain applications were started before the network was fully up.

            1. 1

              Just make the part that orchestrates it deterministic with checks on the dynamic aspects. Don’t even do recursion in terms of how it’s implemented: that’s just an iterative process simulating the recursion with checks at each step. Pre-allocate whatever you’ll likely need for a system with a fail-fast strategy. Get the memory, stack, etc ahead of time so you’re allocating within the process on most stuff. The limit of what it can grab becomes the bounds. You also have memory on the inside to handle the failure conditions on what remains.
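
              If I’m reading the suggestion right, the shape would be something like the sketch below: an explicit, pre-reserved work list instead of recursion, with a fail-fast check at each step. The names and the fixed bound are invented for illustration, not taken from any real init system.

              ```cpp
              #include <cstddef>
              #include <vector>

              // Illustrative only: iterative dependency walk over a pre-reserved work list,
              // failing fast if the chosen bound would be exceeded. Cycle detection and
              // duplicate suppression are omitted for brevity (the bound still catches them).
              struct service {
                  std::vector<service *> deps;
                  bool started = false;
              };

              constexpr std::size_t max_pending = 1024;   // bound decided ahead of time

              bool start_all(service &root)
              {
                  std::vector<service *> pending;
                  pending.reserve(max_pending);            // grab the memory up front
                  pending.push_back(&root);

                  while (!pending.empty()) {
                      service *svc = pending.back();
                      bool deps_ready = true;
                      for (service *dep : svc->deps) {
                          if (!dep->started) {
                              if (pending.size() == max_pending) return false;  // over budget: fail fast
                              pending.push_back(dep);
                              deps_ready = false;
                          }
                      }
                      if (deps_ready) {
                          svc->started = true;              // real code would launch the process
                          pending.pop_back();
                      }
                  }
                  return true;
              }
              ```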

              Also, you can have a simple core with other stuff split out of it and isolated. It’s the standard pattern of high-assurance systems. All the industrial microkernels do it, too. This one with self-healing ain’t exactly simple in its user-land either but is robust as can be:

              http://www.qnx.com/content/qnx/en/products/neutrino-rtos/neutrino-rtos.html

              1. 2

                Pre-allocate whatever you’ll likely need for a system with a fail-fast strategy. Get the memory, stack, etc ahead of time so you’re allocating within the process on most stuff

                This is, essentially, what I’ve described: there’s no allocation for the start queue because the service records already embed list nodes. Getting stack ahead of time, and knowing how deep that will let you recurse, is pretty much impossible with portable C/C++ though (but again, as the post says, recursion is avoided anyway). The material in the post is really just one example, but the general principle in Dinit is (if I understand correctly) the same as the principle you’re talking about here.
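
                (To illustrate the “list nodes embedded in the service records” point for anyone following along: the idea is roughly the sketch below, with hypothetical names; Dinit’s actual data structures may well differ.)

                ```cpp
                // Rough illustration of an intrusive start queue: the link lives inside the
                // service record itself, so queueing a service never allocates and cannot fail.
                struct service_record {
                    service_record *next_in_queue = nullptr;  // embedded list node
                    bool queued = false;
                    // ... the rest of the service state ...
                };

                struct start_queue {
                    service_record *head = nullptr;
                    service_record *tail = nullptr;

                    void enqueue(service_record *svc)
                    {
                        if (svc->queued) return;               // a record is queued at most once
                        svc->queued = true;
                        svc->next_in_queue = nullptr;
                        if (tail) tail->next_in_queue = svc; else head = svc;
                        tail = svc;
                    }

                    service_record *dequeue()
                    {
                        service_record *svc = head;
                        if (svc) {
                            head = svc->next_in_queue;
                            if (!head) tail = nullptr;
                            svc->queued = false;
                        }
                        return svc;
                    }
                };
                ```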