1. 92
    1. 15

      Sounds a lot like Erlang’s “let it fail”.

      I find over and over again that any system that models the world as intercommunicating tasks and has a designer who really works out all the details for robustness… tends to look a lot like Erlang.

      1. 4

        Well, the post even mentions Erlang, in the context of only having one supervisor instead of an Erlang-style supervisor tree.

        1. 1

          Had the same thought while reading the post.

        2. 7

          This makes me want to patch the libc syscall stubs to turn EINVAL into an exception. I wonder what bugs would appear?
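
          Something like this, I mean (a rough Rust sketch rather than an actual libc patch; ErrorKind::InvalidInput stands in for raw EINVAL and a panic stands in for the exception):

          ```rust
          use std::fs::File;
          use std::io;

          // Sketch only: instead of patching the real libc stubs, wrap fallible
          // calls so an EINVAL-class error aborts at the call site rather than
          // being handed back for the caller to (mis)handle.
          fn no_einval<T>(what: &str, res: io::Result<T>) -> io::Result<T> {
              if let Err(e) = &res {
                  if e.kind() == io::ErrorKind::InvalidInput {
                      // The "exception": die loudly, right where the bad call happened.
                      panic!("{what} failed with EINVAL: {e}");
                  }
              }
              res
          }

          fn main() {
              // Ordinary errors (e.g. NotFound) still come back as values...
              let _ = no_einval("open", File::open("/no/such/file"));
              // ...only EINVAL-class failures would panic at the call site.
          }
          ```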

          1. 9

            You’re (half-)kidding, but I’ve seen something like that be very useful. Years ago, I worked on a kernel that was superficially similar to Hubris in some design aspects (although, this being 2011 or so, it was all C. Sue me). At one point, a colleague and I chased a truly awful bug for two very long evenings, and eventually traced it to a library function that returned a somewhat non-committal error code, the idea being that the caller was in a better position to figure out whether it was fatal or not.

            After we got all the swearing out of our system, we came up with a pretty original scheme. We had weekly testing sessions (this was a small company that didn’t have a dedicated QA department, so we all took turns doing weekly QA sessions for things that could only be checked manually – these were embedded systems that needed some functional testing), and every 2-3 weeks, we’d have two people do the testing instead of the usual one, by rotation. One of us would get a “nasty” version, where system library code would trigger an “exception IRQ” (we didn’t really have useful support for exceptions, but we improvised) instead of returning something like EINVAL.

            I don’t know that we discovered many truly bad bugs with it. What it did help us find, though, were a lot of corner cases that we were not handling particularly well. For example, our power management code dealt poorly with some interaction patterns, and ended up falling back to a needlessly loose default handler that burned far more battery than necessary.

          2. 6

            The REPLY_FAULT primitive sounds good to me. I once worked for a company with a proprietary embedded kernel that used SEND/RECV/REPLY primitives. Much better than using locks and shared memory. Also better than the Actor model: the difference is that SEND is a blocking primitive, and that solves a lot of synchronization problems. REPLY_FAULT is a really nice addition to the paradigm.
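
            For anyone who hasn’t used that style, the rendezvous looks roughly like this (a sketch using Rust threads and channels standing in for the kernel primitives; a zero-capacity channel plays the role of the blocking SEND):

            ```rust
            use std::sync::mpsc;
            use std::thread;

            // "SEND" carries the payload plus a place for the "REPLY" to land.
            struct Request {
                payload: u32,
                reply_to: mpsc::Sender<u32>,
            }

            fn main() {
                // Capacity 0 = rendezvous: send() blocks until the server receives.
                let (send_tx, recv_rx) = mpsc::sync_channel::<Request>(0);

                // Server task: RECV, do the work, REPLY.
                let server = thread::spawn(move || {
                    for req in recv_rx {
                        let _ = req.reply_to.send(req.payload * 2);
                    }
                });

                // Client task: SEND blocks until the server takes the message,
                // then blocks again on the REPLY. No queue, no locks.
                let (reply_tx, reply_rx) = mpsc::channel();
                send_tx
                    .send(Request { payload: 21, reply_to: reply_tx })
                    .unwrap();
                println!("reply: {}", reply_rx.recv().unwrap());

                drop(send_tx); // closing the request channel lets the server exit
                server.join().unwrap();
            }
            ```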

            1. 1

              You still want to be able to send a message and just continue execution in the same context. So in a blocking model you have to receive the reply before you can continue. In Erlang you achieve synchrony by explicitly choosing to wait for the reply.

              I really don’t see much difference between these two approaches: async-by-default vs. sync-by-default.

              1. 5

                What QNX does is: if you send a message to a process that is waiting to receive one, the receiver inherits your timeslice and the kernel switches directly to that process. I imagine it papers over a lot of performance issues with this.

                1. 2

                  iirc Google uses a scheduler patch to achieve something similar - see this talk. They add a SwitchTo syscall to hand off control to another thread.

                2. 4

                  You can simulate asynchronous message passing (actor model) using synchronous message passing, and vice versa. In practice this leads to different idioms being used for programming the two kinds of systems.
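
                  For example, the asynchronous direction can be recovered by putting a queue task in the middle. A rough sketch (a zero-capacity Rust channel stands in for the synchronous SEND; all names are made up):

                  ```rust
                  use std::collections::VecDeque;
                  use std::sync::mpsc;
                  use std::thread;

                  // A queue *task* in the middle turns synchronous sends into
                  // fire-and-forget delivery: the sender's SEND returns as soon
                  // as the queue task has accepted the message; the recipient
                  // pulls it out later.
                  enum QueueOp {
                      Push(u32),
                      Pop(mpsc::Sender<Option<u32>>),
                  }

                  fn main() {
                      // Zero capacity = rendezvous, the stand-in for a synchronous SEND.
                      let (q_tx, q_rx) = mpsc::sync_channel::<QueueOp>(0);

                      thread::spawn(move || {
                          let mut buf = VecDeque::new();
                          for op in q_rx {
                              match op {
                                  QueueOp::Push(msg) => buf.push_back(msg),
                                  QueueOp::Pop(reply_to) => {
                                      let _ = reply_to.send(buf.pop_front());
                                  }
                              }
                          }
                      });

                      // "Asynchronous" sender: hand the message over and move on.
                      q_tx.send(QueueOp::Push(7)).unwrap();

                      // Recipient asks for it whenever it's ready.
                      let (reply_tx, reply_rx) = mpsc::channel();
                      q_tx.send(QueueOp::Pop(reply_tx)).unwrap();
                      println!("got: {:?}", reply_rx.recv().unwrap());
                  }
                  ```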

                  Synchronous message passing has some advantages for the kind of system described by the author, Cliff Biffle (a microkernel for an embedded system with constrained resources).

                  • Synchronous message passing requires fewer resources to implement, because you don’t need a message queue.
                  • Synchronous message passing supports synchronization as a primitive, without having to build that up out of asynchronous primitives. This means that the message passing primitives provide more guarantees out of the box, which helps make programs easier to reason about. In the particular case described by the article, REPLY_FAULT can be used to guarantee that a task will be terminated if it passes bad arguments to a server, preserving a stack trace at the exact point where the bad arguments were sent, and you can’t implement that guarantee in an actor system.
                  1. 3

                    Async vs. sync seems irrelevant to the main point, which is to have a REPLY_FAULT that abruptly stops a buggy client task.

                    There’s a discussion in the Hubris manual about why IPC is synchronous (except notifications, which are async).

                    1. 5

                      I would say that async vs sync is central to the main point. REPLY_FAULT guarantees to abruptly stop a buggy client task at the exact point where it is sending a message containing invalid arguments, leaving a stack trace for the debugger. You can’t implement that guarantee using async message passing (the actor model), because the task has already moved on after leaving a bad message in the message queue. The Hubris link you gave says “synchronous IPC makes the system much easier to think about”. The ability to implement REPLY_FAULT is an example of this.
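
                      To make the guarantee concrete, here is a thread-and-channel model of it (not the actual Hubris API; a panic inside the blocking send stands in for the kernel faulting the task, so the backtrace points at the offending send):

                      ```rust
                      use std::sync::mpsc;
                      use std::thread;

                      // The server never hands back a "you passed garbage" error
                      // value for the client to mishandle; instead the client's
                      // blocking send dies on the spot.
                      enum Reply {
                          Ok(u32),
                          Fault(&'static str),
                      }

                      struct Request {
                          divisor: u32,
                          reply_to: mpsc::Sender<Reply>,
                      }

                      /// Client-side blocking send: returns the answer or kills the caller.
                      fn send_divide(tx: &mpsc::SyncSender<Request>, divisor: u32) -> u32 {
                          let (reply_tx, reply_rx) = mpsc::channel();
                          tx.send(Request { divisor, reply_to: reply_tx }).unwrap();
                          match reply_rx.recv().unwrap() {
                              Reply::Ok(v) => v,
                              // The analogue of being REPLY_FAULTed: the client ends
                              // here, at the call site that sent the bad arguments.
                              Reply::Fault(why) => panic!("faulted by server: {why}"),
                          }
                      }

                      fn main() {
                          let (tx, rx) = mpsc::sync_channel::<Request>(0);

                          thread::spawn(move || {
                              for req in rx {
                                  let reply = if req.divisor == 0 {
                                      Reply::Fault("divide by zero is a caller bug")
                                  } else {
                                      Reply::Ok(100 / req.divisor)
                                  };
                                  let _ = req.reply_to.send(reply);
                              }
                          });

                          println!("100 / 4 = {}", send_divide(&tx, 4));
                          println!("100 / 0 = {}", send_divide(&tx, 0)); // dies here, by design
                      }
                      ```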

                3. 1

                  This seems great, but I wonder how much less error-resistant programs will be on Hubris. Throwing an error at runtime, only under some conditions, somewhere deep down the call stack is exactly what we learned to avoid from type errors in dynamically typed languages. To me, runtime assertions are similar to those: what’s the type system for if it can’t prevent your program from crashing anyway?

                  Though I’ll admit I have not looked at Hubris’s API design. It’s totally possible they do have a strongly-typed API that would catch these mistakes early via type checking.

                  1. 11

                    From the article, IPCs do have strongly typed APIs:

                    Tasks are connected to each other by configuration in the build system, so it’s hard to confuse one for the other. Clients use generated Rust code to construct and send IPCs to servers, which use different generated Rust code to handle the result. This lets us squint and pretend that the type system works across task boundaries — it doesn’t, really, but our tools produce a pretty good illusion.
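
                    The generated stubs presumably have roughly this shape (hypothetical names and error type; this is not the code Hubris actually emits):

                    ```rust
                    // Hypothetical generated client stub: illustrates the shape, not the
                    // actual output of Hubris's IPC code generator. TaskId, SpiError and
                    // spi_exchange are made-up names.
                    pub struct TaskId(pub u16);

                    #[derive(Debug)]
                    pub enum SpiError {
                        BadArg,
                        ServerDied,
                    }

                    /// Marshals the arguments, performs the blocking send to the SPI server
                    /// task, and unmarshals the reply into a typed Result, so callers never
                    /// touch raw message buffers.
                    pub fn spi_exchange(
                        server: TaskId,
                        tx_data: &[u8],
                        rx_data: &mut [u8],
                    ) -> Result<(), SpiError> {
                        // ... a real stub would serialise, send, and deserialise here ...
                        let _ = (server, tx_data, rx_data);
                        Ok(())
                    }

                    fn main() {
                        let mut rx = [0u8; 4];
                        // The compiler enforces the argument types and forces error handling.
                        match spi_exchange(TaskId(3), &[0x9f], &mut rx) {
                            Ok(()) => println!("exchanged: {rx:?}"),
                            Err(e) => println!("spi error: {e:?}"),
                        }
                    }
                    ```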