1. 47
    1. 21

      These invariants are implicit: the assumptions themselves are not explicitly represented in the source code. That means there’s no easy way to, say, mechanically extract them via static analysis.

      What this means is that these kinds of failure modes are inevitable.

      The right conclusion imo isn’t “failure is inevitable”, but “implicit invariants must be made explicit”. Whether by capturing them in tests that will break upon violation, by leaning on your type system, or any other method.
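
      For instance, a minimal sketch in Rust of the “capture it in a test / lean on the type system” idea; the power-of-two invariant here is invented purely for illustration:

      ```rust
      /// A capacity that is a power of two by construction; holders of
      /// this type can rely on the invariant without re-checking it.
      struct PowerOfTwo(usize);

      impl PowerOfTwo {
          fn new(n: usize) -> Option<Self> {
              // The invariant lives at a single construction site.
              let is_pow2 = n != 0 && (n & (n - 1)) == 0;
              is_pow2.then_some(PowerOfTwo(n))
          }
      }

      #[cfg(test)]
      mod tests {
          use super::*;

          #[test]
          fn rejects_non_powers_of_two() {
              // Breaks loudly if someone later relaxes the check.
              assert!(PowerOfTwo::new(0).is_none());
              assert!(PowerOfTwo::new(3).is_none());
              assert!(PowerOfTwo::new(8).is_some());
          }
      }
      ```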

      1. 3

        I think you’re correct about invariants needing to be made explicit, but as a practical matter so many things happen on the way to production that this explication cannot be counted on to reliably occur, and thus the author’s grim conclusion that this failure mode is inevitable remains.

        A trivial, if extreme, example of this is a normally two-part implementation: the first part commits a fix, the second tests/registers the invariant. Well, suppose you get yanked off the work before getting to part two. Or you have a heart attack. Or a breakdown. Or you take a weekend and forget a subtlety when it’s time to do part two. Or you go back and tweak part one after building part two, etc. etc. In all of these cases, a trap has been set.

        I imagine that there are design regimes (safety critical systems, aerospace, etc.) where invariants are exhaustively managed and their lifecycles tracked–and that software is comparatively expensive to develop and maintain.

        1. 1

          I think it’s worse than that. The implicit invariants are unknown unknowns. Even with infinite resources and perfect management there’s no way to enumerate all the ways in which a different person in a different situation might use something unexpectedly.

          Are we supposed to blame the designer of the penny for the house fires caused by its misuse as a stand-in for proper fuses?

          1. 1

            The implicit invariants are unknown unknowns.

            We’re talking about assumptions that developers make but don’t document or otherwise enforce. For example, you may have a secret that is only ever supposed to be accessed by one particular piece of code. Yet that secret is provided as an ENV variable which is accessible to the entire application. The invariant – the limited access – is clear and known, but because the original dev is lazy, or because it’s hard to figure out how to enforce it, or any other reason, it is not enforced, or perhaps not even documented. This is an example of an implicit invariant that should be made explicit.
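
            A hedged sketch of making that particular invariant explicit in Rust - module privacy plus a capability token that can only be issued once. All names here are hypothetical:

            ```rust
            mod secrets {
                use std::sync::atomic::{AtomicBool, Ordering};

                /// Capability that can only be minted via issue_signer_cap;
                /// the private field keeps other code from constructing it.
                pub struct SignerCap(());

                static ISSUED: AtomicBool = AtomicBool::new(false);

                /// Hands out the capability exactly once (e.g. in main, to
                /// the one component that is allowed to use the secret).
                pub fn issue_signer_cap() -> Option<SignerCap> {
                    (!ISSUED.swap(true, Ordering::SeqCst)).then_some(SignerCap(()))
                }

                /// The secret is only reachable through the capability,
                /// instead of being a process-wide value anyone can read.
                pub fn signing_key(_cap: &SignerCap) -> Vec<u8> {
                    // In a real system this would come from a secret store,
                    // not an environment variable.
                    std::env::var("SIGNING_KEY").unwrap_or_default().into_bytes()
                }
            }
            ```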

        2. 1

          You can’t write a negative test for unknowns

          1. 9

            The whole post reads like an apologia for low-quality software. It sounds more like a lack of ability to understand and design a system clearly and unambiguously than a valuable observation. The given example is classic in that respect. Limitations are often features. Not supporting X can be the reason why Y has unbeatable reliability. These just need to be part of the design. Which I think is the case in the example.

            These invariants are implicit: the assumptions themselves are not explicitly represented in the source code.

            You are doing it wrong. As a programmer YOU decide what’s explicit and have full freedom to establish whatever contracts you find appropriate. Be explicit about them. Detail them to their canonical forms. Make them clear. Your invariants were not implicit by nature, you just didn’t care about them when you were writing the code.
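
            To make “be explicit about them” concrete, here is one sketch of such a contract in Rust. The Region type is invented, loosely echoing the memory-region theme of the article:

            ```rust
            /// A half-open memory region. Stated contract: non-empty, and
            /// `start + len` does not overflow.
            struct Region {
                start: usize,
                len: usize,
            }

            impl Region {
                /// The contract is written down at the single construction
                /// site; violations fail fast instead of corrupting logic later.
                fn new(start: usize, len: usize) -> Option<Region> {
                    start.checked_add(len)?; // no overflow
                    (len != 0).then_some(Region { start, len })
                }

                /// Cannot overflow, because every Region was built via `new`.
                fn end(&self) -> usize {
                    self.start + self.len
                }
            }
            ```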

            1. 11

              Lorin does formal methods and has written (and talked) a lot about improving reliability at Netflix, so I don’t think he’s apologizing for low quality software.

              1. 10

                at Netflix

                As someone using Netflix-developed tools at work, someone’s place of employment does not validate their expertise or ability.

              2. 7

                Honest question, what line of work do you do?

                I’ve done both research in CS and software dev as a founder, and for a startup that needs to figure out its domain while searching for clients and a market, there is absolutely no way to ship code that is bulletproof.

                Of course early stage startup development is a specific niche, but there is a large gap between writing concurrent code for libraries and growing a SaaS app.

                1. 4

                  Software Engineering. Design and implementation. I’ve been in many different industries, rather distinct from each other.

                  and for a startup that needs to figure out its domain while searching for clients and a market, there is absolutely no way to ship code that is bulletproof.

                  You’re treating code quality as a commodity that is just a matter of resources, such as time. I very much disagree with that assumption. You save time by doing a proper design. A startup has fewer requirements (variants), hence allows for a simpler and more robust design. Correctness is more within reach. When a company becomes larger and more complex, that’s where “there is no time for…” sets in. It is at that point that it becomes more difficult to keep complexity and bugs at bay. Not the opposite.

                  I didn’t mean that everyone should start to shield all their code with formal proof systems, as people seem to think.

                  What I meant is that a sound design and implementation go a long way from day one. And that the article reads like a blanket statement of questionable utility. It’s like “all software has bugs”. Certainly some has more and some has fewer, but is it poor luck, or can we as programmers have control over that?

                  1. 7

                    What I meant is that a sound design and implementation go a long way from day one. And that the article reads like a blanket statement of questionable utility. It’s like “all software has bugs”

                    In my personal experience, which could be quite biased since I don’t have kernel programming experience, reading this article felt like quite the opposite: I never felt like I was reading a complaint. I don’t think the article reads as just a “blanket statement of questionable utility” since, in the end, they are doing the domain a service by exposing their debugging techniques and, in doing so, showing how easy it is to get things wrong.

                    The whole explanation of the safety of this kind of error in the Hubris kernel shows the care they have for this craft, as he admits:

                    The original memory access check algorithm was blocking accesses by a correct program, not admitting invalid accesses by a wrong or malicious program

                2. 2

                  Not supporting X can be the reason why Y has unbeatable reliability.

                  From what I’ve seen of the ntpsec rewrite, this is definitely correct.

                  1. 7

                    Fascinating to see Virding’s law even in firmware. The first three “difference” bullet points are:

                    1. Fault isolation: isolated tasks restartable via idempotent messages
                    2. Failing toward safety: parse don’t validate
                    3. Safer shared memory: move-semantics IPC, tasks as servers

                    I’m biased, of course, but all I see is a victory for message passing, supervision trees, and type-safety patterns.
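
                    Point 2 in particular maps onto a pattern that is easy to show in Rust. A hedged “parse, don’t validate” sketch, with an invented message format:

                    ```rust
                    /// A request that can only exist in a valid state.
                    struct Read {
                        addr: u64,
                        len: u64,
                    }

                    /// Parsing happens once, at the trust boundary; everything
                    /// downstream receives a Read and never re-checks raw bytes.
                    fn parse(raw: &[u8]) -> Result<Read, &'static str> {
                        match raw {
                            [0x01, rest @ ..] if rest.len() == 16 => Ok(Read {
                                addr: u64::from_le_bytes(rest[..8].try_into().unwrap()),
                                len: u64::from_le_bytes(rest[8..].try_into().unwrap()),
                            }),
                            _ => Err("malformed message"),
                        }
                    }
                    ```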

                    1. 5

                      Such an awesome write-up! It’s really interesting to see how something this low-level is both built and debugged.

                      1. 4

                        Implicit invariants are, by definition, impossible to enforce explicitly

                        No, by definition they have not been enforced explicitly, whether or not they could be.

                        This makes me think - to what extent would it be useful to have a program, at startup, poll some kind of capability endpoint on its dependencies to tell the user whether or not its assumptions remain valid?
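
                        As a rough sketch of what that startup check could look like in Rust - the capability fields and the comparison are entirely made up:

                        ```rust
                        /// What this service assumes about a dependency.
                        struct Assumptions {
                            max_payload_bytes: u64,
                            retries_are_safe: bool,
                        }

                        /// What the dependency reports about itself, e.g. from
                        /// some /capabilities endpoint (fetching elided here).
                        struct Capabilities {
                            max_payload_bytes: u64,
                            idempotent: bool,
                        }

                        /// Run at startup: fail loudly if an assumption no longer holds.
                        fn check(a: &Assumptions, c: &Capabilities) -> Result<(), String> {
                            if c.max_payload_bytes < a.max_payload_bytes {
                                return Err(format!(
                                    "dependency accepts {} bytes, we assume {}",
                                    c.max_payload_bytes, a.max_payload_bytes
                                ));
                            }
                            if a.retries_are_safe && !c.idempotent {
                                return Err("we retry on timeout, but dependency is not idempotent".into());
                            }
                            Ok(())
                        }
                        ```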

                        1. 3

                          I think it’s more accurate to say that assumptions change over time. And I think this is what the author means anyway, so this is probably just a nitpick on words. But the motivating example is clearly phrased in terms of assumptions changing, and that, I think, is the more important point.

                          Think of it this way - no invariant is stated in a vacuum. It’s always stated under some set of assumptions about the system. The way to say that in logic is that the assumptions imply the invariant, or:

                          Assumptions => Invariant
                          

                          This immediately shows the problem - what happens when the Assumptions are true? Then the above statement is also true, but vacuously so (false => true is true, which trips up every student of logic ever). This is exactly the scenario described in the post:

                          This code, which was still simple and easy to read, was now also wrong.

                          For me, this doesn’t devalue invariants at all. The best part of thinking formally is that you think formally, i.e. you have to encode your assumptions. If the assumptions are wrong, you have to figure out how to express them more accurately. And yes, they do change over time. Which is why it’s even more important to have them written down. The much bigger problem is assumptions that change over time, and you don’t even know what they are.
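
                          One cheap way to write an assumption down in Rust is a compile-time assertion, so a changed assumption breaks the build instead of leaving the invariant vacuously intact (constants invented for illustration):

                          ```rust
                          /// Assumption: a lease always fits in a single region.
                          const REGION_SIZE: usize = 4096;
                          const MAX_LEASE_BYTES: usize = 4096;

                          // If someone later raises MAX_LEASE_BYTES past REGION_SIZE,
                          // the build fails here instead of the code silently becoming wrong.
                          const _: () = assert!(MAX_LEASE_BYTES <= REGION_SIZE);
                          ```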

                          1. 1

                            what happens when the Assumptions are true?

                            You mean when they’re false?

                          2. 2

                            Hearing about this bug and other ones on that Oxide podcast… I feel like there’s a lot of space for a “very rich” Oxide to invest a bunch into some formal proving techniques. I’d personally find a “Rust -> Agda/Coq” translator very interesting. Write your Rust, get translated code in those languages, then write proofs that your “spec implementation” behaves identically to an improved implementation… the level of formalism required is super high, but I feel like you could do something interesting if your only requirement was some command-pattern formalism in your Rust code.

                            EDIT: at the very least some property testing!
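
                            For what it’s worth, a property test for this kind of invariant is cheap to sketch with the proptest crate; the region logic below is invented, not Hubris’s actual check:

                            ```rust
                            use proptest::prelude::*;

                            /// Does [start, start + len) fit inside [0, region_len)?
                            fn fits(start: u64, len: u64, region_len: u64) -> bool {
                                start.checked_add(len).is_some_and(|end| end <= region_len)
                            }

                            proptest! {
                                #[test]
                                fn never_accepts_out_of_bounds(start in 0u64..u64::MAX, len in 1u64..u64::MAX) {
                                    // Property: anything accepted must end inside the region.
                                    if fits(start, len, 4096) {
                                        prop_assert!(start + len <= 4096);
                                    }
                                }
                            }
                            ```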

                            1. 1

                              Or in other words, formal verification doesn’t apply to systems modelling the real world.

                              1. 12

                                The exact opposite I think. Formal verification is the process of explicitly figuring out what invariants the code relies on, what invariants the code upholds, and testing that those are the same.

                                If Hubris was formally verified, the invariant that loaned memory would fit into one region would have been written down as a part of the proof of correctness, and when that invariant was changed the tooling would have told them that the code relying on it was no longer correct.
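
                                One concrete shape this can take for Rust today is a model-checking harness, e.g. in the style of the Kani verifier. The lease/region details below are invented:

                                ```rust
                                const REGION_SIZE: u64 = 4096;

                                fn lease_fits(offset: u64, len: u64) -> bool {
                                    offset.checked_add(len).is_some_and(|end| end <= REGION_SIZE)
                                }

                                #[cfg(kani)]
                                #[kani::proof]
                                fn lease_never_escapes_region() {
                                    // The invariant is written down here; if the region model
                                    // changes and this stops holding, verification fails
                                    // instead of the assumption silently rotting.
                                    let offset: u64 = kani::any();
                                    let len: u64 = kani::any();
                                    if lease_fits(offset, len) {
                                        assert!(offset + len <= REGION_SIZE);
                                    }
                                }
                                ```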

                                1. 5

                                  What does it have to do with formal verification?

                                  1. 5

                                    In critical software land, such as avionics (e.g. DO-178C), and indeed in systems/quality engineering generally in many industrial settings, the emphasis is on V&V (as contrasted with a solitary V), which is Validation and Verification. Validation is actually making sure that your requirements/specification correctly capture what is required by the end user (the bit that was missing here, an uncaptured invariant), and Verification is the process of making sure that the software performs according to the specification.

                                    You need both! They can’t really exist in isolation. We have a slight annoyance in the current state of computer science: you have to use a specific language like TLA+ to check that a slightly lower-level specification (e.g. memory allocation) cannot put the system into a state where it violates a set of rules about the system behaviour, and then you have to implement that specification in, say, Rust, with a completely independent effort to make sure the Rust is doing the same thing as the TLA+. There is a whole layer of real-world stuff on top for many practical problems - it’s hard to capture JPL using metric and Lockheed using imperial on their shared Mars mission - but it’s more just layers of an onion, with different intermediate languages and varying levels of messy human-factor/organizational-behaviour things to capture and specify in different layers. It’s all still just V&V, and it absolutely applies to the real world.

                                    1. 3

                                      That doesn’t feel like a very helpful formulation. They must apply to some extent, or they wouldn’t be used and wouldn’t be useful.