1. 6
  1.  

  2. 3

    Why you shouldn’t compile asserts out in production builds

    Many of the popular embedded platforms have options to compile out error handling and assertions. In our opinion, this is a mistake. In the event of inconsistent behavior or memory corruption, crashing the device is often the safest thing you could do. Embedded systems can often reboot quickly, getting the system back to a known good state. Over time, the crashes will also be noticed and provide valuable feedback to your engineering teams. The alternative is worse: the system could behave in unpredictable ways, perform poorly, or lose customer information.

    That is an interesting discussion.

    We disable asserts in production builds. The primary argument: Even an insignificant feature can crash the whole device. I don’t want the user to notice if an assert in some potentially-never-looked-at diagnostics routine is false.

    Since we build a real-time system, timing behavior is relevant. So we disable asserts in system tests as well.

    In unit tests, asserts are enabled but not very useful. A programmer can usually keep a whole unit in mind, so the assumptions about it tend to hold, and encoding them into asserts yields asserts that rarely find bugs.

    That only leaves our functional simulations, where asserts are really useful.

    Effectively, asserts are primarily documentation of assumptions for us. Since we build safety-critical software, the code rarely makes assumptions. Instead, every assumption is tested and some safety mechanism is triggered in case it is violated. Thus, there are only a few asserts in our code.

    1. 3

      We mention in the blog post that in the context of safety-critical software you make different decisions.

      That being said, which is more dangerous:

      1. the system is in an inconsistent state and assumptions are violated, or
      2. the system reboots

      I do not have experience with safety-critical systems, but I recall that most certifications mandate a very fast reboot time. This would make me err towards (2) being the lesser of the two evils. The risk seems easier to characterize, and it gives you the opportunity to catch the error.

      Still, you bring up a good point: you probably don’t want to assert when a problem is recoverable in that case. So a middle ground might be an assert_debug and an assert_always function, where the former gets compiled out but the latter does not.
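      A minimal C sketch of that middle ground; the names assert_debug/assert_always come from the comment above, but the NDEBUG switch and the failure handler are illustrative choices, not from the post:

      ```c
      #include <stdio.h>
      #include <stdlib.h>

      /* Sketch of the proposed two-tier assert. The compile-out switch
       * (NDEBUG) and the handler below are illustrative. */
      #define assert_always(expr)                                       \
          do {                                                          \
              if (!(expr)) {                                            \
                  fprintf(stderr, "assert failed: %s\n", #expr);        \
                  abort();  /* on a device: reset to a known state */   \
              }                                                         \
          } while (0)

      #ifdef NDEBUG
      #define assert_debug(expr) ((void)0)        /* compiled out */
      #else
      #define assert_debug(expr) assert_always(expr)
      #endif

      int main(void) {
          int x = 1;
          assert_debug(x == 1);   /* disappears when NDEBUG is defined */
          assert_always(x == 1);  /* stays in every build */
          puts("ok");
          return 0;
      }
      ```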

      1. 2

        I’m in automotive, so an example might be that the Automated Emergency Brake function runs into an assert. As a driver, I would prefer that only this feature silently restarts, rather than Lane Keeping and Cruise Control also getting switched off while the whole device resets.

        We agree that safety always comes first. However, safety often requires tradeoffs for availability, so we want to optimize availability without sacrificing safety.

        The two-assert proposal was raised here as well. The problem is that it is often hard to decide which one to use.

        1. 1

          One thing worth considering is that you do not have to crash on assert. Ultimately, assert is software like any other, and you could choose to restart a service, subsystem, or the whole system on a given assert, based on what it is asserting.

          So your API would now be assert(boolean_expression, assert_type), where the assert type could include “system assert”, “service assert”, “subsystem assert”, or something like that.
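          A sketch of that API in C; the enum values and the recovery actions taken per assert type are hypothetical:

          ```c
          #include <stdio.h>
          #include <stdlib.h>

          /* Sketch of the proposed assert(expr, assert_type) API.
           * The severity levels and handler actions are illustrative. */
          typedef enum {
              ASSERT_SERVICE,    /* restart just the failing service */
              ASSERT_SUBSYSTEM,  /* restart a group of services */
              ASSERT_SYSTEM      /* reboot the whole device */
          } assert_type_t;

          static void assert_handler(const char *expr, assert_type_t type) {
              switch (type) {
              case ASSERT_SERVICE:
                  printf("restarting service after: %s\n", expr);
                  break;
              case ASSERT_SUBSYSTEM:
                  printf("restarting subsystem after: %s\n", expr);
                  break;
              case ASSERT_SYSTEM:
                  printf("rebooting device after: %s\n", expr);
                  abort();  /* on real hardware: trigger a watchdog reset */
              }
          }

          #define ASSERT_T(expr, type) \
              do { if (!(expr)) assert_handler(#expr, (type)); } while (0)

          int main(void) {
              int sensor_ok = 0;
              ASSERT_T(sensor_ok, ASSERT_SERVICE);  /* recoverable scope */
              puts("rest of the system keeps running");
              return 0;
          }
          ```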

      2. 3

        Since we build safety-critical software the code rarely makes assumptions. Instead every assumption is tested and some safety mechanism is triggered in case it is violated.

        Asserts are about assumptions about the calling code, not external systems.

        For example, if a function has as its precondition that the pointer handed to it is non-null, what “safety mechanism is triggered” if it is handed a null pointer?

        1. 1

          An internal assumption might be: A floating point computation does not result in NaN.

          The safety mechanism might be to restart only one software component instead of the whole device.
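          As a rough C illustration (the component name and the recovery action are invented), such a NaN check might look like:

          ```c
          #include <stdio.h>
          #include <math.h>

          /* Hypothetical check: a computation that can produce NaN is
           * tested, and only the affected component is restarted. */
          static double filter_step(double num, double den) {
              return num / den;  /* 0.0 / 0.0 yields NaN */
          }

          int main(void) {
              double y = filter_step(0.0, 0.0);
              if (isnan(y)) {
                  puts("restarting filter component");  /* not the device */
                  return 0;
              }
              puts("filter ok");
              return 0;
          }
          ```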

          1. 2

            Of course, that requires some pretty careful and complex design and thorough testing in itself to reestablish coordination between communicating components.

            Especially where either of them can die and restart at any stage in their communication.

            Erlang has some pretty good patterns for this, but sadly, they’re pretty uncommon in C.

            So you’re essentially saying your “safety mechanism” is curl up and die and restart (but in a bounded subsystem).

            Sadly, most times I have worked with, inspected, and tested such designs (restarting a subsystem), connascence has shown its hideous face, and many man-years have been sunk trying to get them to (and sadly, sometimes, back to) rock-solid, works-99.999%-of-the-time operation.

            I wouldn’t describe that as “disable asserts in production builds”, but rather as “compiled into production builds, attempting to reestablish correct functioning as rapidly and reliably as possible”. (Which is what I do.)

            Handling the problem of random resets in communicating subsystems is a whole ’nother conversation on how to do it right.

            1. 1

              I agree that these bounded restarts add some serious complexity. So far, choosing the simpler approach (reset the whole device) has been good enough for us in terms of availability.

              Erlang is certainly optimized for availability and (to some degree) for real-time applications, so I believe it can be a great inspiration. I have no experience with it, though, and as far as I know Erlang has not been used for safety-critical applications.

      3. 2

        Making the Most of Asserts

        I record the backtrace and the precise program counter. That’s it. Nothing more. OK, a timestamp can also be useful, and maybe a couple of uintptr_t words the programmer can add to help debugging.

        Care needs to be taken, as the optimizer will sweep all the common code into one call to the assert utility; then you don’t know which of several asserts in a function fired! We ended up going with a gcc asm one-liner to get the precise PC.

        https://www.gnu.org/software/libc/manual/html_node/Backtraces.html
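        A sketch of one way to keep each assert site distinguishable despite that tail-merging: hand the common failure routine a unique code address per site. This uses the GCC/Clang label-as-value extension rather than the target-specific asm one-liner mentioned above, and the report function just prints instead of rebooting:

        ```c
        #include <stdint.h>
        #include <stdio.h>

        /* Sketch: each assert site passes its own code address, so even
         * if the optimizer folds every failure path into one call, the
         * reported address still identifies the site. */
        static void report_assert(uintptr_t pc, const char *expr) {
            printf("assert failed near pc=0x%lx: %s\n",
                   (unsigned long)pc, expr);
        }

        #define CAT2(a, b) a##b
        #define CAT(a, b) CAT2(a, b)
        #define MY_ASSERT(expr)                                            \
            do {                                                           \
                if (!(expr)) {                                             \
                    CAT(assert_site_, __LINE__):                           \
                    report_assert(                                         \
                        (uintptr_t)&&CAT(assert_site_, __LINE__), #expr);  \
                }                                                          \
            } while (0)

        int main(void) {
            int x = 3;
            MY_ASSERT(x == 4);  /* fails: reports this site's address */
            MY_ASSERT(x == 3);  /* passes silently */
            puts("done");
            return 0;
        }
        ```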

        My biggest problem with code that checks for malloc returning null and attempts to handle it….

        …usually it is untested, buggy, and somewhere along the line uses malloc to do its job! (Guess what lurks in the depths of a printf?)

        The next problem on a system with swap…. these days your system is effectively dead/totally dysfunctional long before malloc returns NULL!

        The lightweight IP stack (lwIP) uses pool allocators with quite small pools for resources that may have (potentially malicious) spikes in usage. But then you will find all over it the attitude: “this is an IP packet; something’s wrong / I can handle it / I don’t know enough / I don’t have enough resources / … I’ll just drop the packet. If it matters, the higher layers will retry.”
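        A toy pool allocator in that spirit (pool and block sizes are invented), where exhaustion is an expected, handled outcome rather than a crash:

        ```c
        #include <stdio.h>
        #include <stddef.h>

        /* Minimal fixed-size pool: a small static pool, and running out
         * of blocks is normal operation (the packet is dropped). */
        #define POOL_BLOCKS 4
        #define BLOCK_SIZE  128

        static unsigned char pool[POOL_BLOCKS][BLOCK_SIZE];
        static int in_use[POOL_BLOCKS];

        static void *pool_alloc(void) {
            for (int i = 0; i < POOL_BLOCKS; i++) {
                if (!in_use[i]) {
                    in_use[i] = 1;
                    return pool[i];
                }
            }
            return NULL;  /* exhausted: caller drops the packet */
        }

        int main(void) {
            /* claim every block, then try one more, as a spike would */
            for (int i = 0; i < POOL_BLOCKS; i++)
                (void)pool_alloc();
            void *extra = pool_alloc();
            puts(extra ? "allocated" : "packet dropped");
            return 0;
        }
        ```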

        Another good pattern is to malloc everything you need for this configuration at initialization time … at least then you know then and there that that configuration will work… if you can’t, you reboot to a safe configuration.
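        A sketch of that allocate-at-init pattern (the buffer names, sizes, and fallback are invented for illustration): everything the configuration needs is grabbed up front, so failure is detected at boot rather than mid-operation.

        ```c
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical app buffers, all allocated at initialization. */
        struct app_buffers {
            unsigned char *rx;
            unsigned char *tx;
        };

        static int buffers_init(struct app_buffers *b,
                                size_t rx_sz, size_t tx_sz) {
            b->rx = malloc(rx_sz);
            b->tx = malloc(tx_sz);
            if (!b->rx || !b->tx) {
                free(b->rx);
                free(b->tx);
                return -1;  /* on a device: reboot to a safe config */
            }
            return 0;
        }

        int main(void) {
            struct app_buffers b;
            if (buffers_init(&b, 4096, 4096) != 0) {
                puts("falling back to safe configuration");
                return 1;
            }
            puts("configuration viable");  /* no malloc needed past here */
            free(b.rx);
            free(b.tx);
            return 0;
        }
        ```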

        When Not to Assert

        Never assert on invalid input from users or external untrusted systems. If you do, you open yourself to denial of service attacks (and pissed off users).
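        In code terms: untrusted input gets a validate-and-reject path, while asserts stay reserved for internal invariants. A sketch with a made-up frame format (magic byte plus length field):

        ```c
        #include <stdint.h>
        #include <stdio.h>
        #include <stddef.h>

        /* External input is checked and rejected with an error return;
         * a malformed frame must never reach an assert. */
        static int parse_frame(const uint8_t *buf, size_t len,
                               uint8_t *payload_len) {
            if (len < 2)          return -1;  /* truncated: reject */
            if (buf[0] != 0xAA)   return -1;  /* bad magic byte: reject */
            if (buf[1] > len - 2) return -1;  /* length field lies: reject */
            *payload_len = buf[1];
            return 0;
        }

        int main(void) {
            const uint8_t evil[] = { 0xAA, 0xFF };  /* claims 255 bytes */
            uint8_t n;
            if (parse_frame(evil, sizeof evil, &n) != 0)
                puts("dropped malformed frame");  /* no crash to exploit */
            return 0;
        }
        ```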

        Design by Contract

        Please read and understand https://en.wikipedia.org/wiki/Design_by_contract

        I regard DbC as one of the most important concepts in producing correct software, and it has a lot to say about asserts.

        1. 1

          Lots to chew on here. Thanks for sharing about DbC. Do you have any good books on the topic you’d recommend?

          I record the backtrace and the precise program counter. That’s it.

          In embedded use cases, you cannot always spare the code size for the unwind tables. This makes the backtrace builtin less than useful. Instead, we usually grab PC + LR.

          Never assert on invalid input from users or external untrusted systems. If you do, you open yourself to denial of service attacks (and pissed off users).

          You are absolutely right, we’ll add a note to the post

          The next problem on a system with swap…. these days your system is effectively dead/totally dysfunctional long before malloc returns NULL!

          A good point, though the systems we cover here do not fall under that definition

          1. 2

            The canonical grandfather book is “Bertrand Meyer. Object-Oriented Software Construction. Prentice Hall, 1997”.

            Sadly, it’s such a fundamental, dating back to the early program-proving papers, that the Comp Sci types regard it as “done to death” and are picking over obscure corners… and the proprietary types feel Meyer and Eiffel have cornered the market…

            Sigh.

            I wish I could point you to a modern well written tome focused solely on DbC and not on some library or language.

            If you find one, please tell me.

            Don’t need unwind tables.

            For the particular embedded CPU we’re using, libc didn’t have support for backtrace, so we rolled our own, walking up the frame pointers and picking out the return address from each frame.

            I should imagine the ARM glibc would work out of the box.

            A gotcha is it has, ahh, imprecisions thanks to the optimizer.

            If the optimizer can at all get away without creating a frame, it will. Thus the real call graph may be A calls B calls C calls D, but the optimizer elided the frames for B and D … the backtrace will show A called C and died amazingly somewhere in D.

            I bound the number of return addresses we store on an assert failure to something small but useful. (5 I think)

            1. 1

              For the particular embedded CPU we’re using, libc didn’t have support for backtrace, so we rolled our own, walking up the frame pointers and picking out the return address from each frame.

              ARMv7-M doesn’t use a frame pointer, which is why implementations of backtrace rely on unwind tables in that case.

              For what it’s worth, our approach is: ship the stacks back, and on the web backend grab the unwind tables from the symbol file (ELF) and work out the backtrace there.

        2. 1

          Dynamic memory allocation is common in embedded systems

          I expected the opposite to be the case…