Overrated. A phone is not a government server with lots of actual human users with different secret access clearances; why would you need MAC? App isolation can be done in simpler, better ways.
verified boot
Not doing it makes user control easier, screw fiddling with signatures :) But an option to have it would be nice.
app sandboxing
Much less necessary when the apps are FOSS from a trusted repo, instead of ad-ridden proprietary store apps.
Firefox sandboxes its own content processes pretty well.
Modem isolation isn’t anything special. Qualcomm SoCs have isolated the modem via an IOMMU for years.
Well, I’d trust good old USB way more than Qualcomm’s piles of custom stuff.
Much less necessary when the apps are FOSS from a trusted repo, instead of ad-ridden proprietary store apps. Firefox sandboxes its own content processes pretty well.
You are only one Geary, KMail, Evince, or Okular vulnerability away from a full user account compromise. Sandboxing does not only protect against untrusted applications, but also against vulnerabilities in trusted applications. At least the OpenBSD folks understood this and made pledge, etc and introduced it across the base system. But the larger Linux/FLOSS Unix ecosystem does not seem to get this yet.
The reason the lack of a proper security model isn’t actively exploited yet is that the Linux desktop is such a small blip on the radar that it is not a worthwhile target yet. But if you want to compete in the smartphone or desktop markets, you should address such issues.
I’m certainly not a security professional, and I don’t want to twist your words, however it sounds to me like you’re advocating “security by not making mistakes.” As if it doesn’t matter how technically easy it is to compromise the whole system, because you’re always only running software written by trustworthy people.
I think the main point of the article is that Android and iOS, while certainly laden with junk and corporate interests, do indeed have a better sandboxing and overall security model than traditional desktop / server OSes like Linux. You need an IT professional to set up sandboxing on Linux. But a regular consumer can have sandboxing by default on Android and iOS.
So the author has a good point; I too am somewhat concerned that by discarding Android and iOS, we’re potentially throwing out the baby with the bath water.
You need an IT professional to set up sandboxing on Linux.
Most apps run fine with something like firejail. You don’t need to be an “IT professional” (however you quantify that..?) but I agree it’s not straightforward. The tools are there for someone (Purism?) to create something easier for folks to use out of the box.
A CPU developer put it better than I could have:
Security must be unobtrusive, unavoidable, and cheap.
Or it won’t get used.
If “most” applications work in Firejail, then most people (including me) won’t bother trying. It has to work with all of the applications that typical people want to try, and the best way to ensure that developers actually test their app with it is if it’s on by default and rarely turned off.
the best way to ensure that developers actually test their app with it is if it’s on by default and rarely turned off.
It’s best if security is at the core of the system. This is why investing in systems built around capabilities (not to be confused with POSIX capabilities) from the start is the only path going forward. seL4 is a good core to build such a system around.
As someone who has used firejail and bubblewrap, this is a “who needs Dropbox, you can do this trivially with curlftpfs” comment. There is no way a common user is going to set up firejail. And setting up firejail for them for every (GUI) application is going to be either: meaningless, because you have to expose large parts of a system to make an application usable; or useless, because your application is contained to such an extent that you cannot open files, pass data between applications, etc. Most of the work in implementing proper sandboxing is in providing mechanisms to securely open files from different places, passing data between applications, etc.
Most of the work in implementing proper sandboxing is in providing mechanisms to securely open files from different places, passing data between applications, etc.
This is best achieved in a system that’s designed this way from the get-go. This is what capability-based microkernel multiserver systems such as Genode/seL4 are. Google is no stranger to this, thus Fuchsia. Huawei isn’t sleeping either, thus HarmonyOS.
A capability-based system may be better, but until such a system becomes mainstream, we are better served by good sandboxing in a traditional OS than no sandboxing at all.
Absolutely.
Well, I never said that firejail was ready for the masses. I was merely pointing out that it’s more approachable than most sandboxing options, and it wouldn’t be a stretch for someone/people to make it even more seamless to use.
In my experience USB stacks (both OS-level and FW-level) are riddled with security bugs. Not what I’d want to use for modem isolation.
And yet, USB (< USB4) has some hardening built-in: for example, no ability for devices to do hostile DMA because that’s all host moderated (the host decides when transfers happen and which memory locations are involved).
But what does it even mean for the host to “decide” something if the USB device can hijack it via another vector and get it to “decide” something else? I’m sure it has some hardening, but I think even the designers of USB would have to agree it is poorly suited to this use case.
What other vector would that be? Every USB transfer only happens when the host wants it to happen. There just is no way for a device to force a change to memory, and therefore the host behavior.
Any buffer overflow in the USB stack on the host OS, or any similar error in the USB controller firmware.
Does the USB controller on the Librem 5 have DMA, or is it going through an IOMMU?
MAC has nothing to do with multi-users.
At the moment, I think the right balance to strike is to maintain a low level C BSP and link in Rust code on top.
I admire what the Rust community is doing with embedded-hal, and I hope they eventually convince chip makers to contribute Rust BSPs using the standard API.
Until then, you’re left relying on an unofficial BSP with no support whatsoever. This is a deal breaker for most organizations.
I2C sounds simple and powerful, but I have grown quite resentful of it. This stems from a decade of having to work with it - debugging is the worst.
There are a myriad of issues with the bus, the worst being that it can get stuck - and there’s no way to get it out of that state. A comment below the article has already mentioned this. Of course you can wire resets to all the bus devices, but that increases pin count from 2 to 3, not counting GND, and is outside of what the I2C bus standard entails. Of course you can try to wiggle the bus a bit and see if this frees it from a stuck state. Maybe now you have an unwanted write to an EEPROM. Or maybe the method doesn’t work because a stuck slave is clock stretching indefinitely. The list can be made to go on for quite a while.
My personal advice for those considering using I2C: Don’t, unless you are forced to. If you still think you want to use it, the following checklist will help you have a better experience.
Have only a single master. I have never seen a trustworthy multi-master system.
Have only “simple” slaves. I.e. EEPROMs with hardware implemented I2C interfaces without clock stretching are OK. Microcontrollers are not OK, the program can get stuck and you wind up with that slave stretching the clock indefinitely. Be aware that if you buy, say, “an IMU”, it may actually contain a microcontroller handling the I2C communication. This is the case for a number of more complex devices.
No hotplug.
Have all devices powered from the same supply.
Have few devices on the bus.
Have a short bus.
Nice to have: Be able to reset slaves from the master.
Great comment, and your checklist is dead on.
One thing I will say though: I’ve shipped ~10 CE products, all of them with I2C devices, and have only had problems with one peripheral: Apple’s MFi chip (used to be required for BT accessories).
So while in practice I2C buses can get stuck, it is not a problem of epidemic proportion as some would have you believe.
Still - it’s good practice to have a plan to recover in the event the issue crops up. Reset pins & power control are the way to go.
Last but not least, always make sure the electrical characteristics of your bus are correct. If transition or hold times are violated, you’re going to have a bad time.
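For reference, the usual “wiggle the bus” recovery mentioned above is to clock SCL up to nine times until the slave releases SDA, then generate a STOP. A minimal sketch, assuming hypothetical gpio_* and delay_us() helpers for a bit-banged recovery path (it won’t free a slave that is clock stretching forever):

    extern "C" {                       // placeholder pin/delay hooks, not a real HAL
    bool gpio_read_sda(void);
    void gpio_set_scl(bool high);
    void gpio_set_sda(bool high);
    void delay_us(unsigned us);
    }

    static bool i2c_bus_recover(void) {
        // Pulse SCL up to 9 times, or until the stuck slave releases SDA.
        for (int i = 0; i < 9 && !gpio_read_sda(); ++i) {
            gpio_set_scl(false);
            delay_us(5);
            gpio_set_scl(true);
            delay_us(5);
        }
        if (!gpio_read_sda()) {
            return false;              // still held low: fall back to reset pins / power cycling
        }
        // Generate a STOP condition: SDA low -> high while SCL stays high.
        gpio_set_sda(false);
        delay_us(5);
        gpio_set_sda(true);
        delay_us(5);
        return true;
    }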
One thing I will say though: I’ve shipped ~10 CE products, all of them with I2C devices, and have only had problems with one peripheral: Apple’s MFi chip (used to be required for BT accessories).
Out of sheer interest: How well did these products match the checklist?
How big a problem you have with I2C depends a lot on your design.
If you have, say, an I2C bus with an imaging sensor, a thermometer and an EEPROM, you might be able to dodge the worst of troubles. If “the bus gets stuck” and you have appropriate measures in place to reset all devices, your device might carry on with only a small hiccup. Doing this requires forethought in the hardware design and you have to invest in your software. On the hardware side, you need resets to the chips that support resets, your imaging sensor most certainly. The EEPROM (like 24xx devices) has no reset pin. If the bus interface is not responding, you have to power cycle it. This takes up board space and you have to get the power supply right. On the software side you need to detect that “the bus is stuck” and you need to act. Then your software needs to deal with hardware not always being reachable, maybe even being partially reachable.
I think that it would be much more sensible to have a communication standard where, if communication with one component fails, only communication with that one component fails.
One project I worked on had a central processor and several microcontrollers scattered throughout the system, all connected through I2C. One of the microcontrollers was the first to start and controlled power sequencing for startup, standby and shutdown. It also had an IMU connected, the data of which the main processor needs to read. When reading IMU data, bus traffic increases considerably, triggering a preexisting (and, I believe, still unsolved) bug within a minute instead of within hours. The fault was always shortly after some microcontrollers were powered up and hotplugged to the bus. The bug leads to the power sequencing microcontroller not serving I2C anymore, infinitely clock stretching. The main processor cannot reset the power sequencing uC for different reasons: it does not have a connection to it apart from the stuck I2C, and the power sequencing uC has to be the last system part to be powered down.
Now, clearly this hardware design is bonkers, for a variety of reasons. The power sequencing uC should have fewer responsibilities, i.e. only power sequencing. IMU readout should ideally be handled by another uC, and clearly on a bus where it’s isolated from more critical functions. Also very clearly, the design is bonkers because it violates several points on the I2C checklist and insists on using I2C for vital system parts.
Last but not least, always make sure the electrical characteristics of your bus are correct. If transition or hold times are violated, you’re going to have a bad time.
This is true. However, in many cases you don’t have control over all the conditions that you need to control in order to guarantee reliable operation.
The I2C standard is not really a standard, it’s more like a leisurely application note. This leads to many different interpretations and assumptions about how the bus is driven, assumptions that may not be true. As a result, connecting different, unknown devices together leads to a significant amount of new and exciting bugs.
To sum this up, my advice remains the same: Don’t use I2C unless you have to. If you have to, fulfill the checklist as well as you are allowed to.
Why you shouldn’t compile asserts out in production builds
Many of the popular embedded platforms have options to compile out error handling and assertions. In our opinion, this is a mistake. In the event of inconsistent behavior or memory corruption, crashing the device is often the safest thing you could do. Embedded systems can often reboot quickly, getting the system back to a known good state. Over time, it will also be noticed and provide valuable feedback to your engineering teams. The alternative is worse: the system could behave in unpredictable ways and perform poorly, lose customer information, etc.
That is an interesting discussion.
We disable asserts in production builds. The primary argument: Even an insignificant feature can crash the whole device. I don’t want the user to notice if an assert in some potentially-never-looked-at diagnostics routine is false.
Since we build a real-time system, timing behavior is relevant. So we disable asserts in system tests as well.
In unit tests asserts are enabled but not so useful. A programmer can often keep a unit in mind, so the assumptions about a unit are often true and encoding them into asserts results in asserts which rarely find bugs.
That only leaves our functional simulations, where asserts are really useful.
Effectively, asserts are primarily documentation of assumptions for us. Since we build safety-critical software the code rarely makes assumptions. Instead every assumption is tested and some safety mechanism is triggered in case it is violated. Thus, there are only a few asserts in our code.
We mention in the blog-post that in the context of safety-critical software you make different decisions.
That being said, which is more dangerous:
the system is in an inconsistent state and assumptions are violated, or
the system reboots
I do not have experience with safety-critical systems, but I recall that most certifications mandate a very fast reboot time. This would make me err towards (2) being the lesser of the two evils. The risk seems easier to characterize, and it gives you the opportunity to catch the error.
That being said, you bring up a good point: you probably don’t want to assert when a problem is recoverable in that case. So a middle ground might be: an assert_debug and an assert_always function, the former gets compiled out but the latter does not.
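A minimal sketch of that middle ground, assuming a project-specific assert_failed() handler; the names are illustrative:

    extern "C" void assert_failed(const char *file, int line);   // assumed project hook

    // Always compiled in: for invariants whose violation means the device
    // should reset (or trigger whatever recovery assert_failed() implements).
    #define ASSERT_ALWAYS(expr) \
        do { if (!(expr)) assert_failed(__FILE__, __LINE__); } while (0)

    // Compiled out of release builds: cheap documentation-style checks.
    #ifdef NDEBUG
    #define ASSERT_DEBUG(expr) ((void)0)
    #else
    #define ASSERT_DEBUG(expr) ASSERT_ALWAYS(expr)
    #endif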
I’m in automotive, so an example might be that the Automated Emergency Brake function runs into an assert. As a driver I would prefer that only this feature silently restarts. Alternatively, Lane Keeping and Cruise Control get switched off as the device resets.
We agree that safety always comes first. However, safety often requires tradeoffs for availability, so we want to optimize availability without sacrificing safety.
The two-assert proposal was raised here as well. The problem is that it is often hard to decide which one to use.
One thing worth considering is that you do not have to crash on assert. Ultimately, assert is software like any other and you could choose to restart a service, subsystem, or the whole system on a given assert based on what it is asserting.
So your API would now be assert( boolean_expression, assert_type), where assert type could include “system assert”, “service assert”, “subsystem assert” or something like that.
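Something along those lines could look like the sketch below; the scope names and recovery hooks are made up for illustration:

    enum class AssertScope { Service, Subsystem, System };

    extern "C" void restart_service(void);     // assumed recovery hooks
    extern "C" void restart_subsystem(void);
    extern "C" void reboot_system(void);

    inline void assert_scoped(bool ok, AssertScope scope) {
        if (ok) return;
        switch (scope) {                        // blast radius is chosen at the call site
            case AssertScope::Service:   restart_service();   break;
            case AssertScope::Subsystem: restart_subsystem(); break;
            case AssertScope::System:    reboot_system();     break;
        }
    }

    // e.g. assert_scoped(msg_queue != nullptr, AssertScope::Subsystem);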
Since we build safety-critical software the code rarely makes assumptions. Instead every assumption is tested and some safety mechanism is triggered in case it is violated.
Asserts are about assumptions about the calling code, not external systems.
For example, if a function has as its precondition that the pointer handed to it is non-null, what “safety mechanism is triggered” if it is handed a null pointer?
An internal assumption might be: A floating point computation does not result in NaN.
The safety mechanism might be to restart only one software component instead of the whole device.
Of course, that requires some pretty careful and complex design and thorough testing in itself to reestablish coordination between communicating components.
Especially where either of them can die and restart at any stage in their communication.
Erlang has some pretty good patterns for this, but sadly, they’re pretty uncommon in C.
So you’re essentially saying your “safety mechanism” is curl up and die and restart (but in a bounded subsystem).
Sadly, most times I have worked with and inspected and tested such designs (restarting a subsystem), connascence has shown its hideous face and many man-years have been sunk trying to get it to (and sadly, sometimes, back to) rock-solid “works 99.999% of the time”.
I wouldn’t describe that as “disable asserts in production build”, rather “compiled into production builds and attempting to reestablish correct functioning as rapidly and reliably as possible”. (Which is what I do).
Handling the problem of random resets in communicating subsystems is a whole ’nother conversation on how to do it right.
I agree that these bounded restarts add some serious complexity. So far, choosing the simpler approach (reset the whole device) was good enough for us in terms of availability.
Erlang is certainly optimized for availability and (to some degree) for real-time applications. So I believe it can be a great inspiration. I have no experience with it though. As far as I know Erlang has not been used for safety-critical stuff though.
Making the Most of Asserts
I record the backtrace and the precise program counter. That’s it. Nothing more. OK. Also a time stamp can be useful and maybe a couple of uintptr_t words the programmer can add to help debug.
Care needs to be taken as the optimizer will sweep all common code into one call to the assert utility, then you don’t know which of several asserts in a function fired! We ended up going with a gcc asm oneliner to get the precise PC.
https://www.gnu.org/software/libc/manual/html_node/Backtraces.html
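A sketch of that kind of assert for ARM/Thumb targets built with GCC; assert_record() is an assumed storage hook, and the asm pins a distinct PC read into every expansion so merged calls don’t hide which assert fired:

    extern "C" void assert_record(const void *pc, const void *lr);   // assumed hook

    #define MY_ASSERT(expr)                                         \
        do {                                                        \
            if (!(expr)) {                                          \
                const void *pc;                                     \
                /* reads this instruction's address (+4 in Thumb) */\
                asm volatile ("mov %0, pc" : "=r"(pc));             \
                assert_record(pc, __builtin_return_address(0));     \
            }                                                       \
        } while (0)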
My biggest problem with code that checks for malloc returning null and attempting to handle it….
…usually it is untested, buggy, and somewhere along the line uses malloc to do its job! (Guess what lurks in the depths of a printf?)
The next problem on a system with swap…. these days your system is effectively dead/totally dysfunctional loong before malloc returns NULL!
The lightweight IP stack uses pool allocators with quite small pools for resources that may have (potentially malicious) spikes in usage. But then you will find all over it the attitude “this is an IP packet, something’s wrong / I can handle it / I don’t know enough / I don’t have enough resources / ….” I’ll just drop the packet. If it matters the higher layers will retry.
Another good pattern is to malloc everything you need for this configuration at initialization time … at least then you know then and there that that configuration will work… if you can’t, you reboot to a safe configuration.
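For illustration, a minimal fixed-block pool in the spirit of the approach above: all storage is reserved at initialization, and running out is an error the caller can absorb (drop the packet) rather than a heap failure:

    #include <cstddef>
    #include <cstdint>

    template <size_t BlockSize, size_t BlockCount>
    class Pool {
    public:
        Pool() {                                         // build the free list at init time
            for (size_t i = 0; i < BlockCount; ++i) release(&storage_[i]);
        }
        void *acquire() {
            if (free_list_ == nullptr) return nullptr;   // pool empty: caller drops the request
            Node *n = free_list_;
            free_list_ = n->next;
            return n;
        }
        void release(void *p) {
            Node *n = static_cast<Node *>(p);
            n->next = free_list_;
            free_list_ = n;
        }
    private:
        union Node { Node *next; uint8_t raw[BlockSize]; };
        Node storage_[BlockCount];
        Node *free_list_ = nullptr;
    };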
When Not to Assert
Never assert on invalid input from users or external untrusted systems. If you do, you open yourself to denial of service attacks (and pissed off users).
Design by Contract
Please read and understand https://en.wikipedia.org/wiki/Design_by_contract
I regard DbC as one of the most important concepts in producing correct software, and it has a lot to say about asserts.
Lots to chew on here. Thanks for sharing about DbC. Do you have any good books on the topic you’d recommend?
I record the backtrace and the precise program counter. That’s it.
In embedded use cases, you cannot always spare the code size for the unwind tables. This makes the backtrace builtin less than useful. Instead, we usually grab PC + LR.
Never assert on invalid input from users or external untrusted systems. If you do, you open yourself to denial of service attacks (and pissed off users).
You are absolutely right, we’ll add a note to the post
The next problem on a system with swap…. these days your system is effectively dead/totally dysfunctional loong before malloc returns NULL!
A good point, though the systems we cover here do not fall under that definition
The canonical grandfather book is “Bertrand Meyer. Object-oriented software construction. Prentice Hall, 1997”.
Sadly it’s such a fundamental concept, dating back to early program-proving papers… the Comp Sci types regard it as “done to death” and are picking over obscure corners… and the proprietary types feel Meyer and Eiffel have cornered the market…
Sigh.
I wish I could point you to a modern well written tome focused solely on DbC and not on some library or language.
If you find one, please tell me.
Don’t need unwind tables.
For the particular embedded CPU we’re using, libc didn’t have support for backtrace, so we rolled our own, walking up the frame pointers and picking out the return addresses from each frame.
I should imagine the ARM glibc would work out of the box.
A gotcha is it has, ahh, imprecisions thanks to the optimizer.
If the optimizer can at all get away without creating a frame, it will. Thus the real call graph may be A calls B calls C calls D, but the optimizer elided the frames for B and D … the backtrace will show A called C and died amazingly somewhere in D.
I bound the number of return addresses we store on an assert failure to something small but useful. (5 I think)
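A sketch of such a bounded frame-pointer walk with GCC builtins; the two-word frame layout is an assumption that has to match your ABI and compiler flags (and, as the reply below notes, does not apply when the compiler omits frame pointers):

    #include <cstddef>

    struct StackFrame {            // assumed layout: saved frame pointer, then return address
        const StackFrame *prev;
        const void *return_addr;
    };

    static size_t backtrace_fp(const void **out, size_t max_frames) {
        const StackFrame *frame =
            static_cast<const StackFrame *>(__builtin_frame_address(0));
        size_t n = 0;
        while (frame != nullptr && n < max_frames) {   // bounded, e.g. max_frames = 5
            out[n++] = frame->return_addr;
            frame = frame->prev;                       // a real version should sanity-check this pointer
        }
        return n;
    }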
For the particular embedded CPU we’re using, libc didn’t have support for backtrace, so we rolled our own, walking up the frame pointers and picking out the return addresses from each frame.
ARM-v7m doesn’t use a frame pointer, which is why implementations of backtrace rely on unwind tables in that case.
For what it’s worth, our approach is: ship the stacks back, and on the web backend grab the unwind tables from the symbol file (elf) and work out the backtrace there.
This is extremely cool. I’d love to know more about the complexity of the system required to write these.
Very interesting. When using opaque structs in APIs in the past, I’ve always provided _create and _destroy functions that allocate them on the heap. API users would regularly leak memory, and allowing the use of alloca in some cases would have made the API simpler to use.
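One possible compromise that keeps the struct opaque but lets callers avoid the heap is to expose the storage size plus an in-place init; all names here (foo_t, foo_storage_size, …) are illustrative, not an existing API:

    #include <cstddef>

    struct foo_t;                          // layout stays hidden in the implementation file

    extern "C" {
    size_t  foo_storage_size(void);        // lets callers alloca()/stack-allocate the right amount
    foo_t  *foo_init(void *storage);       // constructs in caller-provided memory, no malloc
    void    foo_deinit(foo_t *f);          // releases internal resources only, not the storage
    }

    // Caller side:
    //   void *buf = alloca(foo_storage_size());
    //   foo_t *f = foo_init(buf);
    //   ...use f...
    //   foo_deinit(f);                    // nothing on the heap, so nothing to leak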
I’m a long time user of OpenOCD, but lately I’ve been getting more excited about ARM’s PyOCD project. It’s a very well structured piece of code, written in Python. It’s very easy to extend, though it doesn’t support quite as much hardware as OpenOCD. https://github.com/mbedmicro/pyOCD
This is very interesting. I really like the Micro:bit use of CMSIS-DAP: it has two chips on board, one is the programmer and USB mass storage! You can flash by literally drag & drop. Mind blown.
It looks like I will soon need a new adapter to try this out, unfortunately mine are not supported.
LTO typically causes large increases to the stack space needed. Since LTO results in aggressive cross-object inlining, a bunch of local variables from many different functions now wind up getting allocated at the same time. When you enable LTO for the first time, expect to see some stack overflows!
Is that correct? Yes, individual stack frames may grow larger (but fewer of them), but I wouldn’t expect total stack usage to grow.
This is based on our experience enabling LTO on a few firmware projects in our career. You can imagine some cases where multiple branches in your execution tree are collapsed in a single stack frame, growing the worst case stack usage.
It seems possible.
Consider a contrived example:
a() - 32 bytes of stack
b() - 32 bytes of stack
c() - 64 bytes of stack
a() calls both b() and c() once.
If b() and c() are inlined into a() and the compiler doesn’t reuse stack slots then the maximum stack depth is now 128 instead of 96.
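The contrived example above, written out; consume() only exists to keep the buffers from being optimized away:

    #include <cstdint>

    extern void consume(uint8_t *buf, uint32_t len);   // opaque to the optimizer

    static void b() { uint8_t buf[32]; consume(buf, sizeof(buf)); }   // 32 bytes of stack
    static void c() { uint8_t buf[64]; consume(buf, sizeof(buf)); }   // 64 bytes of stack

    void a() {
        uint8_t buf[32];                                // 32 bytes of stack
        consume(buf, sizeof(buf));
        b();    // without inlining, the worst case is a()+c() = 96 bytes
        c();    // with b() and c() inlined and no slot reuse, a()'s frame holds all 128 bytes
    }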
Possible, sure, though in my experience gcc is pretty clever (aggressive, perhaps) about reusing stack slots. Even with a not-super-recent version of it (I think it was circa 5.1 or so), I recall a few years ago being impressed to discover that it had merged two distinct local arrays to share the same stack space, despite the fact that they had overlapping lifetimes – it had noticed (correctly) that while the lifetimes of the two arrays as a whole overlapped, the lifetimes of each individual corresponding pair of elements (e.g. A[0] and B[0]) did not, and hence arranged things so that the same underlying chunk of memory started out as array A and gradually, element by element, became array B.
Hmm. Ok, you’re probably right. Thanks.
Coming from a software background and having moved into embedded, I’ve been surprised how many firmware teams I’ve contracted/consulted to in the past who don’t do any CI and don’t always see a lot of value in it. Can feel 10ish years behind a lot of other software practices.[*] So thanks for writing this, @fbo.
Beyond setting up automated builds, which are not that different to setting up automated builds for “big software”, the next thing I see embedded engineers often struggle with is structuring their code so it’s possible to run meaningful tests on the host. In case you’re looking for encouragement for new blog post topics. ;)
[*] Insert joke about how still writing everything in C is 25 years behind a lot of other software practices.
@projectgus - This is my experience as well. I’d go one step further: forget CI, embedded teams often don’t use a version control system, or do code review (which I see as another run down the ladder from CI in terms of eng. sophistication).
For teams that do consider CI, the hurdle often is that they believe they need to do hardware-in-the-loop testing. This raises the complexity bar a good bit, and dooms many CI deployments. In reality, software-only tests run on x86 are much better than nothing, and take no more than an afternoon to put in place. I’m hoping that by spreading that gospel, I can help teams in the industry make the jump.
We’re writing a post on how to unit test firmware, hopefully we’ll hit the notes you’re looking for. We also welcome guest writers ;-). I’d be thrilled to edit / polish any post on embedded topics folks here want to write.
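As a tiny illustration of the kind of software-only test meant here: isolate a pure function from the drivers and check it with the host compiler in CI (the sensor format and names are made up):

    #include <cassert>
    #include <cstdint>

    // Imagined firmware helper with no register access, so it builds on x86.
    static int32_t parse_temperature(uint8_t msb, uint8_t lsb) {
        return static_cast<int16_t>((msb << 8) | lsb) / 16;   // 1/16 °C per LSB
    }

    int main() {
        assert(parse_temperature(0x01, 0x90) == 25);    // 0x0190 = 400 -> 25 °C
        assert(parse_temperature(0xFF, 0x60) == -10);   // negative readings work too
        return 0;
    }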
I love the idea, though perhaps I would have looked at a format like WaveDrom (https://wavedrom.com/) to encode the waveforms rather than Ascii.
We call the optimization “Memory Tetris”. Since we have different RAMs, it usually means to shift symbols to different regions. For example, use slower global RAM instead of fast core-local one.
Note that in this post we look at code size rather than RAM footprint. Though if you statically allocate everything you can trivially do the same thing for the data section.
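In GCC-flavored C/C++, that kind of shuffling usually comes down to section attributes plus matching linker-script regions; the section names below are assumptions:

    #include <cstdint>

    // Big, rarely-touched buffer pushed out to slower global/external RAM.
    __attribute__((section(".sdram_bss")))
    static uint8_t frame_cache[64 * 1024];

    // Small, hot scratch area kept in fast core-local RAM.
    __attribute__((section(".dtcm_bss")))
    static uint8_t isr_scratch[512];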
There are a bunch of plugins for Jenkins for similar things (link), like tracking warnings or test coverage. There’s nothing for firmware size however. Maybe something could be rigged using some of the tools in the linked article, into a generic firmware size tracking plugin?
That would be fantastic. CircleCI also has the concept of “Orbs” which we looked into.
I wanted to submit this article but due to the new rules I couldn’t, so glad to see someone else did it.
I’m happy with my Linux-sunxi hardware watchdog,
What rules?
These ones: https://lobste.rs/s/utbyws/mitigating_content_marketing
Are you affiliated with memfault?
No, but I do like the site (my kind of work, I write C++ for an embedded device at work). Same as fluentcpp and a few other C++ sites which I now cannot post anymore.
The only site I’m affiliated with is my personal site, see my profile for that link.
Gotcha. I thought they were looking for a way for non-affiliated folks to still post more than N posts by site.
they are not
This has as much to do with USB PD being complicated as it does with 32-bit MCUs becoming extremely cheap.
You can implement PD with a tiny 8-bit MCU and a few hundred bytes of RAM, but why bother when you can spend a few more cents and use a cortex-M0?
We’ve long crossed the threshold where the incremental engineering effort is more expensive than the incremental compute & RAM. This is the reason why USB chargers could fly us to the moon.
Remember kids, enums spread diseases: https://codecraft.co/2012/10/29/how-enums-spread-disease-and-how-to-cure-it/.
As for booleans, I generally hold the view that if your module has more than one boolean, you’ve implemented an implicit FSM with a bunch of invalid states. Make it explicit!
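For example, two booleans like "connected" and "busy" admit four combinations, at least one of which is meaningless; an explicit state enum only names the valid ones:

    enum class LinkState {
        Disconnected,
        Idle,           // connected, not busy
        Transferring    // connected and busy
    };
    // "disconnected but busy" simply cannot be represented.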
Maybe it wasn’t widely supported at the time, but I’d use enum class members instead of the C preprocessor for this. Something like this in C++:
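The snippet itself isn’t preserved in this thread; a minimal sketch of the idea (an enum wrapped in a class, as a later reply describes it, so lookups live next to the values without any macros) might be:

    #include <cstdint>

    class Color {
    public:
        enum Value : uint8_t { Red, Green, Blue };

        Color(Value v) : value_(v) {}

        const char *to_string() const {        // lookup lives with the type, no X-macros
            switch (value_) {
                case Red:   return "Red";
                case Green: return "Green";
                case Blue:  return "Blue";
            }
            return "?";
        }

    private:
        Value value_;
    };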
Not only because I really don’t like using macros, but also because it makes sense to be able to easily switch it to a non-enum class later.
Enum classes are a great solution to the problem, when they are available in your language.
If we’re talking about the same thing, in Kotlin these are “sealed classes” and in Rust they’re just “enums”. In both of these cases, the different variants of the enum can have different structures, because they’re classes instead of just instances. (But they can also be singleton classes.)
(Compare to Java, where all the variants of the enum have the same structure, because they’re just instances of the same class.)
I wasn’t even talking about having data-carrying variants. I was just talking about the ability to associate methods with your enum, mostly lookup tables. In Rust, you can do this with an associated impl block:
The C++ is actually really close to this, even though it’s technically an enum embedded in a class. What C++ doesn’t have is enums as tagged unions. C++ has a form of tagged unions in variant, but it’s not quite the same.
When using the terms for algebraic data types that’s basically a sum of products.
That article is about a very different use of enum than this article talks about. Enums used as flags are pretty harmless.
The “what cable to use for what use case” part is extremely useful. I love the modularity of Type-C, but I have to admit the story around cable has gotten very messy.
this is a really great reference, thanks!
Thanks! We had to learn about this stuff the hard way - it’s odd there isn’t more content about BLE out there!
I usually use an Aardvark for this (https://www.totalphase.com/products/aardvark-i2cspi/). Though perhaps it is too slow for this specific use case.
I’d love to read more about OpenRISC vs. RISC-V. Anybody have experience with both?