1. 79

  2. 61

    With regard to use-after-free and double-free, this is solved in practice for heap allocations. The basic building block of heap allocation, page_allocator, uses a global, atomic, monotonically increasing address hint to mmap, ensuring that virtual address pages are not reused until the entire virtual address space has been exhausted. In practice, that is a very long time for 64-bit applications. The standard library GeneralPurposeAllocator in safe build modes follows a similar strategy for large allocations and, for small allocations, does not reuse slots. Similarly, an ArenaAllocator backed by page_allocator does not reuse any virtual addresses.
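    As a rough illustration of the trick (names and details here are mine, not the actual page_allocator code): a shared atomic hint is bumped past every mapping, steering mmap away from address ranges that were handed out before.

    ```c
    #define _DEFAULT_SOURCE
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Sketch of a page allocator that never re-issues address space:
       the hint starts at 0 so the kernel (and ASLR) picks the first base,
       then is bumped past each mapping. A real implementation would use a
       compare-and-swap loop, but the hint only needs to be roughly
       monotonic to serve its purpose. */
    static _Atomic uintptr_t next_hint = 0;

    static void *page_alloc(size_t len) {
        uintptr_t hint = atomic_load(&next_hint);
        void *p = mmap((void *)hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        /* Advance the hint past this mapping so the next call asks for
           fresh address space instead of anything previously unmapped. */
        atomic_store(&next_hint, (uintptr_t)p + len);
        return p;
    }

    int main(void) {
        void *a = page_alloc(4096);
        void *b = page_alloc(4096);
        printf("%d\n", a != NULL && b != NULL);
        return 0;
    }
    ```

    Note that without MAP_FIXED the hint is only advisory: if the requested range is taken, the kernel places the mapping elsewhere, which affects where you end up but not the correctness of the scheme.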

    This covers all the use cases of heap allocation, and so I think it’s worth making the safety table a bit more detailed to take this scenario into account. Additionally, as far as stack allocations go, there is a plan to do escape analysis and add this (optional) safety for stack allocations as well.

    As far as initialized memory goes, Zig forces you to initialize all variable declarations, so deliberately uninitialized memory must be marked with the keyword undefined. In safe build modes, assigning undefined writes 0xaa bytes to the memory. This is not enough to be considered “safe”, but I think it’s enough that the word “none” is not quite correct.
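    The 0xaa fill is the same idea as the debug-fill patterns some C allocators use; a minimal illustrative sketch (not Zig’s implementation):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Debug-fill sketch: fill fresh allocations with 0xaa so reads of
       "uninitialized" memory produce a recognizable pattern instead of
       whatever happened to be there before. */
    static void *debug_alloc(size_t n) {
        void *p = malloc(n);
        if (p)
            memset(p, 0xaa, n);
        return p;
    }

    int main(void) {
        unsigned char *p = debug_alloc(4);
        if (!p)
            return 1;
        printf("%02x\n", p[0]); /* the fill pattern, not garbage */
        free(p);
        return 0;
    }
    ```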

    As for data races, this is an area where Rust completely crushes Zig in terms of safety, hands down.

    I do want to note that the safety story in Zig is under active development; it will be worth checking back in a year or two to see what has changed :)

    1. 17

      It seems slightly unfair to credit the Zig language for features from its allocator library, since C is just as capable of using alternative allocators, and plenty of safer ones exist.

      If the Zig allocator never reuses freed heap blocks, that seems like it’d cause a ton of heap bloat, since a chunk acquired from VM can’t be unmapped until all blocks in it have been freed, and normal fragmentation means that can take a long time. Is this really practical? (I’ve read about techniques where multiple partly empty pages can be cleverly remapped to the same page if the live blocks don’t overlap, but I’m not aware of that being used outside research projects.)

      (Not to mention that this only works on OSs with robust virtual memory, on CPUs with an MMU, ruling out all embedded systems below the Raspberry-Pi scale.)

      1. 22

        Sure, we can categorize stuff into language/std lib, but if you ask the question “how safe is Zig?”, then that’s asking about the common-path use case, right? That includes the std lib, common practices, and the actual experience that someone has. For example, with C you have libraries calling malloc/free, but with Zig you have all libraries accepting allocator parameters.

        Regarding the practicality and heap bloat: so far we are only at the proof-of-concept stage. I don’t have a set of benchmarks I can point to that would definitively answer the question, but I can say that, empirically, it’s fine.

        It’s not my intention to make a blanket statement either way about safety. I think that the words “none”, “runtime”, and “comptime” are not granular enough to communicate the state of affairs.

      2. 9

        page_allocator, uses a global, atomic, monotonically increasing address hint to mmap,

        Does this not work against ASLR (for data, not just code)?

        1. 10

          For the first allocation, it passes 0, so the kernel can use ASLR to decide the base address.

          After that, the addresses become predictable. However, I noticed empirically that the Linux kernel consistently gives back the same page that you previously unmapped - a pattern that is even more directly exploitable than predictable addresses, and exactly the one that avoiding address space re-use prevents.

          It might be interesting to use some kind of PRNG for address hints. I have not looked into this yet.

          But I also want to note that I think ASLR is kinda lame. It complicates the runtime of a process but the security it adds can still be pretty easily defeated. Not a satisfying hardening to me.

        2. 6

          page_allocator, uses a global, atomic, monotonically increasing address hint to mmap,

          I’ll mention that.

          The standard library GeneralPurposeAllocator in safe build modes

          I was under the impression that the performance/fragmentation cost of the GPA was fairly substantial, although I haven’t tested it yet. In particular I’d be concerned that not reusing slots means I would end up holding on to a lot of mostly-empty pages.

          Do you have any measurements of the impact of eg swapping the bootstrapped compiler between GPA and libc?

          And in safe build modes, this writes 0xaa bytes to the memory.

          I had considered adding this but the docs say it’s only in debug mode. Does it also happen in release-safe?

          worth checking back in on in a year or two and see what has changed :)

          I’m looking forward to it :)

          1. 4

            I have not attempted a high-performance implementation that still provides safety measures - the current implementation is so far optimized for debugging only. I’m sure a benchmark of it would come out significantly slower than what is possible, let alone when compared with allocators that do not provide such safety features.

            Hope to report back on this soon with some hard data!

            1. 2

              Awesome :)

              Feel free to ping me and I’ll update the article.

              1. 1

                I’d love to hear about this! I’m curious: do you allocate a new page for each allocation, or still chunk the page up for small allocations and then free the page when the last allocation in it is freed? I figured the only way to ensure a segfault on use-after-free would be a page-per-alloc system, but that sounds like it could be very wasteful. You do still get a guarantee that you’re not pointing into something else with the second scheme, but a hard “use after free will crash” guarantee is very desirable IMO.

                1. 3

                  Still chunk the page up for small allocations, and free the page when the last allocation in it is freed. You’re right, it would be much nicer to always get a segfault! Unfortunately I think that wastes too much RAM.

                  Here’s the code; the implementation is only 600 lines long.
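                  A minimal sketch of that scheme - chunk a page into slots, never reuse a freed slot, release the page once every slot has been both allocated and freed. All names and sizes here are made up for illustration; this is not the actual Zig code:

                  ```c
                  #include <stdio.h>
                  #include <stdlib.h>

                  #define PAGE_SIZE 4096
                  #define SLOT_SIZE 64
                  #define SLOTS_PER_PAGE (PAGE_SIZE / SLOT_SIZE)

                  struct page {
                      char *mem;
                      int next_slot; /* bump index: freed slots are never handed out again */
                      int live;      /* slots currently allocated */
                  };

                  static void *slot_alloc(struct page *pg) {
                      if (pg->next_slot == SLOTS_PER_PAGE)
                          return NULL; /* page exhausted: a real allocator grabs a new page */
                      pg->live++;
                      return pg->mem + SLOT_SIZE * pg->next_slot++;
                  }

                  /* Returns 1 if the underlying page was released. */
                  static int slot_free(struct page *pg, void *p) {
                      (void)p; /* a real allocator would validate and poison the slot here */
                      pg->live--;
                      if (pg->live == 0 && pg->next_slot == SLOTS_PER_PAGE) {
                          free(pg->mem); /* every slot used up and dead: page can go */
                          pg->mem = NULL;
                          return 1;
                      }
                      return 0;
                  }

                  int main(void) {
                      struct page pg = { malloc(PAGE_SIZE), 0, 0 };
                      void *ptrs[SLOTS_PER_PAGE];
                      for (int i = 0; i < SLOTS_PER_PAGE; i++)
                          ptrs[i] = slot_alloc(&pg);
                      int released = 0;
                      for (int i = 0; i < SLOTS_PER_PAGE; i++)
                          released = slot_free(&pg, ptrs[i]);
                      printf("released: %d\n", released);
                      return 0;
                  }
                  ```

                  The trade-off discussed above is visible here: within a page, a stale pointer can’t alias a new allocation (slots are never reissued), but a use-after-free only segfaults once the whole page is unmapped.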

          2. 19

            I think an under-appreciated aspect is that Rust not only prevents safety issues from being exploitable, but in many cases prevents them from happening in the first place.

            Contrast this with C where safety efforts are all focused on crashing as fast as possible. Things are nullable, but you get segfaults. Buffers are unsafe, but you have crashes on guard pages and canaries. Memory management is ad-hoc, but the allocator may detect an invalid free and abort.

            Rust in some places ensures safety by panicking too, but it also has plenty of tools that guarantee safety by construction, like move semantics, borrowing, iterators, match on options and results. So the bar is higher than merely “not exploitable because it crashes”. It’s not exploitable, because it works correctly.

            1. 10

              Interestingly, “crash early, crash often” is actually, AFAIU, an approach that can also lead to really robust software when properly embraced, as in Erlang. Personally I’m not convinced C is a particularly exemplary follower of this philosophy, at least as seen in the wild. With that said, I sometimes wonder if this isn’t something of a blind spot for Rust - i.e. stuff like cosmic rays (which I’ve read start becoming noticeable at some scales), or processor bugs and attacks based on them. Maybe some of those risks can be mitigated with stuff like ECC memory? I’m certainly not an expert in this area, FWIW - far from it. Just thinking out loud, I sometimes wonder whether the Erlang philosophy isn’t ultimately the winner in the degree of robustness/resiliency it can achieve.

              1. 7

                Rust has the “crash early” philosophy too (asserts, indexing panic, overflows can panic), but it just doesn’t need it as often at run time. You wouldn’t want to crash just for the sake of crashing :)

                Rust has all of C’s protections too. They’re added by LLVM, OS, and hardware.

                1. 20

                  Erlang’s “Let It Crash” philosophy is more than just ‘die a lot’ – the other part is that, recognizing that software can fail for a wide variety of reasons but that those reasons are frequently sporadic edge cases, you can get huge operational wins by having an intentional and nuanced system of recovery in-platform.

                  In practice what this means is that in Erlang/OTP, you frequently have tens of thousands, or even millions, of concurrent processes, as you might in an OS. Many of these processes are executing business logic, etc. But some of them are ‘supervisor’ processes, which detect when one of the processes they are monitoring has died, and then attempt to restart it (perhaps with a maximum number of attempts in a given time period). This is recursive; for complex applications you might have supervisor trees that are several layers deep.

                  The benefit here is that you start by writing the golden path. And then, sometimes, you just stop. Get illegal JSON input? Explode. Get an HTTP request you didn’t expect? Explode. Frequently, pragmatically, you don’t need to write the code that handles all of the garbage cases if your supervision tree is solid enough and your lightweight processes start ultra fast.
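                  For readers unfamiliar with the idea, the restart loop boils down to something like this toy fork/wait sketch in C - a deliberately tiny illustration of “restart on abnormal death, up to a limit”, nothing like a real OTP supervisor with trees and monitoring:

                  ```c
                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <sys/wait.h>
                  #include <unistd.h>

                  static void worker(void) {
                      /* Stand-in for business logic that hits an unexpected
                         case and "lets it crash". */
                      abort();
                  }

                  int main(void) {
                      int restarts = 0;
                      const int max_restarts = 3; /* like OTP's restart intensity limit */
                      while (restarts < max_restarts) {
                          pid_t pid = fork();
                          if (pid < 0)
                              break;          /* couldn't spawn: give up */
                          if (pid == 0) {
                              worker();
                              _exit(0);
                          }
                          int status;
                          waitpid(pid, &status, 0);
                          if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                              break;          /* clean exit: nothing to restart */
                          restarts++;         /* died abnormally: restart it */
                      }
                      printf("restarts: %d\n", restarts);
                      return 0;
                  }
                  ```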

                  This philosophy has downsides – if the processes are carrying a large amount of important internal state, then sometimes them dying can be inconvenient because resurrection might involve an expensive rebuild step. So you need to pay more than the usual amount of attention to the stateful/stateless boundaries.

                  Rust at one point had this as a core idea, but abandoned it for, in my opinion, poor reasons. And it now ends up grappling with janky concurrency coloring problems, as it unfortunately appears Zig will too.

                  1. 9

                    Rust at one point had this as a core idea, but abandoned it for, in my opinion, poor reasons.

                    The OTP monitoring system seems really, really cool. However, Rust abandoned green threads for a very good reason: performance. When the goal is to maximize performance and compete with C++, you can’t exactly ask for a heavy runtime with reduction counters and all the bells and whistles of BEAM, useful as they are.

                    The “coloring problem” caused by async stems from the same root cause: you can’t really have resizable stacks and lightweight threads in a language that is supposed to be able to run without even a heap allocator (“no-std” in Rust parlance). That’s better left to higher-level languages like Java or Go (or Python, had it made better choices).

                    1. 5

                      Here’s a more detailed explanation by Graydon Hoare, the original creator of Rust.

                      In short: Rust’s removal of green threads was for two reasons: 1) performance (as you mentioned), 2) difficult interoperation with C. Given Rust’s design constraints, including that you shouldn’t pay for what you don’t use, and that interoperation with C is expected to be easy, this approach didn’t work for Rust.

                      Edit: Replaced link to thread with a link to a collection, as Twitter’s default threading makes it easy to miss the second half.

                2. 5

                  Crash-early is fine, as long as the process has a supervisor process that promises to take care of restarting it. That’s the part that seems missing in a lot of systems (in C or otherwise); often it just seems kind of ad hoc. Having implemented some IPC in C++ in the past using POSIX APIs, I remember it being nontrivial for the parent process to handle the child’s death (but I may be misremembering). It’s certainly not something done for you, the way it is in Erlang.

                3. 4

                  Things are nullable, but you get segfaults.

                  If you are lucky. Note that this isn’t guaranteed and doesn’t always happen.

                  1. 1

                    Yes, the worst part is when something overwrites some other (valid) memory, causing an error when that memory is next accessed (or, worse, no error at all). Pretty much your only recourse at that point is to whip out hardware watchpoints (which IME tend to be buggy and tank performance).

                4. 9

                  Rust should have a few more caveats for integer overflow. In my projects, I sometimes enable the clippy lints that flag unchecked integer math (search the lint list for “integer”), which CI then rejects PRs over, and I often use the saturating and checked methods on integers to handle different cases of bounds interaction. Additionally, runtime overflow checks may be enabled for release builds via the profile settings in Cargo.toml (they are enabled in debug builds by default).

                  1. 2

                    Technically the article mentions it in a footnote, but yes, this is a particular issue in Rust. You cannot add/subtract untrusted input to a number and then check validity afterwards, as this opens you up to panics/DoS, unless you take care to use the checked or saturating variants of these operations. I do agree with this approach, but there is that element of frustration when a bug report comes up: “well, if it had just overflowed, this wouldn’t have been a problem, would it?”
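                    The defensive pattern being described - validate arithmetic on untrusted input rather than letting it wrap or panic - looks roughly like this in C, using the GCC/Clang __builtin_add_overflow intrinsic (the helper name is my own):

                    ```c
                    #include <stdbool.h>
                    #include <stdint.h>
                    #include <stdio.h>

                    /* Writes *out and returns true only if base + untrusted
                       fits in int32_t; otherwise leaves the caller to reject
                       the input instead of wrapping or trapping. */
                    static bool add_untrusted(int32_t base, int32_t untrusted, int32_t *out) {
                        return !__builtin_add_overflow(base, untrusted, out);
                    }

                    int main(void) {
                        int32_t r;
                        printf("%d\n", add_untrusted(100, 23, &r));      /* fits */
                        printf("%d\n", add_untrusted(INT32_MAX, 1, &r)); /* would overflow */
                        return 0;
                    }
                    ```

                    This is the moral equivalent of Rust’s checked_add returning an Option that the caller must handle.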

                    1. 2

                      I’ll add that, thanks.

                    2. 4

                      It’s worth keeping in mind that this worldview only includes memory safety in the definition of “safe”. Plan interference, deadlocks, SQL injection, or other high-level “interleaving” safety problems are not considered. The reason I bring this up is because memory safety is broadly a solved problem for any language implemented with a GC, and so the original article implicitly limits Zig’s safety to a specific niche where GCs are not available.

                      1. 4

                        Sorry, but wtf is “plan interference”? The paper doesn’t even bother defining it, even though it’s not, afaik, a common term.

                        SQL injections and deadlocks are, of course, a risk in every language, to the best of my knowledge.

                        1. 7

                          We have to define “plan” first. Originally, capability theorists were informal, and a plan was simply what a programmer intended for a chunk of code to do. We can be far more formal without losing the point: For a chunk of code, its plan is the collection of possible behaviors and effects that it can have in any context.

                          We often want to explicitly interleave plans. For example, suppose that we want to compute the sum of a list in Python 3:

                          s = sum(l)

                          And also we want to compute the arithmetic mean:

                          a = sum(l) / len(l)

                          Then we could deliberately interfere the two programs with each other to create a composite program which computes both the sum and the arithmetic mean, sharing work:

                          s = sum(l)
                          a = s / len(l)

                          Plan interference is a nebulous but real class of bugs that arise from the fact that, when we interleave programs together from component programs, we also interleave their behaviors, creating complex plans with non-trivial feedback and emergent new behaviors. If the linked paper seems to have trouble explaining itself, it’s because, as I understand it, this paper was the first attempt to explain all of this with sufficiently-convincing examples.

                          To translate the examples from the paper from E to Python 3, it is sufficient to imagine that sum() is locally overridden by a coroutine. Then, evaluation of sum() might suspend execution of the stack frame for an indefinite period of time. In my example, this is harmless, but the paper explains how easy it is for this to become a problem.

                          What we think is fundamental is a matter of perspective. I currently think that SQL injection is an API problem, for example, rather than a fundamental limitation of string-handling. In contrast, I think that plan interference and deadlocks are both fundamentally impossible to prevent at compile time. I could be wrong about all of this.

                      2. 3

                        Language facilities like defer can be considered a mitigation for some causes of use-after-free (and double-free?).

                        1. 2

                          I believe in the “use after free” and “double free” rows, the C and Zig columns are reversed. Right now they show C as safer than Zig.

                          1. 2


                            Reading markdown tables is hard.

                            I’ve moved those to the next section now anyway, since the performance impact is unclear and it might not be practical to use those in production.

                          2. 2

                            I don’t think the ‘c’ end of the scale is quite fair. Null pointer dereference is ‘safe’ as long as you’re in userland. And overflow and bounds checks are implemented by some environments (asan, tcc). It is an implementation issue, not a language issue.

                            1. 13

                              Null pointer deref is not necessarily safe. Eg:

                              int* array = func_returning_null(); /* array is NULL */
                              return array[a_very_big_number];    /* NULL plus a large offset may land in mapped memory */

                              Depending on the system I’m running on (is there an invalid page around 0, and how far does it extend?) and the size of a_very_big_number, this could well run over into valid memory and cause chaos. In practice, it’s normally fine though, I agree.

                              1. 4

                                The approach of Chrome is that null pointer dereferences with a consistent, small, fixed offset (FAQ) do not need to be treated as a security bug. So array[a_very_big_number] is clearly problematic but e.g. accessing a struct field through a null pointer is mostly harmless and can be expected to result in a non-exploitable crash.

                                1. 2

                                  That’s pretty reasonable. A language feature that would ignore small fixed offsets like that, but insert null checks for large or dynamic offsets would be a nice thing to have.

                              2. 10

                                In fact, it is not simply not safe, it’s even worse than the article makes it seem like: dereferencing a null pointer is undefined behavior, which allows the compiler to produce arbitrary code in case it detects that a null pointer is being dereferenced—in particular, the compiler can safely assume that if a pointer is ever dereferenced, it can’t be null, so null checks can be removed altogether. An example: https://blog.kevinhu.me/2016/02/21/Avoid-Nasal-Damons/

                                EDIT: a more relevant example: https://software.intel.com/content/www/us/en/develop/blogs/null-pointer-dereferencing-causes-undefined-behavior.html
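                                 Reduced to a sketch, the pattern from the linked articles looks like this (the function is hypothetical; with optimizations on, the compiler is allowed to delete the null check because the dereference precedes it):

                                 ```c
                                 #include <stdio.h>

                                 static int read_flags(int *p) {
                                     int v = *p;    /* dereference happens first */
                                     if (p == NULL) /* a conforming compiler may assume p != NULL
                                                       here and remove this branch entirely */
                                         return -1;
                                     return v;
                                 }

                                 int main(void) {
                                     int x = 7;
                                     printf("%d\n", read_flags(&x)); /* calling with NULL is UB */
                                     return 0;
                                 }
                                 ```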

                                1. 2

                                  -fno-delete-null-pointer-checks. Yes, it’s dumb that you need it, but it’s there.

                                  (Related: pick your poison from among -fwrapv and -ftrapv.)

                                  1. 3

                                    Sure, but we are talking about C-the-language here, not C-with-tooling-specific-workarounds-for-nasty-language-issues; otherwise, this would be a very different discussion, as then we could also mention the plethora of analysis tools.

                                    1. 0

                                      My impression is we are talking about programs written in c. Of course you can run analysis and verification tools on c, but most existing c programs will not pass muster regardless of how safe they are in practice. On the other hand most code can be compiled by gcc or clang with corresponding flags. (And for platforms not supported thereby, compilers—e.g. sdcc—are unlikely to optimize in ways that make it an interesting consideration.)

                                      1. 3

                                        I see your point: it is true that adding -fno-delete-null-pointer-checks is free in the sense that it fixes bugs without requiring anything else from the programmer, whereas running additional checkers requires changing the program.

                                        That said, how many C programmers don’t use verification tools but do know to use this flag? I’d been writing C++ (which also suffers from this problem) for three years before learning of the flag, and I did attempt at the time to learn the language deeply. The fact that the flag exists doesn’t make a difference to the plethora of programs written in C that don’t use it already.

                                        So, it is absolutely a language issue that the barrier for writing in C-the-language is so low (just pick up K&R and hack away), but, comparatively, much, much more effort is required for learning the fractal of gotchas and the ways to mitigate them—before which one has to somehow learn that this fractal exists and reading K&R is simply the beginning of learning to write C without being dangerous.

                                2. 5

                                  In addition to the earlier two rebuttals, may I point out that a userspace crash may turn into a DoS vulnerability. Even if the crashed process isn’t user-visible, and a parent process transparently restarts it, the rapid generation of crash logs and/or core dumps can take a toll.

                                  1. 0

                                    That’s a higher-level concern, though, and it affects zig and rust as well.

                                  2. 2

                                    I’ve clarified that the table refers to software as it is typically run in production. The vast majority of c code running today is not doing bounds checks. The vast majority of rust code is.

                                    Also it’s not recommended to run asan in production eg https://www.openwall.com/lists/oss-security/2016/02/17/9

                                    1. 0

                                      Asan should not be used for production, yes. Tcc’s bounds-checking interfaces, on the other hand, are fine.

                                    2. 1

                                      Another reason why it’s not necessarily safe is that this definition of “safe” depends on the operating system. Newer operating systems reserve an inaccessible page at address 0, causing access violations and (usually) crashes. On older OSes, or in embedded cases, this may not happen, and a null pointer may be treated as a normal pointer (after all, it’s undefined behavior, so compilers will happily allow that unless additional options are turned on).