1. 9
  1.  

    1. 5

      An empirical increase in the rate of vulnerabilities is a necessary condition because some languages technically meet the first two criteria, but the potential for vulnerabilities is sufficiently marginal that in practice they do not occur at a problematic rate. Go is an important example of such a language.

      What is the increase relative to? How large does it need to be? This says “empirical increase”, but few languages have a safe and unsafe version in popular use. Of the languages that do, as the author notes, the safe version is too specialised to be worth talking about. And if this property were measurable, just one memory safety vulnerability would be enough to classify any language as “memory unsafe”. (As it should be! But the post is explicitly trying to avoid this.)

      I know I’m nitpicking, but the author is trying to sound like he’s formalised something (the notion of a theoretical problem being practically insignificant) without bothering to actually do it. In a post that’s dedicated entirely to increasing the precision of the discussion, I find that a little weird.

    2. 4

      Is the article meant to end just two paragraphs after the heading “What problems are we addressing?”, or… am I not seeing more because it wants to load further paragraphs lazily with JavaScript?

      1. 2

        It’s not just you, it seems to just end there, drifting off into outer space on a wisp.

    3. 3

      We’re interested in languages that have this behavior by default…

      That phrase “by default” does a lot of heavy ontological work. Some folks work in safe subsets “by default,” as with the example of Rust; some folks work with restricted imports “by default,” as with the examples of Java and Python. But not everybody applies those limitations to their projects.

      1. 5

        This seems vacuous. We live in the real world; the world is not safe.

        1. 1

          What are the external pragmatics of e.g. a pocket calculator? Not all computers are hooked up to missile systems, the Internet, or other common examples of dangerous external I/O. And yet we probably want memory-safe pocket calculators, so that they do not turn into weird machines. (Or perhaps we want memory-unsafe pocket calculators, so that we can take control of our Texas Instruments calculators. Who can say…)

          1. 4

            The ‘external pragmatics’ of a pocket calculator are that if I put it in the oven, it’s going to end up in a pretty weird state no matter what.

            The goal is to reduce defects. When your tools are less pointy, it’s harder to cut yourself, but you can’t get anything done with a blunt saw.

            For what it is worth, I think it would be a lovely thing if the only popular computing systems were ones in which the capability to perform ‘memory-unsafe’ operations were doled out only sparingly and granularly. But you haven’t argued that, or indeed anything at all; only a vague platitude.

      2. 4

        I’m reading that “by default” to imply two important positions:

        1. The presence of FFI intrinsics does not inherently render a language memory-unsafe.

        2. Whether a language is memory-unsafe depends on the idioms and practices applied in its natural state, not on some carefully-crafted subset.

        For (1), if the presence of ctypes or unsafe { } meant that Python and Rust aren’t memory-safe, then it would be impossible by definition for any memory-safe language to be used for most programming tasks. There must be some way to either call out to external libraries (such as libc) or invoke OS syscalls directly; otherwise writing even the most trivial “hello world” binary would be impossible.
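
        As a minimal sketch of that point, consider what even a “hello world” looks like in Rust at such a boundary: the program has to declare and trust a libc signature, and the compiler can only mark that trust with unsafe, not verify it.

        extern "C" {
            // write(2) from libc; this declaration is asserted by the
            // programmer, not checked by the Rust compiler.
            fn write(fd: i32, buf: *const u8, count: usize) -> isize;
        }

        fn main() {
            let msg = b"hello, world\n";
            // The call is unsafe because Rust cannot verify that libc
            // honours the (fd, buf, count) contract.
            let written = unsafe { write(1, msg.as_ptr(), msg.len()) };
            assert_eq!(written, msg.len() as isize);
        }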

        For (2), it is possible to define subsets of nearly any language that can be proven memory-safe. C is memory-unsafe because it’s possible to write int x = 0; int *y = &x; int z = y[100]; – the existence of specialized safety-oriented dialects such as MISRA-C or Barr-C does not change this fundamental property of the C language.

        1. 8

          I fundamentally disagree with this framing. The reason that memory safety is such a problem is that violating it invalidates axioms that unrelated parts of the program depend on. If I hold a pointer to an object and do not share it, the contents of that object will not change unless I change it and it can hold secrets that are not exposed to the rest of my code. If I hold a pointer to an immutable object, that object will not change. I can build complex systems starting from these assumptions. If there is a use after free bug or an out of bounds memory access then these assumptions no longer hold. It doesn’t matter where the memory safety bug is, if it’s in the same protection domain (address space on conventional systems) then all bets are off for the whole of the rest of my system.
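
          As a minimal sketch of what that means in practice, a single out-of-bounds write inside one unsafe block can invalidate an invariant that entirely safe code elsewhere was relying on:

          struct Limit {
              max: usize, // invariant: always 10; only this module constructs it
          }

          impl Limit {
              fn new() -> Limit { Limit { max: 10 } }
              fn check(&self, n: usize) -> bool { n <= self.max }
          }

          fn main() {
              let limit = Limit::new();
              let mut buf = [0u8; 4];

              unsafe {
                  // The memory-safety bug lives here, "somewhere else" in the
                  // program: it writes past buf and may land on limit or on
                  // anything else in the address space (undefined behaviour).
                  let p = buf.as_mut_ptr();
                  *p.add(64) = 0xff;
              }

              // This code is safe and still type-checks, but the axiom it
              // depends on may no longer hold.
              assert!(limit.check(5));
          }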

          This is why CHERI is designed to provide guarantees that work across languages. With the CHERI JNI work, I showed that you can implement FFI from a type safe language into arbitrary native code without allowing it to violate type safety. With CHERIoT, we give you a strong set of memory safety guarantees that everything from assembly on up must respect, which the compilers can use to ensure foreign code does not violate their assumptions.

          With Verona, we are building the interoperability model on top of the region abstraction, so we can express a set of objects that can be corrupted by a call to a foreign library. We can then implement that isolation with SFI, process isolation, CHERI, or any other mechanisms that come along.

          The fact that a problem is hard to solve is no excuse to redefine it as a simpler one and pretend that you’ve solved it.

          1. 3

            As you may have noticed from reading the linked article, this discussion is about memory safety in the context of programming languages that target hardware people are using today (largely x86 and ARM).

            If you want to implement fine-grained address space subdivision and pointer authentication in custom hardware and rebase your entire software stack on top, then that (1) sounds like a fun project, and (2) will not do anything to help the hundreds of millions of people who are affected by (for example) buffer overflows in their browser’s image codecs.

            Also, you may want to re-read my post to make sure you understand what you’re “fundamentally disagree”ing with.

            1. 5

              As you may have noticed from reading the linked article, this discussion is about memory safety in the context of programming languages that target hardware people are using today (largely x86 and ARM).

              I did read the article. Verona also targets existing hardware (and future hardware) and is building a foreign library interface that does not compromise memory safety. There’s other work on JNI using NaCl and other in-address-space sandboxing mechanisms (such as lightweight SFI using MPK and CET) that has demonstrated that this is not infeasible. CHERI just makes it faster (in part by eliminating defensive copies and indirection layers).

              I don’t like FFI as a concept because it frames the interoperability at the wrong level. Cross-language interoperability is easier if you expose a foreign library interface, rather than a foreign function interface. You can then attach a notion of a context for the foreign code and have some language-level abstractions that you can attach different isolation mechanisms to that allow safe interoperability with unsafe languages.

              1. 1

                The thing is that you’re describing a research project that does not (and cannot) provide memory safety on current-gen consumer hardware, which is completely unrelated both to the article and to my post.

                I’d be happy to be proven wrong here. For example, given the following function signature:

                uint32_t safe_strnlen(const char *buf, uint32_t maxlen);
                

                how would you use Verona to create a shared library that can be linked into a C program, then run on a currently available consumer x86 CPU, and which will not read outside the allocation of buf regardless of its content or the value of maxlen?

                For example, if invoked like this:

                char a = 'a';
                uint32_t garbage = safe_strnlen(&a, 10);
                

                With CHERI hardware the answer is easy – buf is associated with additional metadata about its bounds – but that would be, as you note, solving a different-and-simpler problem. My desktop’s CPU doesn’t have CHERI support.

                Thus, the desire to enforce memory-safety at the language level, with annotations (Rust’s unsafe, etc) at the points where the language’s safety model must be augmented by human inspection.
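
                As a sketch of that idea (changing the earlier signature so the bound travels with the buffer), the unsafe block marks exactly the point where human inspection vouches for the C declaration, while callers only ever see a safe API:

                extern "C" {
                    // strnlen(3) from libc; trusted by the programmer, not checked by rustc.
                    fn strnlen(s: *const u8, maxlen: usize) -> usize;
                }

                // Safe wrapper: the slice carries its own length, so maxlen can
                // never exceed the allocation it refers to.
                fn safe_strnlen(buf: &[u8]) -> usize {
                    unsafe { strnlen(buf.as_ptr(), buf.len()) }
                }

                fn main() {
                    let a = [b'a'];
                    // Unlike the C call above, there is no way to claim a bound
                    // of 10 for a one-byte buffer: the type system supplies the
                    // real length.
                    assert_eq!(safe_strnlen(&a), 1);
                }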

                I don’t like FFI as a concept because it frames the interoperability at the wrong level. Cross-language interoperability is easier if you expose a foreign library interface, rather than a foreign function interface.

                I’m using “FFI” in the industry sense of a language-native mechanism to invoke code written in a different language. It doesn’t imply that the interface needs to be function-oriented, and there are many examples of FFIs that operate at a less granular level.

                1. 2

                  how would you use Verona to create a shared library that can be linked into a C program, then run on a currently available consumer x86 CPU, and which will not read outside the allocation of buf regardless of its content or the value of maxlen?

                  That’s not what Verona guarantees; it guarantees that memory safety bugs in foreign code do not impact the memory safety of the safe code. Our baseline for providing isolation on unmodified operating systems and unmodified hardware is here. With some fairly small OS changes (an intern prototyped these in Linux a couple of years ago), we can make it a lot faster. Firefox ships with an SFI mechanism that makes it much faster but doesn’t allow assembly code.

                  Note that most of this is based on work I and others did prior to Rust existing; we’re just working out how to cleanly surface it in a language, in this case Verona.

          2. 1

            I’m interested in learning more about CHERI JNI. In the JNI-based library bindings that I’ve seen and written, it’s common to store native pointers as Java longs (64-bit integers). I assume that pointers on a CHERI system can’t be converted to and from longs. Is there already a better way to store native pointers, that will ensure that the library is portable to CHERI systems? Or did CHERI have to extend JNI for this?

            1. 1

              I don’t have any experience with CHERI JNI and the paper doesn’t describe representing native pointers in Java, but in general the solution is to separate the pointer types from the integer hierarchy. The pointer type can then be sized appropriately for the target platform – on CHERI they’d be long enough to store the entire (capability, address) tuple.

              For Java it’s a little tricky because the compiled code is platform-independent, so you’d probably end up with every native pointer being boxed. In practical terms I think this means a wrapper around BigInteger (or byte[]), with some helpers on the JNI side to reconstitute the original pointer (including check bit). The resulting code would be different from current JNI best practices.

            2. 1

              Here is the paper. We didn’t do that, but allowing you to attach pointers to objects might be possible. I enforced quite a strong isolation model and didn’t allow any communication between JNI sandboxes other than via Java, so passing pointers between them is not allowed. Sandboxes had one of three scopes (this was a prototype, not necessarily how I’d do it in a real environment):

              • Global sandboxes persisted for the entire program lifetime. Every call was in the same sandbox. You could use a map from Java object IDs to pointers in the sandbox’s globals.
              • Object-scoped sandboxes were associated with a single object. Every native method on an object with this scope runs in the same sandbox, different instances of the object get their own sandboxes. You can store things in globals for this and they’re private to the object that owns the sandbox.
              • Method-scoped sandboxes give you a completely fresh sandbox for every call. This is mostly for ephemeral things, but avoids any of the temporal-safety overhead because you do temporal safety for the sandbox by destroying the sandbox at the end of the call.

              Java objects that were passed into C code were passed as sealed capabilities. These then participated in GC (each compartment was scanned to find sealed capabilities and these were treated as roots). Java NIO buffers were similarly handled: they were exported as capabilities and not GC’d until those capabilities were gone from the native code.

        2. 1

          Sure. (2) first, since I fully agree; e.g. the subset of C without pointers is quite a bit safer than one might expect, and can express some non-trivial algorithms.

          For (1), though, I disagree. Every library bound via FFI is a new attack surface which needs to be considered in threat modeling. That even applies without FFI; consider e.g. Expat, which is bound by many implementations of popular languages and is a perennial security hazard.

          Further, I think it’s worth examining what we mean by “most programming tasks.” Sure, the typical Rosetta Code exercise is going to involve I/O; but I suspect that quite a few workloads can be expressed not in terms of binding lower-level libraries, but in terms of being hosted by a high-level managed runtime. The typical GPU and DSP workloads are managed in this way, for example.

          1. 1

            For (1), though, I disagree. Every library bound via FFI is a new attack surface which needs to be considered in threat modeling.

            To clarify, do you believe that Rust or Java are memory-safe languages?

            If you contracted an external developer to write a JPEG decoder and specified that it must be “written in a memory-safe language”, and they delivered one written in pure Rust, would you accept that as meeting the terms of the contract?

            Because the answer to both questions is only “yes” if you agree with (1). Rust supports FFI (it can invoke code written in other languages), therefore Rust is a memory-safe language if-and-only-if the presence of FFI functionality does not disqualify a language from being memory-safe.

            If you really do disagree with (1) – if you think that Rust, Java, C#, and so on are not memory-safe languages – then your position is really far outside of the mainstream.

            Further, I think it’s worth examining what we mean by “most programming tasks.”

            I mean, I can do a survey of the binaries and shared libraries on a typical Windows/macOS/Linux computer and count what percentage of them invoke a syscall (directly or via libc / msvcrt / libSystem, etc). It’s likely that number rounds to 100%.

            I suspect that quite a few workloads can be expressed not in terms of binding lower-level libraries, but in terms of being hosted by a high-level managed runtime.

            I’m having some trouble understanding what this has to do with the conversation. We’re talking about systems programming, so the existence of (for example) GPU compute shaders or sandboxed bytecode is out of scope.

            1. 2

              Rust as it is practiced is not memory-safe; I started a thread about it on the fediverse recently. Java’s memory-safety is directly related to which unsafe, javax, and sun packages are available. I don’t know enough about C♯ to say for sure either way.

              The entirety of “systems programming” is upside-down, yes. We start from a carefully-delimited finite state machine (the hardware!), and invent several mismatching abstractions which generate weird machines. In particular, we keep using the antipattern of defining an abstract machine (the C abstract machine, the JVM with JNI, LLVM, etc.) and then mapping it onto hardware in a weird fashion, hoping that our compilers implement suitable homomorphisms.

              It’s not just the machine, either. Our entire approach to loading code is backwards. Right now, the typical linkage for libraries is via those same low-level “systems” primitives, and involves directly mapping bundles of trusted code (read: code whose vulnerabilities we inherit!) into memory spaces on a per-process basis. However, there are safer methods of loading code, and we can define safety properties in terms of code-loading behaviors; see e.g. p93 of Mark Miller’s thesis for one approach. It can be shown that memory-safety follows from Miller’s “loader isolation” properties, in the absence of explicit unsafe primitive operations.

              We’re talking about systems programming, so the existence of (for example) GPU compute shaders or sandboxed bytecode is out of scope.

              Let me explicitly refute this with a reminder: our CPUs execute sandboxed bytecode natively. GPUs are merely vector CPUs without I/O ports.

              1. 3

                Yeah, you’re using a definition of “memory-safe” that differs from pretty much everyone in the world. I think I understand sort of where you’re coming from, but if you try to use that definition in conversations with other people then it’ll just lead to mutual confusion + frustration.

                From your thread:

                “language L is memory-safe,” which is generally taken to imply that all programs in L are memory-safe by construction.

                I’ve never met anyone who would use this definition, and in ~25 years of professional software development I’ve not encountered a language that would both (1) qualify under your definition of memory-safe and (2) be usable as a general programming language for compilation to executable binaries. Of the languages I’ve used, the closest is GHC Haskell with the {-# LANGUAGE Safe #-} extension, but even that dialect can’t guarantee memory-safety of all programs written in it.

                Out of curiosity, if you were required to coin a new term for what the rest of the world calls “memory-safe”, what would you use? Something that describes the availability of machine-checked assertions about the lifetimes/bounds/mutability of memory locations, plus special-purpose intrinsics for memory-unsafe operations?

                1. 1

                  I don’t really buy arguments from incredulity or by popularity. Part of being a programming-language designer is a willingness to consider that perhaps the typical programmer is wrong in their beliefs and practices.

                  Professional software development is not concerned with correctness. As a result, there are very few popular languages which even approach memory-safety, although they do exist; from today’s TIOBE top 20, I see SQL, PHP, Scratch, and R.

                  I don’t understand why (2) is a serious obstacle, in theory; after all, any Turing-complete language should be recompilable, given a sufficiently universal platform. Instead, I think I’d examine why compilation to native binaries is ever desirable. The typical rationale is speed, but speed only emerges if the compiler is sufficiently smart, and I think that what we’re really after is some sort of magic optimizer which turns high-level code into low-level code. When the typical modern compiler suite breaks that into two distinct tasks, we start making the mistake of thinking that we must follow that modern compiler pipeline in order to have speed.

                  Like, to make this point sharper, consider the primitive recursive functionals, which form a language where every expression denotes a total function. This language is obviously (1). So, surely (2) is just a matter of engineering effort, right? And yet there is a paucity of theories for effectively compiling this language; the best one I know of is graph reduction, which happens to be the dominant strategy for Haskell compilers. There’s nothing wrong or bad with total functions, and they can be found as a subset of many popular languages, so I’m strongly tempted to lean on (2) as the less reasonable horn of this dilemma.

                  That said, RPython exists and offers a safe high-level API. Write Python 2.7 without ctypes, and obtain an efficient native binary. (1) and (2) can coexist.

                  I don’t grok what the rest of the world calls “memory-safe,” to be honest. It reminds me of the definition of “heuristic” given by Jeff Erickson: “A heuristic is an algorithm which is wrong.”

                  1. 3

                    TIL about PHP’s FFI. Thanks. I don’t know enough about R to know whether their interface is available by default.

                    RPython can be used without rffi, but I won’t belabor the point, other than to clarify that all of the JIT functionality is available without FFI.