Threads for bshanks

  1. 5

    It’s worth noting that a lot has changed since this paper. I added a separate address to capabilities so that they could be used as C pointers, for example. I also replaced the complicated sealing mechanism with something much simpler. Graeme Barnes (I think) added sentry capabilities, which let you jump to a target with a lightweight sealing-like mechanism. Jon Woodruff and a few others created the 128-bit capability representation.

    This paper was written when we could run an assembly microkernel on a CHERI CPU and had just enough support in FreeBSD that we could run something in userspace if it didn’t expect to use capabilities at the kernel interface. We had just enough compiler support that you could annotate some pointers as capabilities, but then you couldn’t do subtraction on them, so you had to keep them as separate pointers to buffers.

    Now we can run programs where every pointer (including implicit ones such as vtable indexes and frame pointers) is a capability, run the whole FreeBSD kernel in this mode, and run KDE on Wayland on top.

    1. 1

      What’s a good thing to read to get an overview of the more recent stuff that you mentioned?

      1. 1

        I’m not sure we have a good overview. Some of our Morello blogs give some context and there are a bunch of tutorial resources on chericpu.org.

      2. 1

        Thanks for clarifying. I find the CHERI website a little hard to navigate TBH and this paper was the best short summary of the ideas I could find.

      1. 6

        I have a 2020 Thelio desktop (AMD; thelio-r2 running GNU/Linux). It isn’t terrible, it’s reasonably quiet, it’s a reasonable size, so far customer support has been great, and i’d buy a Thelio again. But i have three complaints:

        • it doesn’t have any front USB ports
        • insufficient cooling. I wonder why they didn’t just put an additional fan in the side of the case?
          • it can’t handle the preinstalled Ryzen 9 3900X graphics card that i purchased with it; when under heavy load for about 20 minutes (for example, when playing a game), heat builds up and causes the rest of the system to shut down. I have to throttle the graphics card by about 30% to prevent this, at which point it’s not that great, and one of the supposed attractions of a desktop over a laptop was a great graphics card. I was hoping that if i bought a prebuilt desktop rather than building one myself, i would avoid this sort of problem.
          • my preinstalled internal NVMe SSD drive (Sabrent Rocket) crashed in 2022. I noticed it’s mounted right under the GPU, so given the problems with failing to dissipate the heat from the graphics card, i suspect this got too hot over time too.
          • this is ironic because System76 has a blog post about their careful optimization of airflow for cooling in the Thelios; also because they seem to care about the aesthetics of the case (which i don’t care about, but i do care about cooling)
        • it crashes from time to time (the system freezes and the fan starts running at full speed; probably not their fault; but my previous computers (laptops running GNU/Linux) didn’t have this problem – therefore I suspect it’s some problem with the GNU/Linux drivers for the AMD GPU)

        Any suggestions for my next desktop? I’d like something comparable to the Thelio in terms of power and size and quietness, but with some front USB ports, and a high end graphics card that can run at full power, and a minimum of fuss (ie “it just works”, eg sufficient cooling so that it doesn’t ever overheat and shut down, and my hard drive doesn’t crash after two years). I’d prefer pre-built but i’m willing to build it myself; with the 2020 Thelio i went pre-built because i figured if i did it myself i’d screw it up and buy some component that doesn’t work well with GNU/Linux, or put the thermal paste in the wrong place, or not provide enough cooling, or something. But since I didn’t achieve “it just works” with pre-built anyways, maybe i should just build it myself?

        Come to think of it, i should just ask System76 support if it would be feasible for me to replace the case on my 2020 Thelio with an aftermarket case with a side hole for a fan, and front-facing USB ports.

        1. 3

          insufficient cooling. I wonder why they didn’t just put an additional fan in the side of the case?

          it can’t handle the preinstalled Ryzen 9 3900X graphics card that i purchased with it; when under heavy load for about 20 minutes (for example, when playing a game), heat builds up and causes the rest of the system to shut down. I have to throttle the graphics card by about 30% to prevent this, at which point it’s not that great, and one of the supposed attractions of a desktop over a laptop was a great graphics card. I was hoping that if i bought a prebuilt desktop rather than building one myself, i would avoid this sort of problem.

          I’m having this exact same problem and have since I bought the unit. This is SUPER sad since I otherwise love the machine, but what the hell is the point of buying a monster desktop that you can’t even push to anything like its full potential?

          I kinda gave up gaming on the beast because running No Man’s Sky at anything but low detail/res settings causes the case to get BLAZING hot to the touch, and then the system shuts down.

          And now I’m stuck for at least another 5-6 years because my desktop budget needs to refill :)

          1. 2

            If I’m buying a desktop in 2022, I’d probably go for off-lease business desktop if I didn’t care much about graphics (as most are SFF). They’re very thick on the ground, fast, cheap, and low-trouble. Whitebox is very tempting, but I’ve had so many miserable and hard-to-debug issues with them.

            Of course, desktop Macs also put a wrench into things value wise. Next time it comes down to upgrade, I’m considering a Mac.

            1. 2

              I’m sad to hear this. I bought two of their laptops (over the years) and both have been extremely strange and unreliable beasts, but I was hoping this could be chalked up to their reluctance to design the laptops themselves. (Apparently they are re-branded imports.) Given the freedom of designing a whole desktop PC from components, they should have been able to do a much better job.

              1. 2

                my preinstalled internal NVMe SSD drive (Sabrent Rocket) crashed in 2022. I noticed it’s mounted right under the GPU, so given the problems with failing to dissipate the heat from the graphics card, i suspect this got too hot over time too.

                This is an annoying anti-pattern common in many motherboards I’m afraid. I believe it’s because NVMe connects directly to the PCIe bus, and so the slot for it tends to take the space that would otherwise be occupied by a PCIe card. A double-width GPU in an adjacent slot will then happily sit right over it. It worked just fine a few years ago, but NVMe drives and GPUs both now tend to run hotter than they used to.

                1. 1

                  it crashes from time to time (the system freezes and the fan starts running at full speed; probably not their fault; but my previous computers (laptops running GNU/Linux) didn’t have this problem – therefore I suspect it’s some problem with the GNU/Linux drivers for the AMD GPU)

                  Oh my god. I’ve had this exact problem for the entire lifetime of my AMD card. It’s not a Linux problem, I’ve hit (and can deterministically reproduce) this problem on Windows too. The only thing that kinda worked was tuning the fan curves really aggressively to the point that the fans spin up at the slightest 3d rendering. I’ve tried a lot of stuff up to re-pasting the card and nothing helped.

                  Not buying an AMD card again.

                  1. 1

                    What card, out of curiosity? Would be nice to have something to avoid.

                    1. 2

                      An RX590

                  2. 1

                    A good method of guessing how probable cooling problems are with a given computer is to look at how much ventilation the case has. Small windows and/or grilles in the corners? Trouble. Yet for some reason plenty of case manufacturers keep building cases like this. For example, I love the aesthetics of Fractal Design Define cases, but they run hotter and louder than their Meshify cases that have a full mesh front panel.

                    1. 1

                      I think the best move is to use a customizable gaming-focused company like iBuyPower, where you can pretty much spec out whatever you want and they put it together for you, to have them source roughly the same hardware as System76 uses and put it in a chassis with better air vents + fans + front ports; then you install Pop!_OS (System76’s distro, and coincidentally by far my favorite consumer-focused Linux distribution!) on it yourself when the fully built rig arrives in the mail.

                      As long as the underlying CPU/GPU combination is the same, and you’re using a motherboard that has compatible WiFi and Bluetooth, I think you’ll end up with very similar Linux/Pop!_OS compatibility, but better thermals, performance, and longevity. System76 seems to optimize for having an aesthetically-pleasing chassis over thermals, and if you don’t care about the former (or enjoy gaming-style aesthetics, where thermals are an important design consideration) you can do a lot better on the latter. You can probably even control any RGB lighting you’ve had them set up for you, if you’re into that sort of thing, via OpenRGB!

                      One thing I’d stress though: specifically for the motherboard, make sure you’re checking for Ubuntu compat, not “Linux.” WiFi/Bluetooth drivers ship in the kernel, so while the latest kernel may have support for the drivers, that kernel may not yet be used in the latest version of Ubuntu/Pop!_OS. Since Ubuntu is extremely common amongst Linux distros, checking for Ubuntu compat should be fairly easy via Google, and if it’s Ubuntu-compatible it should be Pop-compatible since they use the same kernel.

                      And by using something like iBuyPower you have roughly the convenience of a prebuilt, minus having to install the OS yourself and having to do an upfront check to make sure you’re using a motherboard with WiFi and Bluetooth that work with Ubuntu.

                      You could also just build a desktop yourself! It’s not thaaaat time-consuming. But if you’d rather not spend a day clicking and screwing parts together, and managing a panoply of hardware orders from various websites, that’s valid and there are other options.

                      1. 1

                        I just took a look at iBuyPower on your suggestion, and it seems like they don’t really address the biggest problem of doing a custom build: the research required to pick all of the components out. Snapping the parts together is easy enough; the benefit of a pre-built is not having to select all of the components individually. It does look like iBuyPower has some pre-built machines, but then you are back to the “might not work with Linux” problem.

                        A lot of gaming focused companies also, frustratingly, seem to top out at 32gb of ram these days. That’s fine for gaming still, but increasingly not fine for a lot of other workloads. I know ram is upgradable later, but you often end up paying for ram you need to throw out (or deal with reselling) because they do 2x16 or 4x8 configurations.

                    1. 1

                      I do not think there is a “simple” collection scheme that is going to meet your requirements of being “simple” to implement. Automatic memory management is actually a hard and complicated problem, as evidenced by the fact that there are constant updates to algorithms and the state of the art.

                      So, I think there are two obvious paths here:

                      1. Accept the dependency and adopt the Boehm-Demers-Weiser collector
                      2. Roll up your sleeves and spend the time to research what’s out there, and work through the features of each algorithm and figure out which tradeoffs you can deal with for your purpose.

                      The other option is to build a simple Mark-Sweep collector and do nothing until the language is far enough along that you’re actually seeing it be problematic in the real world. At this point, investing in a rabbit hole will actually pay off because you’ve proven that the language has enough legs that the effort is worth it.

                      1. 1

                        Sorry, I didn’t intend to say that I wanted to do automatic memory management in just a “few” lines of code! I just meant that I would rather have fewER lines of code even at the cost of worse performance. I expect that any solution that detects cycles and has short pause times will still have many lines of code. I edited the question to clarify.

                        Regarding option #1, automatic memory management will be included in the somewhat self-hosting language implementation, therefore if the right answer is Boehm-Demers-Weiser, I will have to manually rewrite the Boehm-Demers-Weiser collector in my language line-by-line – hence my concern with lines of code in libraries. I’m hoping I can find something shorter than that. Also, it seems like Boehm-Demers-Weiser cannot be used in portable code, especially in concurrent, incremental mode; the docs say “Our collector is not, and cannot be, implemented as completely portable C code.” and “The collector uses a mark-sweep algorithm. It provides incremental and generational collection under operating systems which provide the right kind of virtual memory support. (Currently this includes SunOS[45], IRIX, OSF/1, Linux, and Windows, with varying restrictions.)”. This would make it impossible to translate into my higher-level language.

                        Regarding option #2, this post is part of that work.

                        Regarding a vanilla mark sweep collector, my understanding is that that could have very long pause times?

                        1. 4

                          Also, it seems like Boehm-Demers-Weiser cannot be used in portable code, especially in concurrent, incremental mode; the docs say “Our collector is not, and cannot be, implemented as completely portable C code.”

                          This is true and will be the case for almost any garbage collector. It needs to be able to scan stack memory, enumerate globals, and so on. You can do the ‘GC in uncooperative environments’ trick to avoid stack scanning, at the cost of increasing compiler complexity and reducing performance. You can avoid enumerating globals by tracking any that were exposed to your language as implicit roots. Boehm also uses OS-specific hooks for write barriers (e.g. marking pages as read-only to catch traps on accesses). Writing all of that in portable code will require building your own platform abstraction layers over all of these.
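
                          To make the write-barrier part concrete, here is a rough POSIX-only sketch of the page-protection trick (hypothetical names, not Boehm’s actual code): protect the heap, catch the first write to each page in a SIGSEGV handler, record the page as dirty, and unprotect it.

                            #include <signal.h>
                            #include <stdint.h>
                            #include <stdlib.h>
                            #include <sys/mman.h>

                            #define PAGE_SIZE 4096
                            static uint8_t *heap_start;   /* assume: mmap'd, page-aligned */
                            static size_t   heap_pages;
                            static uint8_t *dirty;        /* one byte per heap page */

                            static void barrier_handler(int sig, siginfo_t *info, void *ctx) {
                                (void)sig; (void)ctx;
                                uintptr_t addr = (uintptr_t)info->si_addr;
                                uintptr_t base = (uintptr_t)heap_start;
                                if (addr < base || addr >= base + heap_pages * PAGE_SIZE)
                                    abort();                      /* not our heap: a real fault */
                                size_t page = (addr - base) / PAGE_SIZE;
                                dirty[page] = 1;                  /* remember for the next GC cycle */
                                mprotect(heap_start + page * PAGE_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE);
                            }

                            void barrier_protect_heap(void) {     /* call at the start of a GC cycle */
                                struct sigaction sa = {0};
                                sa.sa_sigaction = barrier_handler;
                                sa.sa_flags = SA_SIGINFO;
                                sigaction(SIGSEGV, &sa, NULL);
                                mprotect(heap_start, heap_pages * PAGE_SIZE, PROT_READ);
                            }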

                          It would be good to understand why self hosting is a goal for you. In general, it’s a terrible idea for high-level languages because it requires that you provide type-system escape hatches that leave you with a language that combines the performance of Lua with the memory safety of C.

                          1. 1

                            Thanks, what is the “GC in uncooperative environments” trick – are you referring to https://hboehm.info/spe_gc_paper/preprint.pdf ?

                            My plan is to have a virtual machine layer which handles garbage collection (and thread scheduling and everything else that really wants platform-specific optimization), and then to write the rest of the language implementation targeting that virtual machine.

                            So for example, a virtual machine instruction might be “load object from memory into local variable”, and then while executing this instruction the virtual machine implementation could do things like invoke a read barrier, make a note of any new pointers which are being written into the stack, etc.

                            The “reference implementation” will be self-hosting and portable but if the language is used enough to justify it, more performant platform-specific implementations could be created later, in most cases by re-implementing the virtual machine for each platform. The reason for the reference implementation being self-hosting is to make it easier for developers to understand, modify, and port it.

                            So, the key for me is to understand how to structure the virtual machine in order to make it possible for later platform-specific implementations of it to do whatever they need to do to make garbage collection efficient. This is one reason why I do not want to put off the implementation of an incremental GC until later, because I don’t want to miss a requirement for the structure of a virtual machine (such as read or write barriers) that may be needed for fancy GCs but not for simple ones. Here are the requirements that I have collected so far, can you think of anything I should add to the following list? (A rough sketch of the corresponding interface follows the list.)

                            The runtime may need to be able to:

                            • produce a list of the locations of any pointers within data, including in:
                              • the registers
                              • the stack
                              • the heap
                              • the globals
                            • impose read barriers and write barriers
                            • interrupt execution every now and then and execute GC code
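
                            A rough sketch of what that interface could look like (every name here is hypothetical, just to make the list concrete):

                              typedef struct gc_object gc_object;              /* opaque managed object */
                              typedef void (*gc_visit_fn)(gc_object **slot, void *ctx);

                              typedef struct gc_hooks {
                                  /* enumerate every root slot: registers, stack, globals */
                                  void (*scan_roots)(gc_visit_fn visit, void *ctx);
                                  /* enumerate the pointer fields inside one heap object */
                                  void (*scan_object)(gc_object *obj, gc_visit_fn visit, void *ctx);
                                  /* barriers: the interpreter calls these on every managed load/store */
                                  gc_object *(*read_barrier)(gc_object **slot);
                                  void       (*write_barrier)(gc_object **slot, gc_object *value);
                                  /* called at safepoints so the collector can run a bounded step */
                                  void (*safepoint)(void);
                              } gc_hooks;
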
                            1. 2

                              Thanks, what is the “GC in uncooperative environments” trick – are you referring to https://hboehm.info/spe_gc_paper/preprint.pdf ?

                              Yup, that’s the one.

                              So for example, a virtual machine instruction might be “load object from memory into local variable”, and then while executing this instruction the virtual machine implementation could do things like invoke a read barrier, make a note of any new pointers which are being written into the stack, etc.

                              So then the VM can’t be implemented in the language, because the VM sits explicitly under the language’s type safety. If you can implement the GC in the language then the language must be able to express things like converting between integers and pointers to store mark bits inside pointers and so on.
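
                              For example, even something as small as keeping a mark bit in the low bit of an aligned pointer already needs the kind of integer/pointer conversion a safe language normally forbids (illustrative sketch only):

                                #include <stdint.h>

                                #define MARK_BIT ((uintptr_t)1)   /* objects are at least 2-byte aligned */

                                static inline void *set_mark(void *p)   { return (void *)((uintptr_t)p |  MARK_BIT); }
                                static inline void *clear_mark(void *p) { return (void *)((uintptr_t)p & ~MARK_BIT); }
                                static inline int   is_marked(void *p)  { return (int)((uintptr_t)p &  MARK_BIT); }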

                              It is possible that you are using the term ‘self-hosting’ in a way that I am unfamiliar with. Can you elaborate a bit?

                              produce a list of the locations of any pointers within data, including in: the registers, the stack, the heap, the globals

                              Or, in simpler terms, to produce a set of GC roots. This is necessary for a tracing GC, it is not necessary for a ref-counting model (even ref counting with cycle detection).

                              impose read barriers and write barriers

                              These are necessary for concurrent or incremental tracing. Slightly different barriers are necessary for reference counting.

                              interrupt execution every now and then and execute GC code

                              This is necessary for most tracing-based approaches (even mostly concurrent collection typically needs at least to pause each thread).

                              In my experience, there is a huge value in co-designing your memory management model with your language. Trying to produce a language that is agnostic to memory-management policy and an implementation that allows many different policies means that you will make a lot of compromises that won’t become apparent until you are a long way into writing code in the language.

                              1. 1

                                So then the VM can’t be implemented in the language, because the VM sits explicitly under the language’s type safety. If you can implement the GC in the language then the language must be able to express things like converting between integers and pointers to store mark bits inside pointers and so on.

                                It is possible that you are using the term ‘self-hosting’ in a way that I am unfamiliar with. Can you elaborate a bit?

                                Perhaps I shouldn’t have said self-hosting. With regards to garbage collection, my plan is that the reference implementation of the GC will be expressed in terms of operations of a portable virtual machine. This virtual machine will have an opaque pointer representation that will not support things like storing mark bits inside pointers. The reference implementation could instead do things like storing mark bits in separate memory locations next to the pointers. Similarly, it won’t be able to take advantage of things like using the OS’s virtual memory system to trap writes for write barriers; instead, the garbage collector implementation could create a “write-with-write-barrier” function and the convention would be that the rest of the higher-level-language implementation will always use the “write-with-write-barrier” function to write to managed memory.
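
                                As a sketch of that convention (my own illustration, hypothetical names; the mark state lives in a side table rather than inside the opaque pointers):

                                  typedef struct vm_ref { unsigned index; } vm_ref;   /* opaque VM reference */

                                  enum color { WHITE, GREY, BLACK };
                                  extern unsigned char mark_table[];                  /* one entry per object */
                                  extern vm_ref        heap_slots[];                  /* managed heap storage */

                                  void write_with_write_barrier(unsigned slot, vm_ref value) {
                                      /* Dijkstra-style insertion barrier: if the referenced object is
                                       * still unmarked while a collection is in progress, shade it grey
                                       * so the collector will scan it later. */
                                      if (mark_table[value.index] == WHITE)
                                          mark_table[value.index] = GREY;
                                      heap_slots[slot] = value;                       /* the actual store */
                                  }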

                                This virtual machine, plus a set of functions such as “write-with-write-barrier” (and similar functions which encapsulate not just the GC, but also stuff like scheduling), forms an abstraction layer. The goal is to write the implementation of the new language’s semantics on top of this abstraction layer. So there are three layers here: (1) the virtual machine; (2) stuff like “write-with-write-barrier”; (3) the language semantics.

                                This reference implementation will be portable and somewhat easy to read, but it will be very inefficient. If efficiency is desired, additional platform-specific implementations of the new language could be written in any other language – for example, in addition to the reference implementation, there could be another implementation of the new language in C on the GNU/Linux platform. If anyone desired to use the language for “real-world use”, they would probably need one of these more efficient platform-specific implementations. One easy way to get started writing a platform-specific implementation will be to (manually) translate the reference implementation of the virtual machine into the target language/platform (e.g. port layers 1 and 2 to C on Linux), and then incrementally improve it to take advantage of platform-specific mechanisms (eg put mark bits inside pointers, use OS paging mechanisms to trap writes). The plan is for these platform-specific improvements to replace layer (2) but leave layer (3) unchanged. This will make it easier to keep the platform-specific implementation in sync with the reference implementation as the language’s semantics are changed over time.

                                The reference implementation will have terrible performance by design, so why do I care what kind of garbage collection it does? My main concern is ensuring that the reference implementation will be structured to include whatever ‘hooks’ are needed by fancy garbage collection algorithms. So, for example, if I later wanted to make a C implementation of my language and use a garbage collector with a write barrier, I want to be able to do that by only modifying layers 1 and 2. But if the reference implementation didn’t have even the concept of a write barrier, then you’d have to overhaul layer 3 to replace all of the stores to managed memory with write-with-write-barrier. I feel like the best way to ensure I haven’t missed anything is to at least familiarize myself with the algorithms that might be needed later. If I want to be extra careful to make sure I haven’t missed anything, then I should actually use a somewhat fancy garbage collection algorithm in the reference implementation.

                                In my experience, there is a huge value in co-designing your memory management model with your language.

                                I’m still early in the designing of both the language and the memory management model. The language will be a scripting language with similar semantics and use-cases as Python, except with less emphasis on reference types and more emphasis on permitting very many threads of parallel execution – I was intrigued by the parallel (bitonic?) sort example in the book The Connection Machine and I would like to explore algorithms where the number of “threads” is similar to the number of data items being manipulated.

                                1. 2

                                  This virtual machine, plus a set of functions such as “write-with-write-barrier” (and similar functions which encapsulate not just the GC, but also stuff like scheduling), forms an abstraction layer. The goal is to write the implementation of the new language’s semantics on top of this abstraction layer. So there are three layers here: (1) the virtual machine; (2) stuff like “write-with-write-barrier”; (3) the language semantics.

                                  My concern here is that you’re explicitly permitting write-without-write-barrier by having this as a layer on top of the core VM, which can then introduce memory safety bugs and means that your language is not memory safe by construction, it is memory safe if you get a load of things right in multiple loosely coupled layers.

                                  I’m still early in the designing of both the language and the memory management model. The language will be a scripting language with similar semantics and use-cases as Python, except with less emphasis on reference types and more emphasis on permitting very many threads of parallel execution – I was intrigued by the parallel (bitonic?) sort example in the book The Connection Machine and I would like to explore algorithms where the number of “threads” is similar to the number of data items being manipulated.

                                  If you’re designing for concurrency then co-designing the memory management model and the language is even more important. Your GC can easily become a bottleneck. Temporal memory safety is a global property and so requires global synchronisation or some language properties that limit it. In Verona, for example, we allocate objects in regions. A region can have an arbitrary object graph inside but only a single pointer from the outside. We use a linear type for that and so we can deallocate an entire region deterministically when that pointer goes away and we can guarantee no concurrent mutation of objects within a region. This means that we can do local GC of any region, independently of the rest of the system (we can also have different memory-management policies in different regions, such as reference counted, traced, or bump-allocated with no deallocation before the end of the region’s lifetime).

                                  For high thread counts (or, more accurately, for high degrees of concurrency), you’re going to see these problems a lot faster than in a language like Python that is mostly single-threaded with some concurrency bolted on as an afterthought. Again in Verona, we have concurrent owners (cowns) that own the root of a tree of regions and we schedule behaviours that run over a set of cowns. A cown may own a single object and so the maximum degree of concurrency is the number of objects. This required us to aggressively co-design the type system and the memory management model.

                                  The simplest way of doing this is to follow an Erlang-like model where there is no mutable data at all. This gives you some very nice properties in a GC’d heap (all references are to older objects and so you can do a single-pass compacting GC) or lets you do atomic reference counting on strongly connected components. If you have shared mutable state then you need to have a memory model (what happens when two threads simultaneously read and write the same field of a shared object) and you need to carefully consider how this interacts with your memory management policy because any decision here will aggressively constrain the space of possible implementations.

                                  1. 1

                                    Thanks for your helpful description of Verona. It’s an interesting approach, and beyond that also it helps to clarify for me what co-designing the memory management model with a language can look like. I’ll read about Verona further.

                                    My concern here is that you’re explicitly permitting write-without-write-barrier by having this as a layer on top of the core VM, which can then introduce memory safety bugs

                                    Yes, i worry about this too. Some possible mitigations would be: create another intermediate layer in which raw writes are replaced by write-with-write-barrier; or, introduce some sort of automated check; or, just try to be extra careful.

                                    1. 2

                                      I’ll read about Verona further.

                                      Happy to answer any questions.

                                      Yes, i worry about this too. Some possible mitigations would be: create another intermediate layer in which raw writes are replaced by write-with-write-barrier; or, introduce some sort of automated check; or, just try to be extra careful.

                                      The Java approach is to have only the safe operations in the bytecode. This is nice, because it means that you cannot express the dangerous thing, but it made it harder to implement things like MMTk. They ended up with the sun.misc.Unsafe package that lets you bypass all of the rules.

                                      This gets problematic later because any of the invariants that you can guarantee in the language can be used as axioms for defining transforms that are valid for optimisation. If you allow code to break the rules then you can’t rely on the rules holding for optimisation (or you can go the C path and say that breaking the rules is undefined behaviour, but then you don’t have any safety guarantees). In Verona, we say that the language runtime is part of the implementation and so it is allowed to break the rules, but only in ways that the rest of the implementation (specifically, the compiler) knows about. We’ll see if that works.

                          2. 1

                            Regarding a vanilla mark sweep collector, my understanding is that that could have very long pause times?

                            I give a more detailed answer elsewhere, but essentially any non-incremental GC can have long pause times, but “can” have a long pause time is not the same as “will”. JS engines went a long time, and dealt with very large heaps, with stop the world mark/sweep collectors.

                            1. 1

                              Regarding option #2, this post is part of that work.

                              The problem is that there’s really not a one size fits all solution here, and the characteristics and semantics of your language may result in optimizations that work in your situation, but not in others. A great example of this is: One Pass Real-Time Generational Mark-Sweep Garbage Collection, by Armstrong and Virding, that was used in Erlang. Because Erlang is purely functional, this extremely simple collector was viable (it’s since been replaced). A similar algorithm (at least, I think similar, I haven’t read the paper) called “Treadmill” is linked in a sibling thread.

                              Regarding a vanilla mark sweep collector, my understanding is that that could have very long pause times?

                              It’s possible. It just totally depends on how your language executes code, and how memory is utilized in the program, and what the application actually needs to optimize for. If this language doesn’t exist yet, the GC algorithm is the least of your worries, honestly. I’d do the simplest thing and come back to it later.

                              1. 1

                                Thanks, I will take a look at that paper, and I will keep an eye out for any language-specific guarantees that I could make that might help the garbage collector.

                          1. 1

                            For later readers, here’s a summary of things I think I’ve learned from the replies so far (in the first 2 days), and from some further reading/thinking I’ve done in that time:

                            • everyone suggested a tracing garbage collector (GC) of some sort (one suggestion was for reference counting, with a tracing cycle detector done only when pauses can be tolerated)
                            • most commentators feel that my criteria are too stringent, and that it would be better to give up on some of them for now, especially one or more of: portability, concurrency, short pauses, cycle-detection
                              • if portability can be given up, then the Boehm-Demers-Weiser collector ( https://www.hboehm.info/gc/ ) can be used
                              • if the language can get by without mutable data, then cycle-detection isn’t necessary
                              • if the language can get by without sharing garbage-collected memory between threads, then concurrency isn’t necessary
                              • i could also consider giving up only the intersection of some of these criteria; for example, i could have concurrency without cycle-detection, and have cycle-detection on memory which is only visible to one thread, but not detect cycles in shared memory
                            • no one suggested a moving (copying or compacting) garbage collector
                            • incremental tricolor mark-sweep tracing GCs were mentioned multiple times, so i should probably search the literature for those (some literature uses the term “real-time” rather than incremental; it’s probably worth it for me to search the literature for both of those terms although i don’t think they are synonymous)
                            • a list of special things that the implementation may need to do in order to permit garbage collection includes:
                              • produce a list of the locations of any pointers within data, including in:
                                • the registers, stack, globals, anything referenced from external/non-managed code (“the roots”)
                                • the heap
                              • read barriers and write barriers
                              • interrupt execution every now and then for GC
                              • override pointer comparisons (so that eg copying collectors can say that a fromspace pointer and a tospace pointer to the corresponding object are equivalent)
                              • derived pointers maybe shouldn’t be stored, not even in registers, b/c moving GCs may move the base pointer, and hence invalidate all derived pointers.
                                • Instead, could use fat pointers that always store the base pointer next to an offset (see the sketch after this list)
                                  • still can’t do pointer subtraction though
                              • associate GC metadata (like color) with each pointer (can either be right next to, or in, the pointer (pointer tagging); or in an object header; or in a separate table of GC metadata)
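
                            To illustrate the fat-pointer idea above (my own sketch, not from any of the replies; hypothetical names):

                              #include <stddef.h>

                              typedef struct fat_ptr {
                                  void  *base;     /* points at the start of the managed object */
                                  size_t offset;   /* byte offset within that object */
                              } fat_ptr;

                              static inline void *fat_deref(fat_ptr p) {
                                  return (char *)p.base + p.offset;   /* recompute after any GC move */
                              }

                              static inline fat_ptr fat_add(fat_ptr p, size_t n) {
                                  p.offset += n;                      /* derived "pointers" stay valid */
                                  return p;
                              }
                              /* subtraction of two fat pointers into different objects still has no
                                 meaning, which is the limitation noted above */
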
                            1. 1

                              Remember, ref-counting can have arbitrarily long pause times too, when the last reference to a large object graph is cleared and that entire graph gets torn down piece by piece. Some systems tried to work around this with “deferred” ref-counting, where they’d only do limited work during a free and queue the rest for later, but I’ve not heard of this being a problem in anything recent.
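
                              A rough sketch of the deferred idea (my own, hypothetical names): a release that hits zero just queues the object, and a separate step frees a bounded number of objects at a time.

                                #include <stddef.h>
                                #include <stdlib.h>

                                typedef struct obj {
                                    size_t       refcount;
                                    struct obj  *next_dead;      /* links objects awaiting deallocation */
                                    struct obj **children;       /* outgoing references */
                                    size_t       nchildren;
                                } obj;

                                static obj *dead_list;

                                void release(obj *o) {
                                    if (o && --o->refcount == 0) {
                                        o->next_dead = dead_list;    /* O(1): no recursive teardown here */
                                        dead_list = o;
                                    }
                                }

                                void drain_dead(size_t budget) {     /* call when a short pause is OK */
                                    while (dead_list && budget--) {
                                        obj *o = dead_list;
                                        dead_list = o->next_dead;
                                        for (size_t i = 0; i < o->nchildren; i++)
                                            release(o->children[i]); /* children may join the list */
                                        free(o->children);
                                        free(o);
                                    }
                                }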

                              Current thinking about concurrency leads to threads not sharing memory, so some systems like Erlang have a heap per thread. This means your memory manager doesn’t have to be concurrent (win!) and you can instantly free all of a thread’s allocations when it exits (win!) but any heap object passed from one thread to another has to be copied (lose!) The latter problem can be pretty serious, as I recall from when co-workers of mine were analyzing CouchDB’s performance. The Pony language has an interesting way around this by letting one heap “borrow” an object from another heap.

                              1. 1

                                Thanks, yes, for concurrency, I think not sharing memory is probably the way to go for the most part. It would be a nice bonus to also provide a concurrent garbage-collected memory-sharing mechanism in case the programmer really wants to share some memory, but I suppose that would be a lot of trouble for a mechanism that would only get used now and then. I’ll look into Pony’s borrowing, that sounds interesting.

                              1. 7

                                I would start with a stop the world mark and sweep collector, no compacting, copying, or moving, couple it with tagged pointers to avoid churn of numbers, etc. Then when and if you find the stop the world pauses to be too much of a problem start looking at making it more complicated.
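
                                For concreteness, a stop-the-world mark and sweep collector really can be this small (a rough sketch of my own, not any particular engine’s code): every allocation is linked into one list, mark() runs from the roots, then sweep() frees whatever is left unmarked.

                                  #include <stdbool.h>
                                  #include <stddef.h>
                                  #include <stdlib.h>

                                  typedef struct obj {
                                      bool         marked;
                                      struct obj  *all_next;    /* global allocation list */
                                      struct obj **fields;      /* outgoing references */
                                      size_t       nfields;
                                  } obj;

                                  static obj *all_objects;

                                  static void mark(obj *o) {
                                      if (!o || o->marked) return;
                                      o->marked = true;
                                      for (size_t i = 0; i < o->nfields; i++)
                                          mark(o->fields[i]);   /* fine to start with; a worklist avoids
                                                                   deep recursion on long chains */
                                  }

                                  static void sweep(void) {
                                      obj **link = &all_objects;
                                      while (*link) {
                                          obj *o = *link;
                                          if (o->marked) { o->marked = false; link = &o->all_next; }
                                          else           { *link = o->all_next; free(o->fields); free(o); }
                                      }
                                  }

                                  void gc(obj **roots, size_t nroots) {   /* stop the world, then: */
                                      for (size_t i = 0; i < nroots; i++)
                                          mark(roots[i]);
                                      sweep();
                                  }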

                                I am being serious here, that is how the JS engines worked for more than a decade (for SpiderMonkey it could be two), and the GC pauses were essentially unnoticed. This was at a time when all active web pages were running from a single shared heap as well, and prior to the more aggressive NaN-boxing of modern engines, so they were taking more of an allocation hit during numeric-heavy operations.

                                The need for small pause times is almost entirely as a result of modern JS being used to do a lot at a high frame rate, so historically fine pauses of potentially 20ms are no longer tenable.

                                I would ask very simply, why do you need the gc to be concurrent? “because other engines do it” is not a good answer. Making a concurrent GC is vastly more challenging than single threaded, as you are essentially taking on the complexity of incremental and generational GC.

                                portable, concurrent, short pause times > simple implementation > low memory usage > fast

                                As above, why do you want concurrent? The only reason for a concurrent GC is performance: short pause times are related to fast, a simple implementation is counter to concurrent, and low memory usage means ref counting, as GC cost is proportional to how much of the heap is live - if your heap is filled with live objects you end up having to GC more frequently, and each GC takes longer because the time to scan is proportional to how many live objects there are.

                                Why not a vanilla mark-sweep collector? That can have long pause times.

                                This is a problem that is in general way overblown. A long pause time requires a very large heap (pause time is proportional to the live heap); generational GCs partially mitigate that at the cost of much higher memory usage and much more complexity, and they can still have very large pause times when the heap gets large.

                                The only way to guarantee short pause times is to support incremental GC, which can be built with any of the standard mark/sweep, generational, copying, etc GCs because being incremental is independent of the tracing and collection strategy. It is also very hard to implement.
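
                                Concretely, “incremental” mostly means driving tricolor marking from a grey worklist with a per-step budget, roughly like this sketch (hypothetical types; the write barrier mentioned at the end is what keeps it sound while the mutator runs between steps):

                                  #include <stdbool.h>
                                  #include <stddef.h>

                                  typedef struct obj {
                                      enum { WHITE, GREY, BLACK } color;
                                      struct obj **fields;
                                      size_t       nfields;
                                  } obj;

                                  extern obj   *grey_stack[];   /* seeded with the roots, coloured GREY */
                                  extern size_t grey_top;

                                  /* Returns true once marking is finished and sweep may run. */
                                  bool mark_step(size_t budget) {
                                      while (grey_top > 0 && budget-- > 0) {
                                          obj *o = grey_stack[--grey_top];
                                          o->color = BLACK;
                                          for (size_t i = 0; i < o->nfields; i++) {
                                              obj *c = o->fields[i];
                                              if (c && c->color == WHITE) {
                                                  c->color = GREY;
                                                  grey_stack[grey_top++] = c;
                                              }
                                          }
                                      }
                                      return grey_top == 0;
                                  }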

                                Mark/sweep collectors also easily extend to generational collection if you feel that your GC is suffering from scanning large amounts of long lived objects.

                                But again, JS engines used stop-the-world GCs for years; in principle you can make a GC pause for a very large amount of time, but in practice that doesn’t occur very often. Blocking the construction of a GC on the basis of a potential problem, or choosing a complex implementation over a simple one, is imo not a great choice.

                                Assuming a good design, your interface to your GC should not expose the actual implementation, so changing in future should not be more complex than implementing the complicated thing first. The only real problem is write barriers, but that’s a large part of the complexity of every GC other than mark sweep.

                                1. 1

                                  long pause times… a problem that is in general way overblown… if you find the stop the world pauses to be too much of a problem start looking at making it more complicated.

                                  Thank you for your long and detailed answer. It could be true that i am worrying too much about pauses. My worries are probably because in the distant past, I recall hearing about people’s bad experiences with Java GC pause times. Perhaps those experiences were just outliers.

                                  why do you need the gc to be concurrent?

                                  One goal of this language is to be suitable for writing programs with very large numbers of threads executing in parallel.

                                  your interface to your GC should not expose the actual implementation, so changing in future should not be more complex than implementing the complicated thing first

                                  Yes I agree, I am trying to future-proof the interface to the GC. This is one reason why I do not want to put off the implementation of an incremental GC until later, because I don’t want to miss a requirement for this interface (such as read or write barriers) that may be needed for fancy GCs but not for simple ones. Here’s what I have so far, can you think of anything I should add to the following list? The runtime may need to be able to:

                                  • produce a list of the locations of any pointers within data, including in:
                                    • the registers
                                    • the stack
                                    • the heap
                                    • the globals
                                  • impose read barriers and write barriers
                                  • interrupt execution every now and then and execute GC code
                                  1. 1

                                    I recall hearing about people’s bad experiences with Java GC pause times

                                    In my experience long pause times are rarely a cause, they’re an effect of poor algorithmic / data structure choices. The specific GCs that are bundled with OpenJDK are pretty well tuned, and the default (G1) is concurrent and generational.

                                1. 3

                                  I would recommend giving reference counting a try. You don’t need a lot of code, it is conceptually simple to understand and avoids GC pauses unless you have excessively deep chains of garbage data. In practice the performance impact is nearly “real time” (but not really real time). Data can remain uncompacted if you have a single block size (or several pools of differently sized blocks). Circular references will need to have a solution (unless your language is side-effect free), which can be done by tracing all data at “safe” points, where the cost of traversing all data is acceptable.

                                  The one tricky thing is that you have to be very careful to keep track of objects during the execution time of your code, as memory leaks are easy to produce, especially in compiler-generated code. One nice property of RC is that you free memory very early and it is straightforward to attach resource-cleanup operations (“finalizers”). You can implement various schemes to manage total heap size on top of that (another criterion: do you want your system to have a fixed heap or a growing/shrinking one? What about being able to sustain peak loads?)

                                  But in the end every automatic memory management scheme gets complicated when integrated into a non-trivial runtime system - all have their advantages and disadvantages. Mark/Sweep without compaction is probably the simplest possible approach, but you will have pauses. These pauses (depending on heap size) can be so short that they are not noticeable in “normal” applications, but once you have latency issues (e.g. games), they become a real problem, even if short.

                                  1. 1

                                    Thanks, I will consider reference counting, with the addition of a tracing GC for detecting cycles when a pause is acceptable. Several pools of differently-sized blocks sounds like a good idea. To answer your question, I’d like a growing heap. A shrinking heap would be nice but is not essential.

                                  1. 2

                                    In addition to lots of good advice already given, if you decide to implement your own GC, I recommend the micropython GC as a “good enough and simple” reference, and of course the toy one I wrote, which is conservative, incremental, tracing, not compacting, but not concurrent: https://github.com/robey/mwgc . If you want to write the GC in your own language, you’ll probably end up needing to deeply understand the code anyway, which means writing it yourself. :)

                                    1. 1

                                      Thanks, I will check out your GC and the Micropython one. I read https://github.com/robey/mwgc#how-it-works just now, it looks great; it seems simple and meets almost all of my criteria. It seems like the sort of thing that I am after.

                                      Would it be fair to characterize your GC as a conservative, incremental, not compacting tricolor mark-sweep tracing garbage collector with a write barrier?

                                      1. 2

                                        Yes, exactly – with the caveat that the write barrier is required but not implemented. :)

                                    1. 3

                                      I think what would formally tick all the boxes is:

                                      • strict FP language which prevents cycles from arising without extra restrictions on programming model (eg, elm or roc)
                                      • atomic ref-counts
                                      • deferred freeing of objects: when ref count reaches zero, just toss the object onto garbage list and have a background thread to do actual cleaning and recursive deallocation
                                      1. 1

                                        Thanks, in my context I think constraining the programmer to never create cycles is a tall order, but I’ll consider it.

                                      1. 6

                                        The “treadmill” garbage collection technique was posted as a topic right here on lobste.rs not that long ago. I don’t have the original link but here is a brief paper (which may be a bit light on detail, but you should be able to find more): https://dl.acm.org/doi/pdf/10.1145/130854.130862

                                        It may fit the bill for you - seems simple enough, usable in “realtime” scenarios so pause times should be limited (you can control the maximum pause by tweaking the algorithm, bearing in mind that there will always be a trade-off between low pauses and total throughput).

                                        1. 1

                                          Thanks, I read that and it seems simple and meets almost all of my criteria (not sure about concurrent, I’ll have to think about that).

                                          It seems like the key is that there is an always-on read barrier which allows the garbage collector to detect when the mutator accesses something that the garbage collector otherwise might have thought was unused. That sounds like the sort of tradeoff of performance for simplicity that I am after.

                                        1. 44

                                          This article contains a lot of errors, I’m going to just start with the first one:

                                          No, it isn’t. It’s an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all

                                          This is true only in the uncontended case. The problem with reference counting is that it turns reads into writes. On a modern CPU, an atomic in the local cache is very cheap but an atomic in a remote cache line can be expensive and this cost goes up with larger core counts. An immutable (or, not mutated) object can be in every core’s cache simultaneously in shared state but its reference count will ping-pong between all of the cores that are incrementing and decrementing it. If the refcount is in the object header, this will invalidate part of the object in other cores, causing cache misses even after the object had been acquired. If the refcount is elsewhere then you now suffer from locality problems. You can avoid this with deferred reference counting schemes, but then you lose the deterministic deallocation of reference counting.
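
                                          For reference, the operation under discussion looks roughly like this in C11 terms (a generic sketch, not any particular runtime). The instruction itself is cheap; the cost is that the cache line holding the count must be owned exclusively by whichever core runs it:

                                            #include <stdatomic.h>
                                            #include <stdlib.h>

                                            typedef struct object {
                                                atomic_size_t refcount;   /* shares a cache line with the object body */
                                                /* ... object fields ... */
                                            } object;

                                            void retain(object *o) {
                                                /* relaxed is enough: no ordering needed just to keep the object alive */
                                                atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
                                            }

                                            void release(object *o) {
                                                /* release + acquire fence: all writes to the object happen-before free() */
                                                if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_release) == 1) {
                                                    atomic_thread_fence(memory_order_acquire);
                                                    free(o);
                                                }
                                            }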

                                          Both reference counting and tracing have poor performance in certain situations and it’s important to understand them if you choose one or the other. It’s also important to read the unified theory of GC paper that was recently posted here to understand the degree to which most modern automatic memory management schemes are a hybrid of the two.

                                          1. 10

                                            Supposedly the Apple Silicon ARM systems have some secret sauce that makes atomic operations extra fast — someone noticed this in a microbenchmark soon after the M1 came out. Given how reliant Apple systems are on ref-counting, this makes sense.

                                            1. 8

                                              I have no knowledge of how Apple does this, but as someone who works on hardware-language co-design, this is what I’d do:

                                              Apple stores the refcount (on the fast path) in some of the spare bits in the isa pointer in Objective-C. This means that it will be in the same cache line as part of the object. In addition, the semantics of Objective-C and Swift mean that a decref (release) will not be followed by any access to the object (other than, potentially, to deallocate it completely) and an incref (retain) will always be on an object that is not being concurrently freed (if it is, it’s UB and it’s fine to crash). This leads to a possible cache-coherency protocol that can make this fast:

                                              You want to perform an ARMv8.3 atomic compare and exchange on the remote cache line. Rather than doing a broadcast invalidate, move to exclusive, and cmpxchg, you instead send a cmpxchg message to the owning core and require that core’s cache to do the exchange and return the result.

                                              For this to work, you need to have an extra shared-but-owned state in your state machine. When a cache line in shared state is copied to another core, you ensure that one is in shared state and one is in shared-but-owned state. The exact best way to do this depends on your NoC architecture. If you have guaranteed bounded latency for cache-coherency messages then you can do some optimisations here.

                                              Now, when you do the atomic compare and swap, you send a broadcast message[1] to do the update. The lower-layer inclusive cache snoops on this and, if the line isn’t in any L1s, handles it directly, including doing a fetch from memory. If it is handled in the L2/L3 then, as an optimisation, you could also send the line in shared-but-owned state to the L1 of the requesting core: if it’s done a refcount manipulation and hadn’t accessed the object then there’s a good chance that it will again soon.

                                              The weak memory model on Arm lets you do a lot of other memory accesses in parallel with this. In particular, non-atomic reads of the same cache line are not ordered with respect to the atomic write in the absence of another explicit barrier and so you’re free to keep the stale version of the data in the shared-state caches until they receive the broadcast message to update it.

                                              It’s possible that you can do an atomic increment for retain, rather than a compare-and-swap, depending on which bits they use (I think they use the high bits?), which means that a retain then has a (predicted-not-taken) branch-on-overflow following it and you just speculatively execute everything after it and roll back in the very rare case of overflow (at which point you’re acquiring a lock and storing the refcount in a separate table anyway).

                                              Deallocate would also be a slow path, but if you do a little bit of value prediction to feed your branch predictor then you can probably improve this. If you have the released object in your cache line, then you predict the branch to the deallocator based on what a decrement on the local copy’s refcount would be. If two threads drop a refcount from 2 simultaneously, then one will be mispredicted, but otherwise this will always predict correctly. If you race a decref with an incref (UB anyway) then you’ll also see some mispredictions sometimes but that’s UB in the language anyway.

                                              [1] If you have a cache directory or similar then this can actually be a point-to-point message.

                                              1. 1

                                                I don’t quite follow. If some object is owned by another core, presumably that core is using the object, likely twiddling its reference count. So by the time your cas request arrives, there’s a good chance the reference count will have changed already, and you could fail repeatedly.

                                                1. 1

                                                  I think you can use atomic increment/decrement here, which can’t fail. Not sure why the example was cas.

                                                  1. 2

                                                    It’s difficult to use atomic inc/dec here because the reference count bits are stored in only some of the bits of an object and you have to handle overflow / underflow in the presence of concurrent mutation. You might be able to use the almost-top bits in a pointer, so that overflow doesn’t affect the other bits, but then you have some fun corner cases.

                                                    For retain it isn’t too bad. When you hit the overflow path, then you can store a slow-path refcount out of band and use the in-pointer value as the low bits.

                                                    This adds some complexity to release though, because now you need to handle the case that the refcount dropped to 0 but the pointer doesn’t store the whole refcount. Ideally, you’d just call dealloc as soon as this happens (after checking the bit that indicates that there are no weak references) but now you need to acquire a lock to check the external reference count.

                                                    From reading a random blog, it looks as if Apple stores the refcount in the top bits and uses the bit below to indicate that there’s an external reference count. I don’t know how they set that bit atomically with respect to other operations. It’s easy if you use a CAS, because you can just say ‘am I adding to 0xff, if so acquire a lock, increment the external refcount, and set the refcount to 0’, but with an atomic increment you can hit the sequence of:

                                                    1. Thread 1 does a retain and overflows the extra retain count to 0.
                                                    2. Thread 2 does a retain and increments the extra retain count to 1.
                                                    3. Thread 3 does a release and decrements the extra retain count to 0.
4. Thread 2 does a release and underflows the extra retain count, triggering dealloc.
                                                    5. Thread 1 sets the bit in the pointer to indicate that there’s an overflow retain count.

                                                    I might be missing something here, but I don’t know how you can avoid this kind of problem without a CAS on at least one path.
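For what it's worth, here is a rough sketch (in Rust, with an invented layout and helper names, so treat it as an illustration rather than anything Apple actually does) of why the CAS version avoids the race in steps 1-5: the loop notices the full inline field before committing anything and diverts to a locked slow path, so no thread ever publishes the half-finished state.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Layout invented for this sketch (not Apple's actual one): an 8-bit inline
// refcount in the top bits, plus one flag bit below it meaning "there is
// also an out-of-line count".
const RC_SHIFT: u32 = 56;
const RC_ONE: u64 = 1 << RC_SHIFT;
const RC_FULL: u64 = 0xff << RC_SHIFT;
const HAS_EXTRA: u64 = 1 << 55;

pub fn retain(bits: &AtomicU64) {
    let mut old = bits.load(Ordering::Relaxed);
    loop {
        if (old & RC_FULL) == RC_FULL {
            // "Am I adding to 0xff?" -- divert before committing anything.
            return retain_overflow_slow(bits);
        }
        match bits.compare_exchange_weak(
            old,
            old + RC_ONE,
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            Ok(_) => return,
            Err(seen) => old = seen, // lost a race; retry with the value we saw
        }
    }
}

#[cold]
fn retain_overflow_slow(bits: &AtomicU64) {
    // Hypothetical slow path: take the side-table lock, spill the count out
    // of line, then CAS in a reset inline field with HAS_EXTRA set, all while
    // holding the lock. Release must take the same lock whenever it sees
    // HAS_EXTRA, which is the extra complexity mentioned above.
    let _ = (bits, HAS_EXTRA);
}
```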

                                                  2. 1

                                                    If some object is owned by another core, presumably that core is using the object, likely twiddling its reference count

                                                    Hopefully not. The cost of one round-trip between cores is on the order of tens of cycles. If you’re doing only this much work in between retaining and releasing an object then you may be better off using a value type and copying it.

                                                2. 1

Yeah, it is really amazing. I have done some tests and atomic refcounts on my M1 totally blow my x86 machine out of the water. On top of that, my M1 tends to crush my x86 machine on anything memory-intensive due to how much less delay its main memory has. Spec-wise, the x86 machine should be way faster than the M1.

                                                  1. 1

Main memory latency on the M1 is similar to that of standard x86 parts. Probably what you're observing is a combination of great memory bandwidth and a really big ROB, which can do a better job of hiding the latency.

                                                    1. 1

No, it’s not memory bandwidth/speed. It’s specifically faster at uncontended atomic operations because the apple hardware does something fancy - presumably cpu word soup involving “look aside”, “bus”, and other words/phrases I am aware of but don’t know any actual useful specifics about.

The gist of it is that an uncontended atomic operation is very fast, and in the contended case it regresses to somewhere in the realm you would traditionally expect. I would guess it’s something that intel could do, but hasn’t had reason to, whereas apple’s OS APIs are built around objects with atomic refcounts, so presumably it is more valuable for Macs specifically than for other platforms? (https://twitter.com/Catfish_Man/status/1326238434235568128?s=20)

                                                      1. 2

                                                        Oh, for atomic ops, sure. But the parent said:

On top of that, my M1 tends to crush my x86 machine on anything memory-intensive due to how much less delay its main memory has

                                                        which is what I was responding to.

                                                        Additionally:

                                                        I would guess it’s something that intel could do, but hasn’t had reason to

                                                        No need to guess! There is a direct analogue in x86 land.

                                                        x86 mandates tso architecturally, the result of which is that all stores are ‘atomic’ (=strongly ordered), the result of which is that uncontended ‘atomic’ stores are free on intel, since they are just regular stores. On other architectures, it’s frequently the case that regular stores are weakly ordered, so strongly ordered stores are expensive, even when uncontended.

                                                        1. 1

which is what I was responding to

ah, sorry I misunderstood :-O

x86 mandates tso architecturally

oh boy do I know this

                                                          the result of which is that all stores are ‘atomic’ (=strongly ordered), the result of which is that uncontended ‘atomic’ stores are free on intel

That’s not been my experience, but I haven’t really dealt with such on intel hardware more recent than 2019, I think (certainly that’s the vintage of my current laptop) - is this a newer hardware thing? My experience has been that, at least for the types of atomic operations I was using, uncontended operations were significantly slower, and closer to the performance of contended ones. Could this be r/w/m vs store only?

                                                          1. 2

                                                            r/w/m vs store only?

                                                            Yes, that’s what I meant—I guess I was unclear. Uncontended rmw is fast on m1 because it has to be; uncontended store is fast on x86 because it has to be.
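To put the same distinction in code, here is a small Rust sketch; the instruction mappings in the comments are the usual ones, though exact codegen of course depends on compiler and target features.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A release store: on x86 this compiles to a plain `mov`, because TSO
// already makes ordinary stores strongly ordered. (A seq_cst store is the
// exception: that one does need an `xchg` or fence on x86.)
pub fn publish(flag: &AtomicU64, value: u64) {
    flag.store(value, Ordering::Release);
}

// A read-modify-write: `lock xadd` on x86, `ldadd` (or an ldxr/stxr loop)
// on AArch64. This is the operation whose *uncontended* cost the thread
// above is comparing between the M1 and typical x86 parts.
pub fn bump(count: &AtomicU64) -> u64 {
    count.fetch_add(1, Ordering::Relaxed)
}
```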

                                                            1. 1

                                                              oh boy do I know this

                                                              Curious what the context is here. Were you in charge of making an x86 implementation? Or had to port an x86 app elsewhere and fix everything that was not specified strongly enough in the source?

                                                        2. 1

I thought main memory was integrated onto the chip instead of being on a separate RAM stick, which reduced latency. Is that not correct?

                                                          1. 1

That is my understanding - they get to have obscene amounts of bandwidth, and also remove a pile of latency by not needing to deal with the same capacitance and leakage issues as separate chips or, worse, pluggable DIMMs - but perhaps @Moonchild has more accurate information (based on their detailed comments elsewhere)?

                                                            1. 1

                                                              This graph from anandtech seems to show ~110ns main memory latency, which is also about what I would expect on an intel.

                                                              My understanding, though I am not a hardware person, is that the bulk of the latency of dram comes from the storage medium itself, so moving it closer doesn’t help much.

                                                        3. 1

I believe that’s in the uncontended case - I think in the contended case it’s theoretically impossible to not have terrible performance. But the last time I dealt with the theory of concurrency, multiple cores were not a thing :D

                                                        4. 7

Indeed, the poster child for reference counting was Objective-C, in which every class object has a reference count and from which we get ARC. Apple would say it’s more predictably performant than GC, but that didn’t mean it was free. In its successor language Swift, Apple specifically designed the language to reduce reference-count writes (“retain/release traffic”, as they call it) by relying much more on value types and copy-on-write reference object backing stores. Today when you write a new type in Swift, you choose whether it will be the reference-counted kind or the copied-value kind. If you are mindful of performance, retain/release traffic should figure into your decision. My favorite discussion of the tradeoffs is Understanding Swift Performance.
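For a rough analogue outside Swift (Rust here, purely as an illustration of the tradeoff, not of Swift’s implementation; the toy Point type is made up): cloning an atomically refcounted handle is exactly the retain/release traffic being described, while copying a value type touches no shared state.

```rust
use std::sync::Arc;

#[derive(Clone, Copy)]
struct Point {
    x: f64,
    y: f64,
}

fn main() {
    // Reference-counted kind: every clone is an atomic increment on the
    // shared count, and every drop an atomic decrement.
    let shared = Arc::new(Point { x: 1.0, y: 2.0 });
    let alias = Arc::clone(&shared);

    // Copied-value kind: copies touch no shared state at all.
    let value = Point { x: 3.0, y: 4.0 };
    let copy = value;

    println!("{} {} {} {}", shared.x, alias.y, value.x, copy.y);
}
```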

                                                          1. 1

                                                            Yeah, there are costs to GC, but this article just seemed very wrong to me. I just didn’t feel like doing numbers on current hardware to verify before complaining :)

My direct experience with Swift is with a multithreaded ray tracer (writing ray tracers is how I learn languages), and refcounting overhead was about the single largest cost, to the extent that I was having to spend all my time optimizing by using the various unsafe internal functions to circumvent the refcounting. Maybe it’s different without such heavy contention, but that was my experience.

This kind of jibes with my experience trying to make the refcounting in WebKit’s StringImpl atomic; even just that showed up as a measurable (>1%) performance regression across whole-browser benchmarks, not just microbenchmarks.

                                                            But in the end I think that the post reflected a failure to recognize the real perf cost because it’s generally hard to see the death by a thousand cuts nature of it, whereas scanning GCs can have bad cases that can trivially show up as a noticeable pause.

                                                            Alternatively they have experience from the GC in 90/2000s Java and are assuming nothing has improved since.

                                                            1. 2

                                                              The Project Maxwell technical report from Sun Labs had the best description of the tradeoff that I’ve read. Any memory management scheme is a five-dimensional trade-off between:

                                                              • Throughput.
                                                              • Latency.
                                                              • Determinism.
                                                              • Memory overhead.
                                                              • Programmer effort.

If you want to optimise for throughput then a stop-the-world GC often performs best, because there’s no latency or programmer-effort overhead outside of the stop-the-world phase, and the collector doesn’t have to worry about concurrent mutation when it runs and so can be very fast. As an extreme example, they described a stop-the-world GC that ran once per day. iOS uses the highest-throughput policy of all: kill the process and restart. This doesn’t need to do any graph walking; it just reclaims and zeroes all of the pages assigned to a process and lets it allocate again from the start.

                                                              1. 2

Yeah, when working on migrating JSC’s GC away from being stop-the-world, avoiding the overall perf regressions caused by pretty much everything involved in not being stop-the-world required a huge amount of effort.

                                                                1. 1

                                                                  Do you happen to have a link or citation of the Project Maxwell technical report from Sun Labs? Thanks

                                                                  1. 2

                                                                    It used to be on the Sun Labs website, but I think it was removed. I have a copy that someone from there shared with me, but I don’t know if I’m allowed to redistribute it.

The project was one of my favourite hardware/software co-design projects and it’s a shame that internal politics between the Rock and Niagara projects prevented it from ever being merged into SPARC. They observed that a young GC generation is full of recently-accessed short-lived objects and a cache is full of recently-accessed objects, so they did hardware GC in the cache and provided barriers so that objects which escape are tracked and handled by a software GC.

                                                                    1. 1

                                                                      I found https://dl.acm.org/doi/10.1145/1168054.1168056, which can be accessed via a certain raven-clad website. I don’t know if that is ‘the’ technical report; @david_chisnall ?

                                                                      1. 2

                                                                        No, looks like an unrelated Project Maxwell. The one that I’m thinking of was a spiritual successor to Manchester’s Mushroom project in the ’80s, trying to do hardware-assisted garbage collection on modern multicore SPARC hardware.

                                                                        1. 2

                                                                          How about this? It is from sun, and describes a hardware-assisted gc (including the in-cache collection mechanism you mention), and cites a ‘mushroom project’. But it does not seem to include any description of tradeoffs in memory management schemes.

                                                                          1. 2

                                                                            That’s definitely the project, but I think that’s the one paper that they wrote about it. They also had a much longer technical report, which I haven’t seen on the Web anywhere.

                                                              1. 4

                                                                Enjoying the nice weather, and probably continuing Type Checker Rewrite N of N+1.

                                                                1. 1

                                                                  Cool! Type checker for what language?

                                                                  1. 1

                                                                    My “what if Rust was small” language, Garnet. Type checking in general has been an exercise in frustration and “here’s all the things textbooks don’t teach you”, and I haven’t even gotten to borrow checking yet. I thiiiiink I’m getting a handle on it though.

                                                                    1. 2

                                                                      If you ever feel like writing it, it would be cool to read a “here’s all the things textbooks don’t teach you” regarding type checking.

                                                                      1. 1

                                                                        Thanks, but I am really not qualified to write it. XD

                                                                        1. 1

                                                                          That’s why you are the one whose “…the things textbooks don’t teach you” I would like to read. Someone who is more qualified would probably forget to explain basic knowledge that I lack.

                                                                1. 2

                                                                  The goal is to cover the entire RV64 user-level instruction set […] RV32 is currently out of scope, but might be nice in the future.

                                                                  Isn’t RV64 an extension of RV32, with some additional instructions for larger integers? I feel like you’d have most of the RV32I, at least, by virtue of having done RV64 in the first place.

                                                                  Also I am not so familiar with Sourcehut – where is the actual draft guide?

                                                                  1. 2

                                                                    Isn’t RV64 an extension of RV32, with some additional instructions for larger integers? I feel like you’d have most of the RV32I, at least, by virtue of having done RV64 in the first place.

                                                                    At first glance, yes. But at second glance, this is not quite true, or at least while it may be true in practice it is not quite a strict extension in principle. From some of the commentary in the spec, section 1.3:

                                                                    The four base ISAs in RISC-V are treated as distinct base ISAs. A common question is why is there not a single ISA, and in particular, why is RV32I not a strict subset of RV64I?

                                                                    …[reasons]…

A related question is why there is a different encoding for 32-bit adds in RV32I (ADD) and RV64I (ADDW)? The ADDW opcode could be used for 32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the existing design which uses the same opcode ADD for 32-bit adds in RV32I and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in RV64I. This would also be more consistent with the use of the same LW opcode for 32-bit load in both RV32I and RV64I. The very first versions of RISC-V ISA did have a variant of this alternate design, but the RISC-V design was changed to the current choice in January 2011. Our focus was on supporting 32-bit integers in the 64-bit ISA not on providing compatibility with the 32-bit ISA, and the motivation was to remove the asymmetry that arose from having not all opcodes in RV32I have a *W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was perhaps not well-justified and a consequence of designing both ISAs at the same time as opposed to adding one later to sit on top of another, and also from a belief we had to fold platform requirements into the ISA spec which would imply that all the RV32I instructions would have been required in RV64I.

                                                                    Basically, teasing out all the details that may differ is more work than I want to do for now. Plus a spec for RV64 will be 99% valid for RV32 anyway, so I don’t see a great benefit to separating the two. I’m more interested in phone/desktop-class RISC-V stuff than embedded right now, though I confess I may be in the minority there.
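For what it’s worth, here is a toy model (a Rust sketch, not an emulator) of the ADD/ADDW asymmetry the quoted text is describing: in RV64I, ADD is a full 64-bit add, while ADDW adds the low 32 bits and sign-extends the 32-bit result. RV32I’s ADD computes the same low 32 bits as RV64I’s ADDW, but under a different opcode, which is part of why RV32I isn’t a strict subset.

```rust
fn rv64_add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

fn rv64_addw(a: u64, b: u64) -> u64 {
    // Add the low 32 bits, then sign-extend the 32-bit result to 64 bits.
    ((a as u32).wrapping_add(b as u32) as i32) as u64
}

fn main() {
    let (a, b) = (0x7fff_ffffu64, 1u64);
    assert_eq!(rv64_add(a, b), 0x0000_0000_8000_0000);  // no sign extension
    assert_eq!(rv64_addw(a, b), 0xffff_ffff_8000_0000); // sign-extended
}
```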

                                                                    1. 2

                                                                      Also I am not so familiar with Sourcehut – where is the actual draft guide?

                                                                      i’m not sure but i think it’s here: https://hg.sr.ht/~icefox/riscv-reference/browse/instructions.toml

                                                                    1. 2

It saddens me when open source projects/languages move to a proprietary solution for community support. It makes me reluctant to sign up, and I worry that the project not owning the data will become a problem later. I feel similarly about Julia, which has the same approach. Maybe I just don’t understand the advantage of going in that direction. Also those websites do not work without Javascript, and I don’t understand why that is a requirement.

                                                                      1. 10

                                                                        I think this is using the Discourse discussion platform, which is open source: https://github.com/discourse/discourse

                                                                        1. 2

Well, now my statement makes a lot less sense; for some reason I thought this was the same as Disqus. Thank you for pointing it out!

                                                                        2. 6

                                                                          Well, the main community links are to the mailing lists: https://lists.racket-lang.org/

                                                                          Understandably many people prefer Discourse to mailing lists though.

                                                                          Also those websites do not work without Javascript,

                                                                          Discourse instances usually also support a mailing list mode.

                                                                          1. 2

I did not know this either. Although I don’t get why the text content cannot be rendered in a browser without Javascript, it’s nice to know there is at least some alternative.

                                                                        1. 5

                                                                          There is SWEET16 but that was used for actual serious work so maybe it doesn’t meet your criteria.

                                                                          There are plenty of One Instruction Set Computers to choose from. Subleq seems to be the design most often discussed.

I wanted to link to some documentation of The Pinky Processor but all I could find was a post that briefly mentions it. As far as I remember, the interesting thing about the design was that its addressing granularity was not based on bytes, like we’re used to, but on bits. You could specify any bit in memory with a single address and there were no alignment requirements, so you could put a 4-bit field followed immediately by a 13-bit field, followed immediately by a 5-bit field, and so on. If anybody can find the original specification document I’d love to see it.

                                                                          Edited to add: The Gray-1: a computer made from just memory chips.

                                                                          1. 2

                                                                            Each and every link has brought a smile; I especially like the Gray-1 computer.

                                                                            1. 1

                                                                              Here’s some Pinky Processor stuff: https://github.com/tslug/pinky/

                                                                            1. 1

                                                                              I think one thing that makes Perl hard to read is Perl’s idea of list/scalar context, which I believe is a form of ad-hoc polymorphism. jbert posted this link introducing scalar/list/void context in Perl: https://www.perlmonks.org/?node_id=738558

                                                                              The article makes it sound like sigils are just clarifying what type things are, and wouldn’t that improve readability? But because of context, sigils in Perl aren’t just like type annotations that help clarify and double-check what’s already going on, rather they are actually choosing which of the overloaded meanings of the annotated expression is being run.

                                                                              Some other programming languages have ad-hoc polymorphism based on how the value of an expression is later used, as well. In my opinion this can be powerful but can make things less readable; you can no longer understand code by “single-stepping” with your eyes from the first thing computed to the last.
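As one concrete illustration from another language (Rust, not Perl, shown purely as an analogy): the very same expression can build different structures depending only on the type the surrounding code expects, which is a similar kind of context-driven polymorphism.

```rust
use std::collections::HashSet;

fn main() {
    let text = "a, b, a, c";

    // The same expression, two different behaviours, chosen only by the
    // type the result is used as.
    let keep_dupes: Vec<&str> = text.split(',').map(str::trim).collect();
    let dedup: HashSet<&str> = text.split(',').map(str::trim).collect();

    println!("{:?}", keep_dupes); // ["a", "b", "a", "c"]
    println!("{:?}", dedup);      // {"a", "b", "c"}, in some order
}
```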

                                                                              1. 1

                                                                                But because of context, sigils in Perl aren’t just like type annotations that help clarify and double-check what’s already going on, rather they are actually choosing which of the overloaded meanings of the annotated expression is being run.

                                                                                I like that description. They don’t describe the variable at all, since you can change sigils on foo at will. They’re closer to casting an expression.

                                                                              1. 3

                                                                                More papers, and comments on which ones to read, at https://www.cs.tufts.edu/~nr/c--/papers.html

                                                                                I believe this is the same as the ‘Papers’ section of the archived C– website that adamo pointed to.

                                                                                1. 17

                                                                                  I think this mini-renaissance of Ada awareness is in part due to the “mass discovery” of safe systems programming led by Rust.

                                                                                  I like everything I’ve read about Ada, and I think SPARK is its killer feature. That being said, in the realm of “safe systems programming” I don’t think Ada’s going to be able to withstand the onslaught of Rust – Rust has the modern tooling, the ecosystem, and all the momentum.

                                                                                  (This pains me to say because I’m not the biggest Rust fan and I really like what I’ve seen of Ada, but if I’m going to learn one strict language, it seems like the best one career-wise would be Rust. I’d love to be proven wrong, though; the world needs to ring with the keyboards of a thousand languages.)

                                                                                  1. 11

Ada’s been flying under the radar doing its own thing, until AdaCore started open sourcing things and pushing awareness a few years ago. GNAT is actually a mature ecosystem (gpr, gnat studio, gnatdoc, gnattest, gnatpretty, etc.) and has libraries which go back decades; it’s just not well known. The issue is that there had been no unifying system for distributing libraries before, which has hampered things, but Alire is fixing that now and also streamlining working in the language as well.

                                                                                    Rust is doing well for good reason. It has an easy way to start, build, test and share things, which is exactly why Alire is based on the work that community (and the npm community) has done. The big philosophical difference I’ve found between writing Rust and Ada is that Rust focuses on traits/types, whereas Ada focuses on functions since it relies predominantly on function overloading and separation into packages (modules), though you can use ML-style signatures. It’s interesting because in Ada, in general (except for tasks/protected objects), types aren’t namespaces for functions, whereas Rust relies on this with traits. In this way, Ada feels more to me like a functional language like Haskell and Rust feels more like an OOP language like Java. I’ve always found this bizarre because it should be the opposite.

Ada’s big advantage is how its general conceptual models are much more similar to C or C++ (think type-safe C with function overriding, with classes if you want them), but it builds in a much stronger type system and concurrency types as well. Built-in pre/postconditions, invariants, and type range checks also go a long way towards correctness, even if you don’t want to fully commit to formal verification with SPARK. Note also that SPARK isn’t an all-or-nothing thing with Ada code either; you can verify individual components.

                                                                                    1. 7

                                                                                      While I’m generally a huge user of Rust and a proponent of it for many CPU-bound workloads, Rust really doesn’t have much in the way of safety features except when you compare it strictly to C and C++. Rust has a lot of antipatterns that the ecosystem has broadly adopted around error handling and error-prone techniques in general, and the overall complexity of the language prevents a lot of subtle issues from being cheap to find in code review. Building databases in Rust is not all that different from building them in C++, except there is a bit narrower of a search scope when I discover memory corruption bugs during development.

Ada (and much more so, SPARK) is a completely different tool for a completely different class of problems. The proof functionality of why3 that you gain access to in SPARK has no usable analogue in Rust. The ability to turn off broad swaths of language features is one of the most powerful tools for reducing the cost to build confidence in implementations for safety critical domains. There are so many Rust features I personally ban while working on correctness-critical projects (async, most usage of Drop, lots of logical patterns that have nothing to do with memory safety directly, many patterns for working with files that must be excruciatingly modeled and tested to gain confidence, etc.), but with Ada it’s so much easier to just flip off a bunch of problematic features using profiles.

Ada SPARK faces zero competition from Rust for real safety-critical work. It is simply so much cheaper to build confidence in implementations than it is with Rust. I love Rust for building things that run on commodity servers, but if I were building avionics etc. I think it would be a poor business choice TBH.

                                                                                      1. 3

                                                                                        The ability to turn off broad swaths of language features is one of the most powerful tools for reducing the cost to build confidence in implementations for safety critical domains

                                                                                        This is something I never thought about, but yeah, Ada is the only language I’ve even heard of that lets you globally disable features.

                                                                                        1. 2

                                                                                          antipatterns that the ecosystem has broadly adopted around error handling and error-prone techniques […]

                                                                                          What kind of antipatterns are you thinking of? A lot of people appreciate the Result type and the use of values instead of hidden control flow for error handling.

                                                                                          1. 3

                                                                                            Try backed by Into is far, far worse than Java’s checked exceptions in terms of modeling error handling responsibilities. I’ve written about its risks here. Async increases bugs significantly while also increasing compile times and degrading performance to unacceptable degrees that make me basically never want to rely on broad swaths of essentially untested networking libraries in the ecosystem. It makes all of the metrics that matter to me worse. There are a lot of things that I don’t have the energy to write about anymore, I just use my own stuff that I trust and increasingly avoid relying on anything other than the compiler and subsets of the standard library.

                                                                                            1. 1

I think your concerns are valid, but it’s basically like saying “don’t write int foo() throws Exception” in Java, which would be the equivalent.

                                                                                              There is a strong tendency in the Rust community to throw all of our errors into a single global error enum.

citation needed there, I guess? Each library does tend to have its own error type, which means you’re going to have to compose it somewhere, but it’s not like everyone returns Box<dyn Error>, which would be the single global error type (although that’s what anyhow is for, in applications). The right balance is that functions return a very precise error type, but that there are Into implementations into a global foo::Error (for library foo), so that if you don’t care about which errors foo operations raise, you can abstract that. If you do care, then don’t upcast.
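A sketch of that balance (all names here are made up for illustration): each operation returns a precise error type, and the library also provides From impls into a coarser crate-level Error, so callers who don’t care can still use ?.

```rust
use std::io;

#[derive(Debug)]
pub enum ParseError {
    UnexpectedEof,
    BadHeader(String),
}

#[derive(Debug)]
pub enum Error {
    Io(io::Error),
    Parse(ParseError),
}

impl From<io::Error> for Error {
    fn from(e: io::Error) -> Self {
        Error::Io(e)
    }
}

impl From<ParseError> for Error {
    fn from(e: ParseError) -> Self {
        Error::Parse(e)
    }
}

// A precise signature: callers see exactly what can go wrong here.
pub fn parse_header(bytes: &[u8]) -> Result<u32, ParseError> {
    bytes
        .get(..4)
        .ok_or(ParseError::UnexpectedEof)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

// A coarse convenience wrapper: `?` upcasts both error types via `From`.
pub fn load(path: &str) -> Result<u32, Error> {
    let bytes = std::fs::read(path)?; // io::Error -> Error
    Ok(parse_header(&bytes)?)         // ParseError -> Error
}
```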

                                                                                              I think async is overused, but claiming it degrades performance is again [citation needed]?

                                                                                          2. 1

                                                                                            antipatterns that the ecosystem has broadly adopted around error handling and error-prone techniques in general

                                                                                            I’m interested to hear some of these problematic antipatterns around error handling and error-prone techniques in general?

                                                                                          3. 3

                                                                                            You’re right about the tooling, ecosystem and momentum, but Rust is a very long way from provability. I wonder if we’ll ever see something SPARK-like but for a subset of Rust.

                                                                                            1. 4

                                                                                              The question is how much provability matters. My guess is not a whole lot in most cases.

                                                                                              I feel like Rust’s momentum is largely a combination of, “C and C++ have too many footguns,” “Mutable shared state is hard to get right in a highly concurrent program,” and, “Everyone is talking about this thing, may as well use it for myself.”

                                                                                              1. 5

                                                                                                Provability is starting to matter more and more. There are very big companies getting involved in that space - Nvidia has started using Spark, Microsoft has been funding work on things like Dafny and F* for a long time, Amazon uses TLA+… Provability is just getting started :).

                                                                                                1. 3

                                                                                                  I think provability definitely has some applications, largely in the space that Ada is used for currently – safety-critical systems mostly. I want to be able to prove that the software in between my car’s brake pedal and my actual brakes is correct, for example.

                                                                                                  1. 18

                                                                                                    I want to be able to prove that the software in between my car’s brake pedal and my actual brakes is correct, for example.

                                                                                                    I think we’ve shown pretty conclusively you can’t solve the car halting problem.

                                                                                                    1. 17

                                                                                                      I want to be able to prove that the software in between my car’s brake pedal and my actual brakes is correct, for example.

If you’re going to use 11,253 read/write global variables (yes, eleven thousand, not a typo) in your car software, with very limited and broken failsafes against bit flips and other random faults, while intentionally ignoring OS errors, then you’re going to run into problems no matter which tool you use.

                                                                                                      Just running codesonar on the Toyota software resulted in:

                                                                                                      • 2272 - global variable declared with different types
                                                                                                      • 333 - cast alters value
                                                                                                      • 99 - condition contains side-effect
                                                                                                      • 64 - multiple declaration of a global
                                                                                                      • 22 - uninitialized variable

                                                                                                      The throttle function was 1300 lines of code. With no clear tests.

These issues are only partly technical problems IMO; something like Ada can help with those, but the far bigger problem isn’t technical. People chose to write software like this. Toyota chose not to take the reports seriously. They chose to lie about all sorts of things. They chose not to fix any of this. The US gov’t chose not to have any safety standards at all. Regulators chose not to take action until there really wasn’t any other option.

                                                                                                      And then it took a four-year investigation and an entire team of experts to independently verify the code before action finally got taken. You didn’t need any investigation for any of this. You just needed a single software developer with a grain of sense and responsibility to look at this for a few days to come to the conclusion that it was a horribly unacceptable mess.

                                                                                                      (And yes, you do need a proper investigation for legal accountability and burden of proof, and more in-depth analysis of what exactly needs fixing, but the problems were blatantly obvious.)

                                                                                                      I’d wager all of this would have failed with Ada as well. There weren’t even any specifications for large parts of the code (and other specifications didn’t match the code), there wasn’t even a bug tracker. If you’re going to cowboy code this kind of software then you’re asking for trouble. Do you think that Ada, TLA+, or SPARK would have fared better in this kind of culture? At the end of the day a tool is only as good as your willingness to use it correctly. Software people tend to think in terms of software to solve these problems. But if you look a bit beyond all the software bugs and WTFs, the core problem wasn’t really with the language they used as such. They knew it was faulty already.

                                                                                                      Does provability matter? I guess? But only if you’re willing to invest in it, and it seems abundantly clear that willingness wasn’t there; they willingly wrote bad software. It’s like trying to solve the VW emission scandal by using a safer programming language or TLA+. That’s not going to stop people from writing software to cheat the regulatory tests.

                                                                                                      It’s kind of amazing how few faults there were, because in spite of the many problems it did work, mostly, and with some extra care it probably would have always worked. More guarantees are always good, but to me this seems to indicate that maybe provability isn’t necessarily that critical (although it can probably replace parts of the current verification tools/framework, but I’m not so sure if it has a lot of additional value beyond economics).

                                                                                                      Summary of the reports: one, two.

                                                                                                      1. 1

                                                                                                        Does provability matter? I guess?

                                                                                                        Great, I’m glad we agree.

                                                                                                        More seriously: sure, culture is a problem; I certainly wasn’t trying to discount it.

                                                                                                        But this is an AND situation: to have proved-correct software you need a culture that values it AND the technology to support it. Writing off provability as a technology because the culture isn’t there yet is nonsensical. It actively undermines trying to get to the desirable state of proved-correct safety-critical software.

                                                                                                        I appreciate you wanted to have a rant about software culture, and believe me, I love to get angry about it too. You’re just aiming at something other than what I said.

                                                                                                        1. 1

                                                                                                          I appreciate you wanted to have a rant about software culture, and believe me, I love to get angry about it too. You’re just aiming at something other than what I said.

                                                                                                          It wasn’t intended as a rant, it was to demonstrate that at Toyota things were really bad and that this is why it failed at Toyota specifically. This wasn’t about “software culture”, it was about Toyota.

                                                                                                          Writing off provability as a technology because the culture isn’t there yet is nonsensical. It actively undermines trying to get to the desirable state of proved-correct safety-critical software.

Of course you need to consider the actual reasons it failed before considering what the solution might be. Overall, the existing systems and procedures work pretty well, when followed. When was the last time your accelerator got stuck due to a software bug? It’s when you choose not to follow them that things go wrong. If Toyota had just followed their own procedures we wouldn’t be talking about this now.

                                                                                                          Great, I’m glad we agree.

                                                                                                          I don’t think we agree at all, as I don’t think that provability will increase the average quality of this kind of software by all that much, if at all. The existing track record for this kind of software is already fairly good and when problems do arise it’s rarely because of a failure in the existing procedures as such, it’s because people decided to not follow them. Replacing the procedures with something else isn’t going to help much with that.

It may replace existing test procedures and the like with something more efficient and less time-consuming, but that’s not quite the same.

                                                                                                          And the Toyota problem wasn’t even necessarily something that would have been caught, as it was essentially a hardware problem inadequately handled by software. You can’t prove the correctness of something you never implemented.

                                                                                                          1. 1

                                                                                                            Proving software correct is categorically better verification than testing it with examples. If we can achieve provably-correct software, we should not throw that away just because people are not even verifying with example testing.

                                                                                                            Programming languages that are provable increase the ceiling of potential verification processes. Whether a particular company or industry chooses to meet that ceiling is an entirely different discussion.

                                                                                                2. 3

                                                                                                  Have you run across any good SPARK resources? I’m interested in learning not only about how to use it, but also the theory behind it and how it’s implemented.

                                                                                                  1. 4

                                                                                                    https://learn.adacore.com/ should be a good starting point to learn Spark. If you’re interested in how it’s implemented, reading https://www.adacore.com/papers might be a good way to learn.

                                                                                                1. 29

I agree with the points raised in the article. One thing to add is that many tools built in academia are there to support the paper, not to take on a life of their own. Given the short life of a thesis, there is not much one could do to make these tools gain much traction (all the examples in the article show this).

There is another weird thing in our industry - I’ve seen many companies embracing the latest shiny tools and frameworks (k8s, react, containers, …), yet when it comes to thinking about and implementing ways to speed up developer work by improving build times or reorganizing dependencies, that is always a task for the intern, the least suitable person to handle such a thing.

                                                                                                  1. 10

yet when it comes to thinking about and implementing ways to speed up developer work by improving build times or reorganizing dependencies, that is always a task for the intern, the least suitable person to handle such a thing.

That has not been my experience. I’ve been tasked with that sort of work many times, and given permission to pursue it many more. I’m currently a senior UI engineer at Microsoft, but previously worked at a 500-person EdTech company for several years, and had a brief but enjoyable stint at Todoist. All three were very happy to give that work to an experienced developer.

                                                                                                    Have I had managers who were averse to that? Yes. But it has been the exception.

                                                                                                    1. 3

                                                                                                      Agreed with this. My experience at both large and small tech firms has been that when a senior person wanted to work on developer productivity, it was almost always welcomed. Maybe it’s different at companies whose product is something other than what the developers are building, though.

                                                                                                    2. 7

The problem in academia is that it is controlled by where the professor can get money from. There is a lot more grant money for making shiny new tools than for maintaining and adapting old tools, and your academic life depends really strongly on how much grant money you can bring in (it determines how many students you can support, how many conferences you can send your students to, and what equipment you can give your students). (New papers are also indirectly about grant money: you need new papers to bring in grant money, and you can’t write new papers simply by maintaining or updating existing tools.)

                                                                                                      1. 3

                                                                                                        I agree, and it’s also about promotion criteria; it would be hard to get tenure if all you ever worked on was maintaining old tools. I think the solution is to create a new tenure-track academic role which is evaluated, promoted, and funded based on the maintenance and creation of useful open-source artifacts (rather than the traditional goal of the discovery of novel knowledge).

                                                                                                    1. 6

                                                                                                      Another solution to this problem would be to create a new kind of academic role whose function is to create and maintain open-source tools/content. Economically, the academic system is fundamentally well-suited to solve any societal tragedy-of-the-commons regarding intellectual public goods. However, the current system is more narrowly focused on the goal of discovering novel knowledge, which is only one special case of intellectual public goods.

                                                                                                      We should create a role of “open source professor” whose job would be to create and/or maintain open artifacts which are useful to society, rather than to make new discoveries. This would have applications beyond just developer tools, too.