1. 47

  2. 19

    They do mention it in passing, but I really can’t help but feel that the approach outlined here is probably not the best option in most cases. If you are measuring your memory budget in megabytes, you should probably just not use a garbage collected language.

    1. 19

      All of the memory saved with this linker work had nothing to do with garbage collection.

      1. 7

        Sure, but that’s tangential to my point. In a GC’d language, doing almost anything will generate garbage. Calling standard library functions will generate garbage. This makes it difficult to have really tight control of your memory usage. If you were to use, for example, C++ (or Rust, if you want to be trendy), you could carefully preallocate pretty much everything and have no dynamic allocation at runtime (or very little, carefully bounded, depending on your problem and constraints). This would be (for my skillset, at least) a much easier way to keep memory usage down. They do mention they have a lot of Go internals expertise, so maybe the tradeoff is different for them, but that seems like an uncommon scenario.
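
        The “calling standard library functions will generate garbage” point is easy to demonstrate with Go’s own testing package; a small sketch (my example, not from the thread):

        ```go
        package main

        import (
            "fmt"
            "testing"
        )

        func main() {
            // fmt.Sprintf heap-allocates its result string on every call,
            // so even this trivial formatting produces garbage for the GC.
            allocs := testing.AllocsPerRun(1000, func() {
                _ = fmt.Sprintf("%d", 42)
            })
            fmt.Println(allocs > 0) // prints "true"
        }
        ```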

        1. 1

          I wouldn’t say that, because it’s likely that they wouldn’t have been short on memory to begin with if they hadn’t used a GC language. (And yes, I’m familiar with the pros and cons of GC; I’m writing a concurrent compacting GC right now for work.)

        2. 2

          Only maybe. Without a GC, long-running processes can end up with really fragmented memory. With a GC you can compact and not waste address space on dead objects.

          1. 18

            If you’re really counting megs, perhaps the better option is to forgo dynamic heap allocations entirely, like an embedded system does.
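
            The embedded style described here can even be imitated in Go itself (a hypothetical sketch, not anything from the article): reserve a fixed array up front and hand out slots through an index-based free list, so steady-state operation never touches the heap:

            ```go
            package main

            import "fmt"

            // A fixed-capacity pool in the embedded style: all storage is
            // reserved up front; Get/Put never allocate at runtime.

            type Conn struct {
                ID   int
                next int32 // free-list link by index, not pointer
            }

            const maxConns = 1024

            var (
                pool     [maxConns]Conn
                freeHead int32
            )

            func init() {
                for i := range pool {
                    pool[i].next = int32(i + 1)
                }
                pool[maxConns-1].next = -1
                freeHead = 0
            }

            // Get returns the index of a free Conn, or -1 if the pool is exhausted.
            func Get() int32 {
                if freeHead < 0 {
                    return -1
                }
                i := freeHead
                freeHead = pool[i].next
                return i
            }

            // Put returns a Conn to the pool.
            func Put(i int32) {
                pool[i].next = freeHead
                freeHead = i
            }

            func main() {
                a := Get()
                b := Get()
                fmt.Println(a, b) // prints "0 1"
                Put(a)
                fmt.Println(Get()) // prints "0": the slot is reused
            }
            ```

            Exhaustion becomes an explicit -1 result rather than an out-of-memory kill, which is exactly the tradeoff embedded systems make.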

            1. 4

              Technically yes. But they probably did it this way to deploy one code base everywhere, instead of rewriting only the iOS part.

              1. 2

                Exactly this. You can try to do this in a GC’d language, and even make some progress, but you will be fighting the language.

                1. -2

                  You should probably write it all in assembly language too.

                  1. 7

                    I feel like you’re being sarcastic, but making most of the app avoid dynamic allocation is not a crazy or extreme idea. It’s not super common in phone apps, and the system API itself may force some allocations, but doing 90+% of the work in statically allocated memory and indexed arenas is a valid path here.

                    Of course that would require a different language than Go, which they have good reasons not to do.
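
                    As a sketch of the indexed-arenas idea (illustrative names, nobody’s real code): nodes live in one preallocated slice and link to each other by int32 index instead of pointer, so the structure can be sized up front and the arena, being pointer-free, is also cheap for a GC to skip:

                    ```go
                    package main

                    import "fmt"

                    type Node struct {
                        Value int32
                        Left  int32 // index into the arena; -1 means nil
                        Right int32
                    }

                    type Arena struct {
                        nodes []Node
                    }

                    func NewArena(cap int) *Arena {
                        return &Arena{nodes: make([]Node, 0, cap)}
                    }

                    // New appends a node and returns its index.
                    func (a *Arena) New(v int32) int32 {
                        a.nodes = append(a.nodes, Node{Value: v, Left: -1, Right: -1})
                        return int32(len(a.nodes) - 1)
                    }

                    // At returns a transient pointer to a node; don't hold it
                    // across calls to New, which may grow the slice.
                    func (a *Arena) At(i int32) *Node { return &a.nodes[i] }

                    func main() {
                        a := NewArena(1024)
                        root := a.New(10)
                        left := a.New(5)
                        a.At(root).Left = left
                        fmt.Println(a.At(a.At(root).Left).Value) // prints "5"
                    }
                    ```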

                    1. 1

                      I’m being sarcastic. But one of the issues identified in the article is that different tailnets have different sizes and topologies. They rejected the idea of limiting the size of networks that would work with iOS, which is what they’d need to do if they wanted to do everything statically allocated.

                      1. 2

                        they rejected the idea of limiting the size of networks

                        They’re already limited. They can’t use more than the allowed memory, so the difference is whether the app tells you that you’ve reached the limit or gets silently killed.

                        I believe that fragment was about “how another team would solve it, keeping other things the same” (i.e. keeping Go). Preallocation/arenas requires moving away from Go, so it would give them more possible connections, not fewer.

                2. 10

                  That is absolutely not my experience with garbage collectors.

                  Few are compacting/moving, and even fewer are designed to operate well in low-memory environments[1]. Go’s collector is neither.

                  On the other hand, it is usually trivial to avoid wasting address space in languages without garbage collectors, and an application-specific memory management scheme typically gives a 2-20x performance boost in a busy application. I would think that absolutely worth the limitations in an application like this.
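
                  A tiny illustration of that point in Go terms (my sketch): sizing a buffer for the workload up front, the most basic application-specific scheme, already removes most allocator traffic compared with letting the runtime grow it:

                  ```go
                  package main

                  import (
                      "fmt"
                      "testing"
                  )

                  func main() {
                      // Growing a slice from nil re-allocates repeatedly as it doubles.
                      grow := testing.AllocsPerRun(100, func() {
                          var s []int
                          for i := 0; i < 1000; i++ {
                              s = append(s, i)
                          }
                      })
                      // Preallocating the known capacity allocates at most once.
                      prealloc := testing.AllocsPerRun(100, func() {
                          s := make([]int, 0, 1000)
                          for i := 0; i < 1000; i++ {
                              s = append(s, i)
                          }
                      })
                      fmt.Println(grow > prealloc) // prints "true"
                  }
                  ```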

                  [1]: not that I think 15MB is terribly low-memory. If you can syscall 500 times a second, that equates to about 2.5GB/sec transfer filling the whole thing - a speed which far exceeds the current (and likely next two) generations of iOS devices.

                  1. 4

                    To back up what you’re saying, this presentation on the future direction the Golang team is aiming to take is worth reading. https://go.dev/blog/ismmkeynote

                    At the end of that presentation there’s some tea-leaf reading about the direction hardware development is likely to take. Golang’s designers are betting on DRAM capacity improving faster than bandwidth in future, and MUCH faster than latency.

                    Based on their predictions about what hardware will look like in future, they’re deliberately trading off higher total RAM usage in order to get good throughput and very low pause times (and they expect to move further in that direction in future).

                    One nitpick:

                    Few are compacting/moving,

                    Unless my memory is wildly wrong, Haskell’s generation 1 collector is copying, and I’m led to understand it’s pretty common for the youngest generation in a generational GC to be copying (which implies compaction) even if the later ones aren’t.

                    I believe historically a lot of functional programming languages have tended to have copying GCs.

                    1. 2

                      At the end of that presentation there’s some tea-leaf reading about the direction hardware development is likely to take. Golang’s designers are betting on DRAM capacity improving faster than bandwidth in future, and MUCH faster than latency.

                      Given the unprecedented semiconductor shortages, as well as crypto’s market influence slowly spreading out of the GPU space, that seems a risky bet to me.

                      1. 1

                        That’s the short term, but it’s not super relevant either way. They’re betting on the ratios between these quantities changing, not on the exact rate at which they change. If overall price goes down slower than desired, that doesn’t really have any bearing.

                    2. 1

                      Aren’t most GCs compacting and moving?

                      The first multi-user system I used heavily was a SunOS 4.1.3 system with 16MB of RAM. It was responsive with a dozen users so long as they weren’t all running Emacs. Emacs, written in a garbage-collected, interpreted language, would have run well on a much smaller system if there was only one user.

                      The first OS I worked on ran in 16MB of RAM and ran a Java VM and that worked well.

                    3. 1

                      Any non-moving allocator is vulnerable to fragmentation from adversarial workloads (see Robson bounds), but modern size-class slab allocators (“segregated storage” in the classical allocation literature) typically keep fragmentation quite minimal on real-world workloads. (But see a fascinating alternative to compaction for libc-compatible allocators: https://github.com/plasma-umass/Mesh.)
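
                      To make “segregated storage” concrete, here is a toy size-class allocator (hypothetical and greatly simplified): each class bump-allocates from its own slab and recycles freed blocks through a per-class free list, so blocks of different sizes never interleave and external fragmentation stays bounded:

                      ```go
                      package main

                      import "fmt"

                      type class struct {
                          size int
                          slab []byte
                          free []int // offsets of freed blocks, reused LIFO
                          brk  int   // bump offset for never-used space
                      }

                      type Allocator struct{ classes []*class }

                      func New() *Allocator {
                          return &Allocator{classes: []*class{
                              {size: 16, slab: make([]byte, 1<<10)},
                              {size: 64, slab: make([]byte, 1<<10)},
                              {size: 256, slab: make([]byte, 1<<10)},
                          }}
                      }

                      // Alloc rounds n up to the nearest class and returns
                      // (class index, offset); (-1, -1) if n fits no class.
                      func (a *Allocator) Alloc(n int) (ci, off int) {
                          for i, c := range a.classes {
                              if n <= c.size {
                                  if k := len(c.free); k > 0 {
                                      off = c.free[k-1]
                                      c.free = c.free[:k-1]
                                      return i, off
                                  }
                                  off = c.brk
                                  c.brk += c.size
                                  return i, off
                              }
                          }
                          return -1, -1
                      }

                      func (a *Allocator) Free(ci, off int) {
                          c := a.classes[ci]
                          c.free = append(c.free, off)
                      }

                      func main() {
                          a := New()
                          c1, o1 := a.Alloc(10) // lands in the 16-byte class
                          fmt.Println(c1, o1)   // prints "0 0"
                          a.Free(c1, o1)
                          c2, o2 := a.Alloc(12) // reuses the freed 16-byte block
                          fmt.Println(c2, o2)   // prints "0 0"
                      }
                      ```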

                    4. 1

                      This does strike me as a place where refcounting might be a better approach, if you’re going to have any dynamic memory at all.

                      1. 1

                        With ref-counting you have problems with cycles and memory fragmentation. The short-term memory consumption is typically lower with ref-counting than with a compacting GC, but there are many more opportunities to leak and grow over time. For a long-running process I’m skeptical that ref-counting is a sound choice.

                        1. 1

                          Right. I was thinking that for this kind of problem with sharply limited space available you’d avoid the cycles problem by defining your structs so there’s no void* and the types form a DAG.
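
                          A minimal sketch of that idea (invented layout, just to illustrate): if child links only ever point “down” the DAG, no cycle can form, so plain reference counts reclaim everything:

                          ```go
                          package main

                          import "fmt"

                          type Node struct {
                              refs     int
                              children []int32 // indices of child nodes; links form a DAG
                          }

                          var table = make([]Node, 0, 64)
                          var freed []int32 // slots available for reuse

                          // New creates a node holding one reference to each child,
                          // and returns its index with one reference owned by the caller.
                          func New(children ...int32) int32 {
                              for _, c := range children {
                                  table[c].refs++
                              }
                              table = append(table, Node{refs: 1, children: children})
                              return int32(len(table) - 1)
                          }

                          // Release drops one reference; at zero, the node's slot is
                              // reclaimed and its child references dropped transitively.
                          func Release(i int32) {
                              table[i].refs--
                              if table[i].refs == 0 {
                                  for _, c := range table[i].children {
                                      Release(c)
                                  }
                                  freed = append(freed, i)
                              }
                          }

                          func main() {
                              leaf := New()
                              root := New(leaf) // leaf now has refs == 2
                              Release(leaf)     // caller drops its own reference
                              Release(root)     // frees root and, transitively, leaf
                              fmt.Println(len(freed)) // prints "2"
                          }
                          ```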

                      2. 1

                        Edit: reverting unfriendly comment of dubious value.

                      3. 8

                        I love a highly detailed blog article that solves a deep technical problem, and then ends with the constraints being removed!

                        1. 4

                          Hey Tailscale! Thanks for improving Go for the rest of us too. :)

                          1. 1

                            While we were busy fixing the linker to save 1MB, iOS 15 launched and quietly gave us 35MB more.

                            My original reaction was that Apple should have stood their ground and not caved into any pressure they might have been getting to relax the constraint. After all, bloat is bad, right? But then, I know from the problems with the original 128K Macintosh, as described in Insanely Great, that sometimes an arbitrarily chosen constraint really can be too severe.

                            1. 6

                              That last line there made me laugh pretty hard and reflect on Google App Engine circa 2008. At the time it only supported Python, it had a 2 second request limit, and a cold start of your app had to fit into the 2 second limit.

                              We jumped through all kinds of hoops to satisfy that, including reverse engineering parts of the system so that we could execute backend requests in parallel even though the public API didn’t support that. And then the runtime limit was bumped from 2 seconds to 30 seconds.

                              And… a year or two later the CTO left and joined the GAE team!

                            2. 1

                              The article barely touches on the core problem: which relocation types are saved? Are they relative relocations?

                              From commits like “cmd/link,runtime: remove relocations from stkobjs”, it looks like the change turns pointers, which may require relative relocations, into uint32 offsets.
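
                              The shape of that transformation can be sketched in ordinary Go (illustrative types, not the runtime’s actual ones): a static pointer into a blob forces the linker to fix up an address, while a uint32 offset from a known base is plain constant data:

                              ```go
                              package main

                              import "fmt"

                              var blob = [...]byte{'a', 'b', 'c', 'd'}

                              // Before: a pointer into blob; its address must be
                              // filled in by the linker/runtime at load time.
                              var withPtr = struct{ p *byte }{p: &blob[2]}

                              // After: a plain offset; resolved at use as base + off,
                              // so nothing about it depends on where blob ends up.
                              var withOff = struct{ off uint32 }{off: 2}

                              func main() {
                                  fmt.Println(*withPtr.p)        // prints "99" ('c')
                                  fmt.Println(blob[withOff.off]) // prints "99" ('c')
                              }
                              ```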