Threads for hherman

  1. 5

    Ok, so Go isn’t Rust, Scala, or Haskell. Pretty much all of these criticisms could be leveled at Python or Ruby (at least without their optional type checkers).

    I don’t think it’s coincidental: Go is a C-flavored language that targets much the same programmer use cases as Python and Ruby. Lots of programming involves a lot of I/O, a lot of iteration over requirements that are completely ill-defined, and a need for code that’s easy to read. Yes, this potentially lets bugs hide and requires that you run the code to achieve reasonable certainty that it works, but there’s a reason we’re not all programming in Agda.

    1. 4

      I think your comment represents an interesting intermediate phase, one that precedes the point in the discussion where Go is compared directly to Rust alone.

      In the first part of the article, we are linked to two case studies for Go’s unsuitability as a low-level systems language. In the first case study, a standard library structure is mutable and large. In the second case study, the standard linker and GC have large footprints. Scala and Haskell are no better than Python and Ruby at managing runtime footprints or structure size, because none of them are low-level systems languages. This leaves Rust as the lone contender in the discussion.

      1. 5

        I don’t really believe that the majority of Go programs have much overlap with the majority of Rust programs. Sure there are cases where both Go and Rust are legitimate candidates for the same software, but I think Go is much more often competing with Java and Python than Rust, and Rust is more often competing with C++/C than Go.

        1. 2

          Go was intended to replace C++. This is a well-cited motivation; Wikipedia says, “[Go’s] designers were primarily motivated by their shared dislike of C++.” I was at Google when Go was ascendant, and it was internally pitched as a replacement for C++, which was (and still is) one of the main pillars of Google code.

          It was a curious turn of events when many Python and Java programmers switched to Go; it had not been anticipated. Go lacked the powerful dependency-injection tooling available to Java programmers, and also the flexible metaprogramming of Python; it wasn’t intended to be a compelling alternative for them.

          1. 2

            In fairness, Go is a good replacement for the use case of “I want C++ because it’s compiled and I’m not especially interested in safety or latency”.

    1. 3

      The TL;DR seems to be that the Go generics implementation treats all pointer-typed type parameters as being the same, so it fails to monomorphise, resulting in needlessly slow code.

      1. 10

        Slightly more TL;DR: it monomorphizes by “shape” rather than type, so it has to pass the vtable as a hidden parameter to the monomorphized functions, and the vtable parameter makes inlining hard for the Go compiler and results in extra pointer dereferences.
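
        For concreteness, here’s a minimal sketch (hypothetical types, not from the article): *A and *B below are different types with the same pointer “shape”, so the compiler stencils a single body for identity and tells the two calls apart only through the hidden dictionary argument.

        package main

        import "fmt"

        type A struct{ x int }
        type B struct{ y string }

        // identity is compiled once per "shape"; *A and *B are both pointers,
        // so they share one body plus a hidden dictionary argument that
        // describes the concrete type.
        func identity[T any](v T) T { return v }

        func main() {
            fmt.Println(identity(&A{x: 1}))    // shape: pointer
            fmt.Println(identity(&B{y: "hi"})) // same shape, same machine code
        }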

        It does have some good news though: using a generic byteseq is just as fast as using raw string and []byte!
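
        As a rough sketch of what that looks like (the constraint mirrors the article’s byteseq; hasPrefix is a made-up example):

        package main

        import "fmt"

        // byteseq matches both string and []byte, so a single generic
        // function can replace two near-identical implementations.
        type byteseq interface{ ~string | ~[]byte }

        func hasPrefix[T byteseq](s, prefix T) bool {
            if len(s) < len(prefix) {
                return false
            }
            for i := 0; i < len(prefix); i++ {
                if s[i] != prefix[i] {
                    return false
                }
            }
            return true
        }

        func main() {
            fmt.Println(hasPrefix("golang", "go"))               // true
            fmt.Println(hasPrefix([]byte("rust"), []byte("go"))) // false
        }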

        1. 10

          fails to monomorphise, resulting in needlessly slow code.

          But less code. It’s a tradeoff, not a mistake.

          1. 6

            Indeed, and one that the Go team worked very hard to calibrate to the right level for their context. Not every language should make the same choice Rust and C++ make of fully monomorphizing generics (in both cases with optional mechanisms [virtual in C++ and trait objects in Rust] to escape).
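
            To put that in Go terms, here’s a minimal sketch (hypothetical names): the interface version is the fully dynamic end of the spectrum, while the generic version sits in the middle Go chose, stenciled by shape but still carrying a dictionary.

            package main

            import "fmt"

            type tag struct{ s string }

            func (t tag) String() string { return t.s }

            // Dynamic dispatch: one compiled body; every call goes through
            // the interface's method table at run time.
            func joinDynamic(xs []fmt.Stringer) string {
                var out string
                for _, x := range xs {
                    out += x.String()
                }
                return out
            }

            // Generic: stenciled per shape and handed a dictionary, a middle
            // ground between full monomorphization and dynamic dispatch.
            func joinGeneric[T fmt.Stringer](xs []T) string {
                var out string
                for _, x := range xs {
                    out += x.String()
                }
                return out
            }

            func main() {
                ts := []tag{{"a"}, {"b"}}
                fmt.Println(joinGeneric(ts))                            // "ab"
                fmt.Println(joinDynamic([]fmt.Stringer{ts[0], ts[1]})) // "ab"
            }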

            1. 6

              +1

              Canonical post on the topic: https://research.swtch.com/generic

              1. 0

                There are many ways to reduce the amount of code generated, but there are very few cases where people would choose absurdly slow code over larger code. Honestly, this still just feels like the ongoing saga of the Go team hating generic programming and passive-aggressively making design choices for the express purpose of penalizing such code.

                1. 5

                  Honestly, this still just feels like the ongoing saga of the Go team hating generic programming and passive-aggressively making design choices for the express purpose of penalizing such code.

                  What if you’re wrong about their motives?

                  1. 1

                    I said “feels like”

                    What I do know is that the core Go team has spent years arguing against generics, and now that they’ve finally added support, the implementation’s performance characteristics seem significantly worse than those of pretty much every other implementation of generics outside of Java.

                    1. 4

                      You already admitted that you have no evidence to claim that ‘the core Go team has spent years arguing against generics’. You have no credibility to argue about Go.

            1. 6

              I hope the addition of generics doesn’t end with Go becoming indistinguishable from every other mainstream language. I like its quirks.

              1. 1

                Me. Too. I love Go’s perspective and I’d hate to see it become too similar to other languages.

              1. 14

                You mean Go wasn’t using registers for function parameters until now? 😧

                I’m surprised, because mainstream ABIs have been doing this for a very long time (except x86-32, which has too few registers). That it took Go this long seems like another casualty of rolling its own ABI.

                I remember PowerPC having another neat optimization that IIRC isn’t in x86 or ARM — the subroutine-call instruction put the return address in a special Link register, instead of pushing it on the stack. Then returning was a Branch To Link Register instruction. That way calling a leaf function required no memory access at all. (Non-leaf functions did still have to push a stack frame, including the saved LR value.)

                1. 7

                  I remember PowerPC having another neat optimization that IIRC isn’t in x86 or ARM — the subroutine-call instruction put the return address in a special Link register, instead of pushing it on the stack.

                  This is the same on Arm: bl (branch and link) puts the return address in lr (the link register). This design, I believe, comes from early RISC designs that wanted orthogonal instructions. As a design principle, only load/store instructions should load from or store to memory; everything else should access only internal CPU state.

                  Some architectures have a designated link register but most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register. The more general functionality means that branch-and-link instructions are bigger (they need 5 bits more space on architectures with 32 registers), which consumes encoding space that could probably be better used for other things. In particular, if you’ve got PC-relative addressing then you can emulate the general-purpose bl with an add of PC followed by a branch.

                  There was some debate in early RISC designs about whether branch-and-link was actually necessary. The compiler could always generate something like:

                  add lr, pc, 8
                  b {destination}
                  

                  This assumes the value of PC is the value at the start of the instruction (for interesting historical reasons on Arm it’s displaced forward) and that add and branch are each 4-byte instructions. The general consensus was that it was worth folding these into a single instruction because:

                  • Subroutine call is a very common operation (probably the only thing I disagree with Dijkstra on: he wrote an essay deploring special-purpose instructions for subroutine calls)
                  • A branch instruction does not write to any other register
                  • Any instruction reads the PC and so there’s no extra register read

                  I believe most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register and assuming that the tiny number of cases that wanted a different destination could use the two-instruction sequence above.

                  RISC-V, as is traditional, did the wrong thing and ended up using 5 bits in their JAL / JALR instructions to encode one bit of state (should the return address go to the link register or be discarded?). This means that JAL and JALR (J and JR are aliases for these that write the result to x0 [i.e. discard it]) consume 16 times as much encoding space as they need to. This is a problem because these instructions have large immediates and so consume a lot of encoding space even without the extra bloat. JAL consumes an entire major opcode, in spite of needing only 1/16th of one for exactly the same code density to within a rounding error. This means that they didn’t have the encoding space left for rich addressing modes.

                  1. 2

                    David, in the spirit of (US) Thanksgiving, I’m thankful for your detailed & informative replies on lobste.rs!

                    One quibble: it seems you contradicted yourself, saying first “most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register”, but then later “most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register”.

                    I think I agree that it’s not worth the expense to allow any register to act as a link register. It’s hard to think of how you’d take advantage of multiple registers for this, given that caller and callee have to agree on which register to use.

                    1. 3

                      David, in the spirit of (US) Thanksgiving, I’m thankful for your detailed & informative replies on lobste.rs!

                      Thanks!

                      One quibble: it seems you contradicted yourself, saying first “most RISC ISAs make the link register a GPR and allow the branch-and-link to write to any register”, but then later “most ISAs went the same way as Arm, providing a branch-and-link that uses an architecturally defined link register”.

                      Yes, sorry. ‘Most RISC ISAs’ is a bit of a fluffy concept. I think (from memory), Arm (A32 / T32 / A64) use a fixed link register, PowerPC allows any target, SPARC doesn’t (because it’s coupled to the register window concept), and I don’t remember what PA-RISC or Alpha did. There are lots of RISC ISAs that have died or never got large market share, so both are probably correct depending on how you count ‘most’. In terms of ‘most units shipped’, Arm alone probably counts. In terms of most distinct ISAs, I have no idea. In terms of most commercially successful ISAs, I think it’s probably fairly evenly split. I have no idea which of the meanings I meant each time or which one is actually correct for several interpretations.

                      I think this also shifted a bit over time. RISC-I and MIPS had a strong design principle that the ABI should not leak into the ISA. RISC-II and SPARC went the other way. Arm tries to avoid it, but is willing to compromise. In the extreme case, the stack pointer is not special in any way, but loads and stores on the stack are incredibly common (and a very large proportion of the static instruction count: loads and stores of heap or address-taken stack variables are often in loops, and so contribute less to the static instruction count relative to how often they appear in the dynamic instruction count) and so they’re worth treating as special cases.

                      I think I agree that it’s not worth the expense to allow any register to act as a link register. It’s hard to think of how you’d take advantage of multiple registers for this, given that caller and callee have to agree on which register to use.

                      It is useful in theory for leaf functions that are not exposed across a compilation-unit boundary. These can use a different link register (because they don’t have to respect the ABI) and use a calling convention that makes the normal link register callee-save. This allows the caller to avoid having to save the real link register. It’s also useful on architectures that don’t have PC-relative addressing.

                      If I remember correctly, PowerPC uses the ‘branch-and-link-to-next-instruction’ instruction as a ‘move PC to general-purpose register’ instruction (with special-case logic in the branch predictor to understand that it’s not really a jump if the immediate offset is +4), which allows it to avoid making the PC a general-purpose register and to avoid needing special-purpose instructions for PC-relative addressing. AArch64 and RISC-V, for example, both have explicit instructions for calculating an address with PC as the base; these consume quite a bit of encoding space, which PowerPC avoids by having a single cheap instruction to get the PC.

                      On 32-bit x86, you do something similar: call one instruction forward and pop from the stack. If you’re lucky, the implementation will be able to do some micro-op fusion. If not, then the fact that the call and the pop both modify the stack pointer means that you end up with some horrible interaction with register renaming. This composes incredibly badly with the fact that you have only six GPRs on x86, so if you need one for PC-relative addressing you’re down to five. If you need 64-bit arithmetic, you are now down to two pairs, which since all of the instructions are destructive means that you need more stack spills for any temporaries you want to preserve. This is why 32-bit Windows doesn’t use position-independent code for DLLs and instead does a fun constraint-solving thing to find static base addresses for each DLL that can work with every installed program and does static relocations in the DLLs.

                      1. 1

                        If I remember correctly, PowerPC uses the ‘branch-and-link-to-next-instruction’ instruction as a ‘move PC to general-purpose register’ instruction (with special-case logic in the branch predictor to understand that it’s not really a jump if the immediate offset is +4), which allows it to avoid making the PC a general-purpose register and to avoid needing special-purpose instructions for PC-relative addressing. AArch64 and RISC-V, for example, both have explicit instructions for calculating an address with PC as the base; these consume quite a bit of encoding space, which PowerPC avoids by having a single cheap instruction to get the PC.

                        FWIW, POWER9 did add an instruction to compute an address from the PC. They also “added” a new system call instruction, so I expect a new ABI.

                  2. 3

                    A couple of things can be made simpler.

                    For instance, when you pass parameters on the stack and assume that all registers are callee-clobbered, every function call can be used as a preemption point for goroutines. During the call, only the PC needs to be saved; all other state is already on the stack.

                    I believe I read that passing parameters on the stack also helped with stack growth and shrinking, when pointers to memory on the stack need to be rewritten: you then only have to find values in one buffer, the stack, rather than also in a secondary layout of saved registers on the stack or somewhere nearby.

                    1. 3

                      Thanks for pointing that out, BTW. This made me look at the proposal, which includes some justification for this design (yes, plan9). :)

                      1. 2

                        I feel like your post is one-sided. I don’t understand the details, but my impression, last I heard about it, was that there were very good reasons for Go to have its own ABI.

                        1. 5

                          Well, the segmented stacks originally used for goroutines may have pushed them to a nonstandard ABI. And there may be other ways that it got them some marginal optimizations. But they paid for it with extra overhead calling to/from any other language (including system calls) and incompatibility with other tooling like debuggers and crash reporters. Other non-interpreted languages have managed to provide GC and green threads without breaking the platform ABI.

                          (I used Go a lot at one point, but the architects’ resolute NIH (or, more accurately, NIBP9) attitude was one of the things that soured me on it.)

                        2. 1

                          MIPS and (I think) SPARC do this as well.

                          1. 4

                            ARM, AArch64, and RISC-V all do this as well. It seems to be just one of those small innovations that make life a little better for everyone.

                            1. 1

                              For RISC-V, it’s part of the ISA’s own ABI documents, even. RISC-V has 32 GPRs (31 if not counting x0, hardcoded to zero).

                              Register-based parameter passing is sanity; stack-based passing exists because of register-starved architectures.

                              Same deal with storing the return address of a call on the stack; RISC-V and other sane architectures save it into a register. Of course, it’ll end up on the stack anyway when making deeper calls.

                          2. 1

                            x86-32, which has too few registers

                            fastcall dates to, what, the early 90s?

                            1. 4

                              Fastcall isn’t always faster. Pushing to and popping from the stack is fairly quick on x86, but you only have 6 GPRs (and they’re not all really general-purpose; some instructions require specific registers), so if you pass arguments in registers then the caller needs to spill some things to make room for them, and the callee needs to spill them to make room for anything else it does.

                              Most RISC platforms have used register-based calling conventions since their creation and intentionally had enough registers to make this efficient. RISC-II and SPARC both optimised this further with register windows, where each function sees only a subset of the full register file, with some reserved for incoming arguments, some for local use, and some for outgoing arguments. With RISC-II this wasn’t great because it had a fixed number of windows (8, I think) and hitting this limit in call depth required some special-case behaviour.

                              SPARC, in contrast, arranges them in a ring so that when you hit the end you get a trap, spill some of them, and then can continue. The very latest UltraSPARC chips (which shipped after everyone who didn’t work at Oracle had given up on SPARC) apparently finally implemented the obvious extension to this and spill things from old register windows asynchronously. This means that any time the load-store pipeline is idle, the CPU can spill old registers in the background; unless you’re basically doing either recursive function calls with no work in them, or function calls with nothing other than memory ops at every recursion depth, you end up with a CPU that performs as if it had an unlimited number of registers.

                              On a modern CPU, register windows are not really subsets of the register file, they’re namespaces in the register rename engine, so the asynchronous spill gives the CPU a nice way of reclaiming rename registers when they’re under pressure.

                              1. 1

                                Yeah, I’m sure Borland C++ was using AX-DX for the first 4 params back in the day, and you needed to specifically override it if you wanted to use the stack only. Maybe Turbo Pascal, too.

                                1. 1

                                  Apologies, I’ve only used x86 on Macs, so I’m not familiar with the 32-bit ISA.

                              1. 1

                                Concise code is not in conflict with readability; it enables readability.

                                APL devs will be thrilled to see this.