1. 10
  1.  

  2. 3

    The fact that it works at all is amazing. However, 6502 is a really tough target for compiled languages. Even something as basic as having a standard function calling convention is expensive.

    1. 3

      GEOS has a pretty interesting calling convention for some of its functions (e.g. used at https://github.com/mist64/geowrite/blob/main/geoWrite-1.s#L82): Given that there’s normally no concurrency, and little recursive code, arguments can be stored directly in code:

          jsr function
          .byte arg1
          .byte arg2
      

      function then picks apart the return address to get at the arguments, then moves it forward before returning to skip over the data. A recursive function (where the same call site might be re-entered before leaving, with different arguments) would have to build a trampoline on a stack or something like that:

          lda #argcnt
          jsr trampoline
          .word function
          .byte arg1
          ...
          .byte argcnt
      

      where trampoline creates jsr function, a copy of the arguments + rts on the stack, messes with the return address to skip the arguments block, then jumps to that newly created contraption. But I’d rather just avoid recursive functions :-)

      1. 1

        Needing self-modifying code to handle function calls reminds me of the PDP-8, which didn’t even have a stack - the call instruction deposited the return address into the first word of the subroutine itself.

        1. 1

          Are those the actual arguments, with self-modifying code used to get non-constant data there? Or are the various .byte values the zero page addresses where the arguments can be found?

          That’s pretty compact at the call site, but a lot of work in the called function to access the arguments. It would be ok for big functions that are expensive anyway, but on 6502 you probably (for code compactness) want to call a function even for something like adding two 32 bit (or 16 bit) integers.

          e.g. to add a number at address 30-31 into a variable at address 24-25 you’d have at the caller …

              jsr add16
              .byte 24
              .byte 30
          

          … and at the called function …

          add16:
              pla             ; return address low byte (points at the last byte of the jsr)
              sta ARGP
              tax
              pla             ; return address high byte
              sta ARGP+1
              tay
              clc
              txa
              adc #2          ; skip the two inline argument bytes
              tax
              tya
              adc #0
              pha             ; push the adjusted address back, high byte first...
              txa
              pha             ; ...then low byte, so the final RTS pulls them in the right order
              ldy #1
              lda (ARGP),y    ; first argument: destination zero page address -> X
              tax
              iny
              lda (ARGP),y    ; second argument: source address -> Y
              tay
          
          add16_q:
              clc
              lda $0000,y     ; absolute,Y so the zero page source can be indexed with Y
              adc $00,x       ; add into the 16-bit variable at zero page X
              sta $00,x
              lda $0001,y
              adc $01,x
              sta $01,x
              rts
          

          So the stuff between add16 and add16_q is 28 bytes of code and 56 clock cycles. The stuff in add16_q is 16 bytes of code and 32 clock cycles including the RTS. The call to add16 is 5 bytes of code and 6 clock cycles.

          It’s possible to replace everything between add16 and add16_q with a jsr to a subroutine called, perhaps, getArgsXY. That will save a lot of code (because it will be used in many such subroutines) but add even more clock cycles – 12 for the JSR/RTS plus more code to pop/save/load/push the 2nd return address on the stack (26 cycles?).
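
          Something like this might work for getArgsXY (a sketch only: ARGP as above, and RETSAVE is another assumed free zero page pair):

          getArgsXY:
              pla             ; our own return address (back into add16), low byte
              sta RETSAVE
              pla             ; high byte
              sta RETSAVE+1
              pla             ; the original caller’s return address, low byte
              sta ARGP        ; ARGP -> last byte of the caller’s jsr
              pla
              sta ARGP+1
              clc
              lda ARGP
              adc #2          ; skip the two inline argument bytes
              tax
              lda ARGP+1
              adc #0
              pha             ; push the adjusted caller return address, high byte first
              txa
              pha
              lda RETSAVE+1   ; push our own return address back so the rts below
              pha             ; lands right after the jsr getArgsXY
              lda RETSAVE
              pha
              ldy #1
              lda (ARGP),y    ; first argument -> X
              tax
              iny
              lda (ARGP),y    ; second argument -> Y
              tay
              rts

          add16 then shrinks to a jsr getArgsXY that falls through into add16_q, at the cost of the extra jsr/rts and the shuffling of the second return address.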

          But there’s another way! And this is something I’ve used myself in the past.

          Keep add16_q and change the calling code to…

              ldx #24
              ldy #30
              jsr add16_q
          

          That’s 7 bytes of code instead of 5 (bad), and 10 clock cycles instead of 6 – but you get to entirely skip the 56 clock cycles of code at add16 (more like 94 cycles if you call a getArgsXY subroutine instead).

          You may quite often be able to omit the load immediate of X or Y because one or the other might be the same as the previous call, reducing the calling sequence to 5 bytes.
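
          For example, two consecutive calls where the source stays the same (the second variable at 26-27 is just for illustration; add16_q leaves X and Y untouched):

              ldx #24
              ldy #30
              jsr add16_q     ; variable at 24-25 += number at 30-31
              ldx #26
              jsr add16_q     ; variable at 26-27 += the same number; no ldy needed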

          If there’s some way to make add16 more efficient I’d be interested to know, but I’m not seeing it.

          Maybe you could get rid of all the PLA/PHA and use TSX;STX usp;LDX #1;STX usp+1 to duplicate the stack pointer in a 16-bit pointer in Zero Page, grab the return address using LDA instead of PLA, and increment the return address directly on the stack. It’s probably not much better, if at all.
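
          A sketch of that variant: the code between add16 and add16_q would become something like this, assuming usp is another free zero page pair:

              tsx             ; copy the hardware stack pointer
              stx usp
              ldx #1
              stx usp+1       ; usp -> $0100+S, one below the stacked return address
              ldy #1
              lda (usp),y     ; return address low byte, still on the stack
              sta ARGP
              clc
              adc #2          ; skip the two inline argument bytes
              sta (usp),y     ; adjust it in place
              iny
              lda (usp),y     ; return address high byte
              sta ARGP+1
              adc #0
              sta (usp),y
              ldy #1
              lda (ARGP),y    ; first argument -> X
              tax
              iny
              lda (ARGP),y    ; second argument -> Y
              tay

          Counting it up, it doesn’t seem to come out smaller or faster than the PLA/PHA version, which matches the guess that it’s probably not much better.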

          1. 1

            These calling conventions are provided for some functions only, and mostly the expensive ones. From the way it’s implemented for BitmapUp, without looking too closely at the macros, it seems they store the return address at a known address and index through that.

            GEOS has pretty complex functions and normally uses virtual registers in the zero page, so I guess this is more an optimization for constant calls: no need for endless lists of lda #value; sta $02; ... in your code. Since GEOS just copies the inline data into the virtual registers and calls the regular function, the only advantage of the format is compactness.
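
            Roughly, the two forms of a constant call would look like this (illustrative only: SomeFn, i_SomeFn and data are stand-ins, and $02/$03 is the first virtual register per the lda #value; sta $02 pattern above):

                ; regular call: load the virtual register by hand
                lda #<data
                sta $02
                lda #>data
                sta $03
                jsr SomeFn

                ; inline form: the wrapper copies the word after the jsr into $02/$03
                ; and then calls the regular function - same effect, more compact call site
                jsr i_SomeFn
                .word data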

        2. 2

          Likewise, I’m very impressed it works. Aside from you correctly pointing out how weak stack operations are on the 6502, however, it doesn’t generate even vaguely idiomatic 6502 assembly. That clear-screen extract was horrible.

          1. 2

            The 6502 is best used treating zero page as a lot of registers, with the same kind of calling convention as modern RISC (and x86_64) use: some number of registers used for passing arguments and return values and for temporary calculations inside a function (so that leaf functions don’t have to save anything), plus a certain number of registers that are preserved over function calls, which you have to save and restore if you want to use them. The rest of zero page can be used for globals, the same as .sdata referenced from a Global Pointer register on machines such as RISC-V or Itanium.

            If you do that then the only stack accesses needed are the push and pop of a set of registers. If you generate the code appropriately then you only have to know to save N registers on function entry and restore the same N and then return on function exit. You can use a small set of special subroutines for that, saving code size. RISC-V does exactly the same thing with the -msave-restore option to gcc or clang.

            Of course for larger programs you’ll want to implement your own stack (using two zero page locations as the stack pointer) for the saved registers. 256 bytes should be enough for just the function return addresses.
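
            A minimal sketch of such shared save/restore helpers, assuming a descending software stack with its 16-bit pointer SSP in zero page and four callee-saved zero page bytes starting at REGS (names and sizes illustrative):

            save4:              ; push REGS..REGS+3 onto the software stack (clobbers A and Y)
                sec
                lda SSP
                sbc #4
                sta SSP
                lda SSP+1
                sbc #0
                sta SSP+1       ; make room for 4 bytes (descending stack)
                ldy #3
            save4_loop:
                lda REGS,y
                sta (SSP),y
                dey
                bpl save4_loop
                rts

            restore4:           ; pop them back (clobbers A and Y)
                ldy #3
            restore4_loop:
                lda (SSP),y
                sta REGS,y
                dey
                bpl restore4_loop
                clc
                lda SSP
                adc #4
                sta SSP
                lda SSP+1
                adc #0
                sta SSP+1
                rts

            A function that wants four callee-saved bytes then does jsr save4 on entry and jsr restore4 just before its rts, which is essentially what -msave-restore does on RISC-V.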

            1. 1

              But I wonder how much of the zero page you can use without stepping on the locations reserved for ROM routines, particularly on the Apple II. It’s been almost three decades since I’ve done any serious programming on the Apple II, but didn’t its ROM reserve some zero-page locations for allowing redirection of ROM I/O routines? If I were programming for that platform today, I’d still want to use those routines, so that, for example, the Textalker screen reader (used in conjunction with the Echo II card) would work. My guess is that similar considerations would apply on the C64.

              1. 1

                The monitor doesn’t use a lot. AppleSoft uses a lot more, but that’s ok because it initialises what it needs on entry.

                https://pbs.twimg.com/media/E_xJ5oWUYAAUo3a?format=jpg&name=4096x4096

                Seems a shame now to have defaced the manual, but in my defence I did it 40 years ago.

              2. 1

                Now that I’ve looked into the implementation, I see they’re doing something like this, but using only 4 zero page bytes as caller-saved registers. That is nowhere near enough!

                Even 32-bit ARM uses 4 argument registers, which should probably translate to 8 bytes on the 6502 (four pointers or 16-bit integers).

                x86_64, which has the same number of registers as arm32, uses six argument registers. RISC-V uses 8 argument registers, plus another 7 “temporary” registers which a called function is free to overwrite. PowerPC uses 8 argument registers.

                6502 effectively has 128 16-bit registers (the size of pointers or int). There is no reason why you shouldn’t be at least as generous with argument and temporary registers as the RISC ISAs that have 32 registers.

                I’d suggest maybe 16 bytes for caller-save (arguments), 16 bytes for temporaries, 32 bytes for callee-save. That leaves 192 bytes for globals (2 bytes of which will be the software stack pointer).
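
                As a sketch, that split might look like this (addresses purely illustrative; a real layout would also have to dodge whatever the ROM claims, as discussed above):

                    ARG0 = $10      ; $10-$1F: 16 argument / return value bytes (caller-saved)
                    TMP0 = $20      ; $20-$2F: 16 temporaries a called function may clobber
                    SAV0 = $30      ; $30-$4F: 32 bytes preserved across calls (callee-saved)
                    SSP  = $50      ; $50-$51: software stack pointer for the saved registers
                    ; the remaining 192 bytes ($00-$0F and $50-$FF, SSP included) are for globals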

                1. 1

                  Where are you going to save them? In the 256-byte stack the 6502 has? Even if the stack wasn’t limited, you still only have at most 65,536 bytes of memory to work with.

                  1. 1

                    It would be cool to see how this would look if it were built to expect bank-switching hardware.

                    1. 1

                      I quote myself:

                      Of course for larger programs you’ll want to implement your own stack (using two zero page locations as the stack pointer) for the saved registers. 256 bytes should be enough for just the function return addresses.

                      64k of total memory is of course a fundamental limitation of the 6502, so it’s irrelevant to the details of code generation and calling convention you use. Other than that, you want code that is as compact as possible, of course.