1. 13
  1.  

    1. 10

      The presentation gives absolutely zero information on the actual ISA, just some desirable features.

      In order to find out what’s actually going on I had to clone their GitHub repo, install TeX, learn how to find and install the missing “bytefield” package, and generate a PDF. Or I could have read the TeX sources I guess, but ick.

      So, in short, there’s an 8 bit status register and 16 bit PC, SP, and X / Y / Z registers, the latter three of which can be split into H and L halves for 8 bit work. The 8 bit instructions generally use XL as an accumulator, and the 16 bit instructions similarly use X. There are some prefix bytes available to tell it to use something else as the accumulator. There are 8 and 16 bit immediates/offsets. So basic instructions are one byte, optionally plus 1 or 2 immediate bytes, optionally plus a prefix byte.

      There are no code examples (at least in the manual), and no justification that the result gives more compact code overall than 8051 or RISC-V or Arm Thumb{2} or M6809.

      One thing I can say immediately is there aren’t enough registers to compile …

      char* memcpy(char *dst, char *src, size_t n) {
        char *ret = dst;
        char *limit = src + n;
        while (src < limit) *dst++ = *src++;
        return ret;
      }
      

      … (or any other similar formulation) without spilling values to the stack. In fact you can’t even code the while loop as it needs all three 16 bit registers leaving nowhere to load/store the actual data to, and there is no ld (x),(y) or similar instruction (at most one operand is in memory).

      So from my point of view, as someone who has tried designing his own ISA with similar goals, this fails at the first hurdle – if you can’t efficiently implement the functions in string.h then it’s just a fail before you even look at anything else.

      The best you can do here is the 6502 trick of nested loops: put the low 8 bits of n in, say, XH, and use XL to load/store the character data, with a DJNZ on XH, and an outer loop that loads the upper 8 bits of n from the stack, decrements it if it’s not zero, and starts another inner loop copying up to 256 bytes. Using Y and Z for src and dst pointers, obviously.

      That will be fast, but it’s big code because of the nested loops and needing to spill and repeatedly reload the upper bits of n. You also need to spill the original value of dst so you can return it as the function result at the end (this is seldom used in practice, but the memcpy() spec requires it).
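
      As a rough illustration of that nested-loop shape, here is a C sketch. The name mcpy_nested and the uint16_t count are assumptions made for illustration; it only models the DJNZ-style 8 bit inner counter (where a count of 0 wraps around to mean 256) and is not real f8 code.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical name: a C model of the nested-loop copy, not f8 code. */
      char *mcpy_nested(char *dst, const char *src, uint16_t n)
      {
          char *ret = dst;                     /* spilled so it can be returned */
          uint8_t count = (uint8_t)(n & 0xFF); /* low byte of n: inner counter  */
          uint8_t blocks = (uint8_t)(n >> 8);  /* high byte: 256-byte blocks    */

          assert(dst != NULL && src != NULL);

          if (count == 0) {
              if (blocks == 0)
                  return ret;                  /* n == 0: nothing to copy */
              --blocks;                        /* count 0 will run as a full 256 */
          }
          for (;;) {
              /* DJNZ-style inner loop: decrement, branch if not zero.
               * Starting from 0 it wraps to 255, so it runs 256 times. */
              do {
                  *dst++ = *src++;
              } while (--count != 0);

              if (blocks == 0)
                  break;
              --blocks;                        /* start another 256-byte block */
          }
          return ret;
      }
      ```

      The inner loop only ever touches an 8 bit counter and two pointers, which is what makes it fit the register set; the cost is the extra outer-loop bookkeeping.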

      It’s not going to be easy to get a C compiler to automatically make the transformation to nested loops. Probably the best that can be done with the given registers is to store limit on the stack and use a compare instruction between it and the src register. But 16 bit compares are only with constants. Sooo … I think the best that can be done is xchw y, (n, sp) to swap dst and limit or dst and n between a register and a stack location.

      I think the minimum viable register set for reasonable compiled code is four 16 bit GPRs, all usable for integer calculations, at least two usable as pointers (but preferably three so you can do *x = *y + *z). Preferably plus a separate SP. And of course PC.

      This is basically DG NOVA, which only had the 4 GPRs (any of which could be used as a stack pointer), and ECLIPSE, which added a dedicated SP.

      I’d like to see some significant examples of compiled C, because I have very strong doubts that in practice this ISA is going to give more compact code in real-world situations than RV32EC (in the $0.10 CH32V003), MSP430, or Cortex-M0+ (in RP2040, or in the also $0.10 PY32F002A).

      1. 5

        Wow! Seems like they did a bad job of recreating the 6809, which has two 8-bit accumulators (which can be combined into a 16-bit accumulator), and four 16-bit index registers. Your memcpy() function is just:

        ;--------------
        ; X - src
        ; U - dest
        ; Y - count
        ;---------------
        
        memcpy  lda     ,x+
                sta     ,u+
                leay    -1,y
                bne     memcpy
                rts
        

        In fact, the 6809 was designed to be an 8-bit CPU that could be targeted by higher level languages like C and Pascal.

        1. 5

          Yup, the 6809 – which I was using back in 1983, and we created a BCPL back end for – meets my “minimum useful registers” spec, with four pointer registers (you can even repurpose S temporarily if you turn off interrupts and have some global place to stash it).

          Your lda, sta, and leay all need 1 indexing postbyte, and bne needs an offset, so you’ve got 9 bytes of code there.

          But you don’t quite meet the memcpy spec. Your code doesn’t work if count is 0, so you’ll need an extra leay 0,y and beq ret. Also you don’t return the original dst. Adding stu ,--s at the start and ldd ,s++ (or some other 2 byte register) before the rts will fix that, for a total of 17 bytes of code.
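
          To make that byte count easy to check, here is a small C tally. The instruction order, the loop and ret labels, and the helper name total_bytes are assumptions for illustration; the sizes follow the reasoning above (one opcode byte plus one postbyte or branch offset, and a one-byte rts).

          ```c
          #include <stddef.h>

          struct insn {
              const char *text;   /* assembly text, for reference only */
              unsigned bytes;     /* encoded size in bytes */
          };

          /* Assumed layout of the corrected routine described above. */
          static const struct insn corrected_6809[] = {
              { "stu  ,--s",  2 },  /* save original dst        */
              { "leay 0,y",   2 },  /* set Z if count is 0      */
              { "beq  ret",   2 },
              { "lda  ,x+",   2 },  /* loop:                    */
              { "sta  ,u+",   2 },
              { "leay -1,y",  2 },
              { "bne  loop",  2 },
              { "ldd  ,s++",  2 },  /* ret: reload original dst */
              { "rts",        1 },
          };

          /* Sum the encoded sizes of a list of instructions. */
          unsigned total_bytes(const struct insn *code, size_t count)
          {
              unsigned sum = 0;
              for (size_t i = 0; i < count; i++)
                  sum += code[i].bytes;
              return sum;
          }
          ```

          Calling total_bytes on that table gives the 17 bytes quoted: the original 9 plus two 2-byte instructions for the zero check and two more for saving and restoring dst.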

          ARMv7 needs 22 bytes for the same function, and RISC-V RV32IC needs 24 bytes.

          The RISC-V Zcb extension found in e.g. the RP2350 (Pi Pico 2) and the RVA23 profile allows 20 bytes by giving 2-byte instructions for lbu and sb with 0 offset:

          memcpy:
              beqz    a2,ret
              add     a2,a1,a2
              mv      a5,a0
          loop:
              lbu     a4,0(a1)
              sb      a4,0(a5)
              addi    a1,a1,1
              addi    a5,a5,1
              bgtu    a2,a1,loop
          ret:
              ret
          

          All instructions are 2 bytes except the bgtu a2,a1,loop which is 4 bytes.

          So that’s bigger than the corrected 6809 function, but not much! Just 3 bytes. One of those comes from ret being just one byte. The others essentially come from the autoincrement addressing modes.

          1. 3

            I think Thumb 2 is 12 bytes?

            memcpy:
                cbz r2, done
            loop:
                subs r2, r2, #1
                ldrb r3, [r1, r2]
                strb r3, [r0, r2]
                bne loop
            done:
                bx lr
            
            1. 2

              Nice, and ok for memcpy() as the order is not defined.
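
              For reference, that loop corresponds roughly to the following C. This is a sketch; the name mcpy_desc is hypothetical. It copies from the highest index down, which is fine for memcpy() since the regions may not overlap.

              ```c
              #include <stddef.h>

              /* Hypothetical name: mirrors the descending-index Thumb loop. */
              char *mcpy_desc(char *dst, const char *src, size_t n)
              {
                  while (n != 0) {        /* cbz on entry, bne at the bottom    */
                      --n;                /* subs r2, r2, #1                    */
                      dst[n] = src[n];    /* ldrb / strb with [base, index]     */
                  }
                  return dst;             /* dst (r0) is never modified         */
              }
              ```

              Only the index changes, so neither pointer needs incrementing and dst is already the return value.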

              But I was thinking of Cortex-M0 as the small area CPU that competes against 8 bit cores, and it doesn’t have cbz [1], and the ldrb will, I think, nuke the flags from the sub? Or no ... I haven’t grovelled around with 32 bit Arm for a while; I used to do a lot of asm in ARM7TDMI days.

              This is a case where the indexed addressing that RISC-V currently lacks does make a difference, both in reducing the number of things that need to be incremented/decremented, and in leaving dst untouched

              But it certainly reinforces the point that you don’t need an 8 bit CPU with byte-oriented instructions to get small code.

              [1] But a newer version of it could, if desired, as it’s a 2-byte instruction.

              1. 1

                If my ancient ARM knowledge is not deceiving me, I think the only normal instructions that affect the flags are comparison instructions and arithmetic instructions with the S bit set. (Or loading flags from memory.) Controlling when the flags are set allows neat predicate chains in things like if statements, eg (in arm 32)

                    if (a != b && a != c)
                        stuff;
                
                    cmp a, b
                    cmpne a, c
                    beq else
                    stuff
                else:
                

                I haven’t played around with thumb2 IT instructions but they seem quite fun. On the other hand the breakeven point for predication vs branches is quite tight so there isn’t much incentive to get really silly with them :-)

                Of course, replacing the cbz with cmp r2, #0; beq done only costs 2 more bytes so it’s still not too bad.

                1. 1

                  Once again, I’m looking at the small cores that compete with 8 bit, so Thumb1 / ARMv6-M. You don’t get a choice about an “S” bit on those. Instructions either set flags or don’t.

                  But on checking, you’re right. MOV sets flags but loads including pop don’t set flags so yeah you’re ok with doing the SUB up front.

                  Unfortunately I couldn’t find how to get gcc to generate your code. I can get the same number of bytes (modulo the cmp #0 thing) with -Os, but it’s slower because it unconditionally branches back to the start of the function.

                  https://godbolt.org/z/43vTzqbeT

                  If I reduce the optimisation level then it doesn’t realise the load and store don’t affect the flags and does an extra compare:

                  https://godbolt.org/z/Gh6qWa835

                  Clang is worse and de-optimises it to use three subs and a push and pop for a stack frame at all opt levels.

                  Maybe I’ll install the latest SDCC and see what it generates for this f8 arch from C.

                  1. 1

                    I was mostly using ARM’s 16 bit instruction set quick reference card

                    I know the S suffix doesn’t correspond to a bit in the machine code for 16 bit instructions, but a good proportion of the ALU ops still come in both S and non-S versions (including MOV) so it feels pretty familiar to me. Which was of course the point of Thumb :-)

                    1. 2

                      So I downloaded the latest SDCC and compiled the C code using --opt-code-size

                        000000                         52 _mcpy:
                        000000 EA FA            [ 1]   53         addw    sp, #-6
                                                       54 ;       memcpy.c: 2: char *ret = dst, *limit = src + sz;
                        000002 CC               [ 2]   55         ldw     z, y
                        000003 C9 00            [ 1]   56         ldw     (0, sp), y
                        000005 C2 08            [ 1]   57         ldw     y, (8, sp)
                        000007 7A 0A            [ 1]   58         addw    y, (10, sp)
                        000009 C9 02            [ 1]   59         ldw     (2, sp), y
                                                       60 ;       memcpy.c: 3: while (src < limit) *dst++ = *src++;
                        00000B C2 08            [ 1]   61         ldw     y, (8, sp)
                        00000D C9 04            [ 1]   62         ldw     (4, sp), y
                        00000F                         63 00101$:
                        00000F C2 04            [ 1]   64         ldw     y, (4, sp)
                        000011 72 02            [ 2]   65         subw    y, (2, sp)
                        000013 D4 0E            [ 1]   66         jrc     #00103$
                        000015 C2 04            [ 1]   67         ldw     y, (4, sp)
                        000017 84               [ 1]   68         ld      xl, (y)
                        000018 A5 04            [ 1]   69         incw    (4, sp)
                        00001A 8D 00 00         [ 5]   70         ld      (0, z), xl
                        00001D 9E A7            [ 1]   71         incw    z
                        00001F D0 F0            [ 1]   72         jr      #00101$
                        000021                         73 00103$:
                                                       74 ;       memcpy.c: 4: return ret;
                        000021 C2 00            [ 1]   75         ldw     y, (0, sp)
                                                       76 ;       memcpy.c: 5: }
                        000023 EA 06            [ 1]   77         addw    sp, #6
                        000025 BA               [ 1]   78         ret
                      

                      That’s 38 bytes of code, so quite a lot worse than any Arm or RISC-V. As expected it uses xl for the byte being copied, uses z only for the current updated value of dst, and uses y both for src and to compare src to limit – which it can only do by destructively subtracting them.

                      We can tune the C source a bit for this compiler, but not all that much … still 32 bytes…

                        000000                         52 _mcpy:
                        000000 EA FC            [ 1]   53         addw    sp, #-4
                                                       54 ;       memcpy.c: 2: char *ret = dst;
                        000002 C9 00            [ 1]   55         ldw     (0, sp), y
                                                       56 ;       memcpy.c: 3: while (sz) {
                        000004 9E C2 06         [ 1]   57         ldw     z, (6, sp)
                        000007 C9 02            [ 1]   58         ldw     (2, sp), y
                        000009                         59 00101$:
                        000009 B5 08            [ 1]   60         tstw    (8, sp)
                        00000B D2 10            [ 1]   61         jrz     #00103$
                                                       62 ;       memcpy.c: 4: *dst++ = *src++;
                        00000D 83 00 00         [ 9]   63         ld      xl, (0, z)
                        000010 9E A7            [ 1]   64         incw    z
                        000012 C2 02            [ 1]   65         ldw     y, (2, sp)
                        000014 8E               [10]   66         ld      (y), xl
                        000015 A5 02            [ 1]   67         incw    (2, sp)
                                                       68 ;       memcpy.c: 5: --sz;
                        000017 F7 08            [ 1]   69         decw    (8, sp)
                        000019 D0 F0            [ 1]   70         jr      #00101$
                        00001B                         71 00103$:
                                                       72 ;       memcpy.c: 7: return ret;
                        00001B C2 00            [ 1]   73         ldw     y, (0, sp)
                                                       74 ;       memcpy.c: 8: }
                        00001D EA 04            [ 1]   75         addw    sp, #4
                        00001F BA               [ 1]   76         ret
                      

                      Hand coding asm could improve this a bit, but not to the gcc results for Arm and RISC-V.
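
                      For reference, the source comments interleaved in that second listing suggest the tuned C looked roughly like this. It is a reconstruction: the exact signature and the size_t type are assumptions, and only the body lines appear in the listing.

                      ```c
                      #include <stddef.h>

                      /* Reconstructed from the memcpy.c comments in the listing;
                       * the parameter types are assumed. */
                      char *mcpy(char *dst, char *src, size_t sz)
                      {
                          char *ret = dst;
                          while (sz) {
                              *dst++ = *src++;
                              --sz;
                          }
                          return ret;
                      }
                      ```

                      Counting sz down replaces the limit pointer and the destructive 16 bit subtract with a tstw / decw pair on the stack slot.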

                      While we’ve got SDCC let’s try its code generation for some other 8 bit microprocessors … I won’t show 6502 (76 bytes), 8051 (81 bytes), hc08 (80 bytes), or stm8 (39 bytes), but here is z80 at 32 bytes (same as f8):

                      00000000                         47 _mcpy::
                      00000000 DD E5            [15]   48         push    ix
                      00000002 DD 21 00 00      [14]   49         ld      ix,#0
                      00000006 DD 39            [15]   50         add     ix,sp
                                                       51 ;memcpy.c:2: char *ret = dst;
                      00000008 E5               [11]   52         push    hl
                                                       53 ;memcpy.c:3: while (sz) {
                      00000009 DD 4E 04         [19]   54         ld      c, 4 (ix)
                      0000000C DD 46 05         [19]   55         ld      b, 5 (ix)
                      0000000F                         56 00101$:
                      0000000F 78               [ 4]   57         ld      a, b
                      00000010 B1               [ 4]   58         or      a, c
                      00000011 28 07            [12]   59         jr      Z, 00103$
                                                       60 ;memcpy.c:4: *dst++ = *src++;
                      00000013 1A               [ 7]   61         ld      a, (de)
                      00000014 13               [ 6]   62         inc     de
                      00000015 77               [ 7]   63         ld      (hl), a
                      00000016 23               [ 6]   64         inc     hl
                                                       65 ;memcpy.c:5: --sz;
                      00000017 0B               [ 6]   66         dec     bc
                      00000018 18 F5            [12]   67         jr      00101$
                      0000001A                         68 00103$:
                                                       69 ;memcpy.c:7: return ret;
                      0000001A D1               [10]   70         pop     de
                                                       71 ;memcpy.c:8: }
                      0000001B DD E1            [14]   72         pop     ix
                      0000001D E1               [10]   73         pop     hl
                      0000001E F1               [10]   74         pop     af
                      0000001F E9               [ 4]   75         jp      (hl)
                      
                      1. 3

                        Here’s a reasonably optimal hand-written 6502 memcpy, at 43 bytes

                                ;; args dst in X,A  src, sz in ZP
                                ;; return value in X,A
                                
                                dstl = 0
                                dsth = 1
                                srcl = 2
                                srch = 3
                                szl = 4
                                szh = 5
                                reth = 6
                        
                                .area CODE
                        _mcpy:
                                sta dstl
                                stx dsth
                                stx reth
                                ldy #0
                                ldx szh
                                beq skip_pages
                        copy_pages:
                                lda (srcl),y
                                sta (dstl),y
                                iny
                                bne copy_pages
                                inc srch
                                inc dsth
                                dex
                                bne copy_pages
                        skip_pages:
                                ldx szl
                                beq done
                        copy_tail:
                                lda (srcl),y
                                sta (dstl),y
                                iny
                                dex
                                bne copy_tail
                        done:
                                lda dstl
                                ldx reth
                                rts
                        
            2. 1

              Fair enough on my code, but you can still save a byte by doing puls d,pc at the end instead of ldd ,s++ ; rts.

              1. 1

                Ah yes, good point. PC is the first register pushed and the last popped, so that will work. In contrast to ARM where LR is the last register pushed and PC the first popped, so you have to do them all at once, you can’t push extras later, after the LR/PC (or else you’d have to do multiple pops)

                1. 1

                  ARM’s LDM / STM instructions always map higher register numbers to higher addresses. When you STMFD the LR is pushed first to the highest address, and when you LDMFD the PC is popped last.

                  https://developer.arm.com/documentation/dui0552/a/the-cortex-m3-instruction-set/memory-access-instructions/ldm-and-stm

                  (ARM seems to have a separate copy of their instruction set reference for each core, and lots of them don’t explain this point!)
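
                  A small C model of that rule, as a sketch: stmfd_layout is a hypothetical helper, not anything from ARM’s documentation. It computes where a full-descending store-multiple places each register in the list.

                  ```c
                  #include <assert.h>
                  #include <stdint.h>

                  /* Hypothetical helper modelling STMFD sp!, {reglist}.
                   * Bit r of reglist selects register r. addr_of[r] receives the
                   * address register r is stored at (left untouched if r is not
                   * in the list). Returns the updated stack pointer. */
                  uint32_t stmfd_layout(uint32_t sp, uint16_t reglist, uint32_t addr_of[16])
                  {
                      unsigned count = 0;
                      for (unsigned r = 0; r < 16; r++)
                          if (reglist & (1u << r))
                              count++;
                      assert(count > 0);               /* empty lists are not valid */

                      uint32_t new_sp = sp - 4u * count;   /* full descending stack */
                      uint32_t addr = new_sp;
                      for (unsigned r = 0; r < 16; r++) {
                          if (reglist & (1u << r)) {
                              addr_of[r] = addr;   /* lowest register, lowest address */
                              addr += 4;
                          }
                      }
                      return new_sp;
                  }
                  ```

                  For {r4, r5, lr} with sp at 0x1000, this puts r4 at 0x0FF4, r5 at 0x0FF8 and lr at 0x0FFC, so the LR ends up at the highest address regardless of the order the registers are written in the source.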

                  1. 2

                    Oh, thanks for that! I had the ARM7TDMI reference handy and could not find a statement on the order in memory, and have never had to care as I haven’t mixed mismatched push/pop. I asked ChatGPT and it led me astray :-(

            3. 1

              It also made a pretty fair Forth processor with the dual stacks. If that’s your particular kink.

            4. 2

              install tex and learn how to find and install the missing “bytefield” package

              Or use Tectonic which is a single executable and will fetch all dependencies on-demand.

              1. 1

                Good to know if I ever need to use TeX again, which seems unlikely :-) It postdates my time at university, and asciidoc/markdown and friends seem to do the job for the things I touch these days.

              2. 1

                Ew. I thought new, extremely register-starved ISAs (even 8-bit ones) became passé the minute AVR showed up.

                This is basically DG NOVA

                Great call.

                1. 2

                  I was using the 32 bit Eclipse MV – still with just 4 GPRs – from late 1984 until the early 90s. Mostly in PL/I, but from time to time reading or writing a little assembly language. They were a strong competitor to VAX despite the minimalist ISA, just as NOVA had been to PDP-11.

                  1. 1

                    In the mid-00s, a buddy of mine owned some colo facilities and was giving me a tour of one of them, and in the middle of one of the good size cages was a fair sized DG Eclipse install, couple of hundred sq ft, complete with 9-tracks, a row of washing machine disk drives and a couple of Dasher terminals[1]. Buddy said it was a “financial client” and it ran a mission critical app that they’d been trying to decom for decades but had consistently failed, and when they shut down the DC it was living in the other corp DCs wouldn’t take it on, so it ended up there. No idea when they finally turned it off.

                    [1] Sorry…I don’t know the DGs well enough to know what all of the kit was by sight. I was a DEC guy :-).

                    1. 2

                      I used VAX at university (only 780 and 750 in those days), and DG when I got into the real world. They were pretty popular. They claimed 2x price/performance over VAX, and I wouldn’t dispute that.

                      My first task when I left university and joined a stockbroking / FX / bonds / M&A company as their first in-house programmer was to choose a compiler (for like $20k or so I think) for their 2 MB (4?) MV10000. Previously they’d only been running the off-the-shelf (or at least vendor-customised) MOCOM financial software which was written in interpreted (bytecode) ICOBOL.

                      My choices were COBOL, FORTRAN, or PL/I. Those were the days.

                      Oh, just found an old reference to MOCOM:

                      https://www.afr.com/politics/mocom-in-london-19911007-k4mmu

                      https://www.icobol.com/products/icobol.shtml

              3. 5

                A bit of an unfortunate name since the Fairchild F8 already exists. But I see someone in the audience pointed this out as well.