The presentation gives absolutely zero information on the actual ISA, just some desirable features.
In order to find out what’s actually going on I had to clone their GitHub repo, install TeX, learn how to find and install the missing “bytefield” package, and generate a PDF. Or I could have read the TeX sources I guess, but ick.
So, in short, there’s an 8 bit status register and 16 bit PC, SP, and X / Y / Z registers, the latter three of which can be split into H and L halves for 8 bit work. The 8 bit instructions generally use XL as an accumulator, and the 16 bit instructions similarly use X. There are some prefix bytes available to tell it to use something else as the accumulator. There are 8 and 16 bit immediates/offsets. So basic instructions are one byte, optionally plus 1 or 2 immediate bytes, optionally plus a prefix byte.
There are no code examples (at least in the manual), and no justification that the result gives more compact code overall than 8051 or RISC-V or Arm Thumb{2} or M6809.
One thing I can say immediately is there aren’t enough registers to compile …
… (or any other similar formulation) without spilling values to the stack. In fact you can’t even code the while loop as it needs all three 16 bit registers leaving nowhere to load/store the actual data to, and there is no ld (x),(y) or similar instruction (at most one operand is in memory).
So from my point of view, as someone who has tried designing his own ISA with similar goals, this fails at the first hurdle – if you can’t efficiently implement the functions in string.h then it’s just a fail before you even look at anything else.
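For concreteness, the kind of formulation being discussed is presumably along these lines. This is only a sketch, not the elided code: the name `memcpy_limit` and the exact loop shape are assumptions, chosen to show the three live 16 bit pointer values (`src`, `dst`, `limit`) plus the byte in flight.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pointer-and-limit memcpy, assumed for illustration only.
 * Live values inside the loop: s, d and limit (three pointers) plus one
 * data byte, and the original dst must survive for the return value. */
void *memcpy_limit(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const unsigned char *limit = s + n;

    while (s != limit)
        *d++ = *s++;    /* the data byte needs a register of its own */

    return dst;         /* memcpy() must return the original dst */
}

/* Small self-check for the sketch above. */
void memcpy_limit_check(void)
{
    const char in[] = "f8 ISA";
    char out[sizeof in] = { 0 };

    assert(memcpy_limit(out, in, sizeof in) == out);
    assert(memcmp(out, in, sizeof in) == 0);
}
```

With only X, Y and Z available, the three pointers already fill every 16 bit register, which is the spilling problem described above.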
The best you can do here is the 6502 trick of nested loops: put the low 8 bits of n in, say, XH, and use XL to load/store the character data, with a DJNZ on XH, and an outer loop that loads the upper 8 bits of n from the stack, decrements it if it’s not zero, and starts another inner loop copying up to 256 bytes. Using Y and Z for src and dst pointers, obviously.
That will be fast, but it’s big code because of the nested loops and needing to spill and repeatedly reload the upper bits of n. You also need to spill the original value of dst so you can return it as the function result at the end (this is seldom used in practice, but the memcpy() spec requires it).
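The nested-loop transformation can be sketched in C roughly as follows. This is an illustration of the idea, not generated or hand-tuned f8 code: the function name is made up, `n` is narrowed to 16 bits to match the registers discussed, and the register assignments in the comments are just the ones suggested above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the "6502 trick": an 8 bit inner counter, with the high
 * byte of n kept elsewhere (on the stack, on this ISA). */
void *memcpy_nested(void *dst, const void *src, uint16_t n)
{
    unsigned char *d = dst;             /* would live in Z */
    const unsigned char *s = src;       /* would live in Y */
    uint8_t lo = (uint8_t)(n & 0xFF);   /* inner counter, e.g. XH */
    uint8_t hi = (uint8_t)(n >> 8);     /* spilled, reloaded per outer pass */

    if (lo != 0) {                      /* partial block first */
        do { *d++ = *s++; } while (--lo != 0);
    }
    while (hi != 0) {                   /* then hi full blocks of 256 bytes */
        --hi;
        /* lo is 0 here; wrapping 0 -> 255 makes this loop run 256 times */
        do { *d++ = *s++; } while (--lo != 0);
    }
    return dst;
}

/* Self-check around the 256-byte block boundary. */
void memcpy_nested_check(void)
{
    static unsigned char in[700], out[701];
    const uint16_t sizes[] = { 0, 1, 255, 256, 257, 700 };

    for (size_t i = 0; i < sizeof in; i++)
        in[i] = (unsigned char)(i * 7 + 1);

    for (size_t k = 0; k < sizeof sizes / sizeof sizes[0]; k++) {
        memset(out, 0xAA, sizeof out);
        assert(memcpy_nested(out, in, sizes[k]) == out);
        assert(memcmp(out, in, sizes[k]) == 0);
        assert(out[sizes[k]] == 0xAA);  /* nothing written past n */
    }
}
```

The inner `do ... while (--lo != 0)` is the part that maps onto a DJNZ-style instruction; everything around it is the extra code size being complained about.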
It’s not going to be easy to get a C compiler to automatically make the transformation to nested loops. Probably the best that can be done with the given registers is to store limit on the stack and use a compare instruction between it and the src register. But 16 bit compares are only with constants. Sooo … I think the best that can be done is xchw y, (n, sp) to swap dst and limit or dst and n between a register and a stack location.
I think the minimum viable register set for reasonable compiled code is four 16 bit GPRs, all usable for integer calculations, at least two usable as pointers (but preferably three so you can do *x = *y + *z). Preferably plus a separate SP. And of course PC.
This is basically DG NOVA, which only had the 4 GPRs (any of which could be used as a stack pointer), and ECLIPSE, which added a dedicated SP.
I’d like to see some significant examples of compiled C, because I have very strong doubts that in practice this ISA is going to give more compact code in real-world situations than RV32EC (in the $0.10 CH32V003), or MSP430, or Cortex-M0+ (in RP2040, or in the also $0.10 PY32F002A).
Wow! Seems like they did a bad job of recreating the 6809, which has two 8-bit accumulators (which can be combined into a 16-bit accumulator), and four 16-bit index registers. Your memcpy() function is just:
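;--------------
; X - src
; U - dest
; Y - count
;---------------
memcpy  lda     ,x+     ; load byte from src, post-increment X
        sta     ,u+     ; store byte to dest, post-increment U
        leay    -1,y    ; count = count - 1 (LEAY updates the Z flag)
        bne     memcpy  ; loop until count reaches zero
        rts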
In fact, the 6809 was designed to be an 8-bit CPU that could be targeted by higher level languages like C and Pascal.
Yup, the 6809 – which I was using back in 1983, and we created a BCPL back end for – meets my “minimum useful registers” spec, with four pointer registers (you can even repurpose S temporarily if you turn off interrupts and have some global place to stash it).
Your lda, sta, and leay all need 1 indexing postbyte, and bne needs an offset, so you’ve got 9 bytes of code there.
But you don’t quite meet the memcpy spec. Your code doesn’t work if count is 0, so you’ll need an extra leay 0,y and beq ret. Also you don’t return the original dst. Adding stu ,--s at the start and ldd ,s++ (or some other 2 byte register) before the rts will fix that, for a total of 17 bytes of code.
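The byte counts in this comparison are easy to lose track of, so here is the arithmetic spelled out as a small tally. The per-instruction sizes assume the short encodings described above (one opcode byte plus one indexing postbyte, 8 bit branch offsets, one byte for `rts`); the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct insn { const char *text; int bytes; };

/* The loop as originally posted. */
static const struct insn original[] = {
    { "lda ,x+",    2 },    /* opcode + postbyte */
    { "sta ,u+",    2 },
    { "leay -1,y",  2 },
    { "bne memcpy", 2 },    /* opcode + 8 bit offset */
    { "rts",        1 },
};

/* Extra instructions needed to meet the memcpy() spec. */
static const struct insn fixes[] = {
    { "stu ,--s",  2 },     /* save the original dst */
    { "leay 0,y",  2 },     /* test count for zero */
    { "beq ret",   2 },
    { "ldd ,s++",  2 },     /* reload dst as the return value */
};

static int total(const struct insn *v, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i].bytes;
    return sum;
}

int original_bytes(void)
{
    return total(original, sizeof original / sizeof original[0]);
}

int corrected_bytes(void)
{
    return original_bytes() + total(fixes, sizeof fixes / sizeof fixes[0]);
}
```

Here `original_bytes()` comes to 9 and `corrected_bytes()` to 17, matching the figures quoted.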
ARMv7 needs 22 bytes for the same function, and RISC-V RV32IC needs 24 bytes.
The RISC-V Zcb extension found in e.g. the RP2350 (Pi Pico 2) and the RVA23 profile allows 20 bytes by giving 2-byte instructions for lbu and sb with 0 offset:
All instructions are 2 bytes except the bgtu a2,a1,loop which is 4 bytes.
So that’s bigger than the corrected 6809 function, but not much! Just 3 bytes. One of those comes from ret being just one byte. The others essentially come from the autoincrement addressing modes.
Nice, and ok for memcpy() as the order is not defined.
But I was thinking of Cortex-M0 as the small area CPU that competes against 8 bit cores, and it doesn’t have cbz [1], and the ldrb will, I think, nuke the flags from the sub? Or no ... I haven’t grovelled around with 32 bit Arm for a while; I used to do a lot of asm in ARM7TDMI days.
This is a case where the indexed addressing that RISC-V currently lacks does make a difference, both in reducing the number of things that need to be incremented/decremented, and in leaving dst untouched.
But it certainly reinforces the point that you don’t need an 8 bit CPU with byte-oriented instructions to get small code.
[1] But a newer version of it could, if desired, as it’s a 2-byte instruction.
If my ancient ARM knowledge is not deceiving me, I think the only normal instructions that affect the flags are comparison instructions and arithmetic instructions with the S bit set. (Or loading flags from memory.) Controlling when the flags are set allows neat predicate chains in things like if statements, e.g. (in 32 bit ARM):
if (a != b && a != c)
stuff;
cmp a, b
cmpne a, c
beq else
stuff
else:
I haven’t played around with thumb2 IT instructions but they seem quite fun. On the other hand the breakeven point for predication vs branches is quite tight so there isn’t much incentive to get really silly with them :-)
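To see why the `cmp` / `cmpne` / `beq` chain implements the `&&`, here is a small C model of it that tracks only the Z flag. It is a sketch for checking the logic, not a description of any real emulator; `stuff_runs` is a made-up name that reports whether control would reach `stuff`.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of: cmp a, b ; cmpne a, c ; beq else. Only the Z flag matters. */
bool stuff_runs(int a, int b, int c)
{
    bool z = (a == b);  /* cmp a, b: Z is set when the operands are equal */
    if (!z)             /* cmpne only executes when Z is clear */
        z = (a == c);   /* cmpne a, c */
    return !z;          /* beq else: stuff is skipped when Z is set */
}

/* Exhaustive comparison against the C condition over a small range. */
void stuff_runs_check(void)
{
    for (int a = -2; a <= 2; a++)
        for (int b = -2; b <= 2; b++)
            for (int c = -2; c <= 2; c++)
                assert(stuff_runs(a, b, c) == (a != b && a != c));
}
```

If the first compare finds `a == b`, the second compare is skipped and Z stays set, so the branch to `else` is taken without ever looking at `c`, which is exactly the short-circuit behaviour of `&&`.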
Of course, replacing the cbz with cmp r2, #0; beq done only costs 2 more bytes so it’s still not too bad.
Once again, I’m looking at the small cores that compete with 8 bit, so Thumb1 / ARMv6-M. You don’t get a choice about an “S” bit on those. Instructions either set flags or don’t.
But on checking, you’re right. MOV sets flags but loads including pop don’t set flags so yeah you’re ok with doing the SUB up front.
Unfortunately I couldn’t find how to get gcc to generate your code. I can get the same number of bytes (modulo the cmp #0 thing) with -Os but it’s slower because it unconditionally branches back to the start of the function.
I know the S suffix doesn’t correspond to a bit in the machine code for 16 bit instructions, but a good proportion of the ALU ops still come in both S and non-S versions (including MOV) so it feels pretty familiar to me. Which was of course the point of Thumb :-)
That’s 38 bytes of code, so quite a lot worse than any Arm or RISC-V. As expected it uses xl for the byte being copied, uses z only for the current updated value of dst, and uses y both for src and to compare src to limit – which it can only do by destructively subtracting them.
We can tune the C source a bit for this compiler, but not all that much … still 32 bytes…
Hand coding asm could improve this a bit, but not to the gcc results for Arm and RISC-V.
While we’ve got SDCC let’s try its code generation for some other 8 bit microprocessors … I won’t show 6502 (76 bytes), 8051 (81 bytes), hc08 (80 bytes), or stm8 (39 bytes), but here is z80 at 32 bytes (same as f8):
00000000                    _mcpy::
00000000 DD E5        [15]      push ix
00000002 DD 21 00 00  [14]      ld ix,#0
00000006 DD 39        [15]      add ix,sp
                            ;memcpy.c:2: char *ret = dst;
00000008 E5           [11]      push hl
                            ;memcpy.c:3: while (sz) {
00000009 DD 4E 04     [19]      ld c, 4 (ix)
0000000C DD 46 05     [19]      ld b, 5 (ix)
0000000F                    00101$:
0000000F 78           [ 4]      ld a, b
00000010 B1           [ 4]      or a, c
00000011 28 07        [12]      jr Z, 00103$
                            ;memcpy.c:4: *dst++ = *src++;
00000013 1A           [ 7]      ld a, (de)
00000014 13           [ 6]      inc de
00000015 77           [ 7]      ld (hl), a
00000016 23           [ 6]      inc hl
                            ;memcpy.c:5: --sz;
00000017 0B           [ 6]      dec bc
00000018 18 F5        [12]      jr 00101$
0000001A                    00103$:
                            ;memcpy.c:7: return ret;
0000001A D1           [10]      pop de
                            ;memcpy.c:8: }
0000001B DD E1        [14]      pop ix
0000001D E1           [10]      pop hl
0000001E F1           [10]      pop af
0000001F E9           [ 4]      jp (hl)
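The C statements embedded as `;memcpy.c:N` comments in the listing let the tuned source be pieced together roughly as follows. The function signature is not shown in the listing, so the parameter types here are an assumption; the name `mcpy` is taken from the `_mcpy` label.

```c
#include <assert.h>
#include <string.h>

/* Signature assumed; body taken from the ;memcpy.c:N comments in the
 * z80 listing (lines 2 to 8 of memcpy.c). */
char *mcpy(char *dst, const char *src, unsigned sz)
{
    char *ret = dst;
    while (sz) {
        *dst++ = *src++;
        --sz;
    }
    return ret;
}

/* Small self-check for the reconstruction. */
void mcpy_check(void)
{
    const char in[] = "z80 vs f8";
    char out[sizeof in] = { 0 };

    assert(mcpy(out, in, sizeof in) == out);
    assert(memcmp(out, in, sizeof in) == 0);
}
```

Counting down `sz` instead of comparing `src` against a `limit` pointer is what lets the compiler keep the count in one register pair (`bc` in the z80 output) and test it with a cheap `ld a, b` / `or a, c`.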
Ah yes, good point. PC is the first register pushed and the last popped, so that will work. In contrast to ARM where LR is the last register pushed and PC the first popped, so you have to do them all at once; you can’t push extras later, after the LR/PC (or else you’d have to do multiple pops).
ARM’s LDM / STM instructions always map higher register numbers to higher addresses. When you STMFD the LR is pushed first to the highest address, and when you LDMFD the PC is popped last.
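The ordering rule can be illustrated with a toy model of a full-descending stack. This is a simplified sketch, not an ARM emulator: memory is an array of words, `r[SP]` is a word index rather than a byte address, and register lists that include SP itself are not handled. All names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { R4 = 4, SP = 13, LR = 14, PC = 15 };

typedef struct {
    uint32_t r[16];     /* r[SP] is a word index into mem */
    uint32_t mem[64];
} cpu_t;

/* STMFD sp!, {mask}: lower-numbered registers go to lower addresses. */
void stmfd(cpu_t *c, uint16_t mask)
{
    uint32_t count = 0;
    for (int i = 0; i < 16; i++)
        count += (mask >> i) & 1u;

    c->r[SP] -= count;              /* make room first (descending) */
    uint32_t addr = c->r[SP];
    for (int i = 0; i < 16; i++)
        if ((mask >> i) & 1u)
            c->mem[addr++] = c->r[i];
}

/* LDMFD sp!, {mask}: the same register-to-address mapping, in reverse. */
void ldmfd(cpu_t *c, uint16_t mask)
{
    uint32_t addr = c->r[SP];
    for (int i = 0; i < 16; i++)
        if ((mask >> i) & 1u)
            c->r[i] = c->mem[addr++];
    c->r[SP] = addr;
}

void ldm_stm_check(void)
{
    cpu_t c = { 0 };
    c.r[SP] = 32;
    c.r[R4] = 0x44;
    c.r[LR] = 0xEE;

    stmfd(&c, (1u << R4) | (1u << LR));     /* push {r4, lr} */
    assert(c.r[SP] == 30);
    assert(c.mem[30] == 0x44);              /* r4 at the lower address */
    assert(c.mem[31] == 0xEE);              /* lr at the highest address */

    c.r[R4] = 0;
    ldmfd(&c, (1u << R4) | (1u << PC));     /* pop {r4, pc} */
    assert(c.r[R4] == 0x44);
    assert(c.r[PC] == 0xEE);                /* saved lr lands in pc */
    assert(c.r[SP] == 32);
}
```

Because the mapping depends only on register numbers, a `push {r4, lr}` followed by `pop {r4, pc}` returns correctly: LR and PC are both the highest-numbered register in their lists, so they share the same stack slot.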
Oh, thanks for that! I had an ARM7TDMI reference handy and could not find a statement on the order in memory, and have never had to care as I haven’t mixed mismatched push/pop. I asked ChatGPT and it led me astray :-(
Good to know if I ever need to use TeX again, which seems unlikely :-) It postdates my time at university, and asciidoc/markdown and friends seem to do the job for the things I touch these days.
I was using the 32 bit Eclipse MV – still with just 4 GPRs – from late 1984 until the early 90s. Mostly in PL/I, but from time to time reading or writing a little assembly language. They were a strong competitor to VAX despite the minimalist ISA, just as NOVA had been to PDP-11.
In the mid-00s, a buddy of mine owned some colo facilities and was giving me a tour of one of them, and in the middle of one of the good size cages was a fair sized DG Eclipse install, a couple of hundred sq ft, complete with 9-tracks, a row of washing machine disk drives and a couple of Dasher terminals[1]. Buddy said it was a “financial client” and it ran a mission critical app that they’d been trying to decom for decades but had consistently failed, and when they shut down the DC it was living in, the other corp DCs wouldn’t take it on, so it ended up there. No idea when they finally turned it off.
[1] Sorry…I don’t know the DGs well enough to know what all of the kit was by sight. I was a DEC guy :-).
I used VAX at university (only 780 and 750 in those days), and DG when I got into the real world. They were pretty popular. They claimed 2x price/performance over VAX, and I wouldn’t dispute that.
My first task when I left university and joined a stockbroking / FX / bonds / M&A company as their first in-house programmer was to choose a compiler (for like $20k or so I think) for their 2 MB (4?) MV10000. Previously they’d only been running the off-the-shelf (or at least vendor-customised) MOCOM financial software which was written in interpreted (bytecode) ICOBOL.
My choices were COBOL, FORTRAN, or PL/I. Those were the days.
I think Thumb 2 is 12 bytes?
Unfortunately I couldn’t find how to get gcc to generate your code. I can get the same number of bytes (modulo the cmp #0 thing) with -Os but it’s slower because it unconditionally branches back to the start of the function.
https://godbolt.org/z/43vTzqbeT
If I reduce the optimisation level then it doesn’t realise the load and store don’t affect the flags and does an extra compare:
https://godbolt.org/z/Gh6qWa835
Clang is worse and de-optimises it to use three subs and a push and pop for a stack frame at all opt levels.
Maybe I’ll install the latest SDCC and see what it generates for this f8 arch from C.
I was mostly using ARM’s 16 bit instruction set quick reference card
So I downloaded the latest SDCC and compiled the C code using --opt-code-size
Here’s a reasonably optimal hand-written 6502 memcpy, at 43 bytes
Fair enough on my code, but you can still save a byte by doing puls d,pc at the end instead of ldd ,s++ ; rts.
https://developer.arm.com/documentation/dui0552/a/the-cortex-m3-instruction-set/memory-access-instructions/ldm-and-stm
(ARM seems to have a separate copy of their instruction set reference for each core, and lots of them don’t explain the LDM / STM register ordering!)
It also made a pretty fair Forth processor with the dual stacks. If that’s your particular kink.
Or use Tectonic which is a single executable and will fetch all dependencies on-demand.
Ew. I thought new, extremely register-starved ISAs (even 8-bit ones) became passé the minute AVR showed up.
This is basically DG NOVA
Great call.
Oh, just found an old reference to MOCOM:
https://www.afr.com/politics/mocom-in-london-19911007-k4mmu
https://www.icobol.com/products/icobol.shtml
A bit of an unfortunate name since the Fairchild F8 already exists. But I see someone in the audience pointed this out as well.