I really like stack machines. They are becoming relevant again in the current cheap FPGA revolution. I’ve been recently exploring stack machines in a Haskell code generator setting, where a parameterized CPU core, associated Forth-based machine code, and QuickCheck based test benches can all be hosted in a single Haskell program, which then generates Verilog and SRAM images to run on target. Very much work in progress, but it isn’t as hard or crazy as it sounds when you keep things simple.
Superscalar killed stack machines. Superscalar designs rely on being able to identify dependencies between instructions so that you can execute independent ones in parallel. Dependency analysis in a register machine is quite easy: instructions that don’t use the same registers are independent. It’s really, really hard to do with stack machines. (Almost) Every instruction modifies the top of the stack.
Since then, modern ISA designs have significantly improved encoding density. The one area where stack machines were a big win - program size (and therefore i-cache efficiency) - is largely gone.
That said, with the security problems associated with speculative execution and the programming languages being developed to better take advantage of multicore systems, I do wonder if there’s going to be a place in 10-15 years for many-core in-order processors. Either stack machines or simple multithreaded in-order cores are more efficient than big superscalar out-of-order machines if you can write software for them that keeps the cores busy.
There is such a place: deeply embedded programming, where you would use a tiny microcontroller that sits in the middle of custom gateware instead of a complex ad-hoc controller state machine. The CPU I was talking about is such a thing: it serves a single purpose: controlling a bunch of peripherals. In an FPGA with block SRAM, it still makes sense to keep instruction width and program size small so a program fits in say a single 256 byte block SRAM, and keep the CPU simple so you need less gates to implement it. The Forth machine I mentioned is so stripped down it doesn’t even have a full datapath. The only computation it performs is decrement and branch if zero, so it can do loops, and in this case it made sense to go back to a single stack that can be used for subroutine return addresses and loop counter. For the rest it can read/write registers. Surprising what you can do so little.
Wonderful! I believe my Fomu just found a weekend playdate.