Undocumented opcodes? Please call them bugs. Anything outside the documented ISA should be an illegal instruction exception. If it’s not documented and it doesn’t cause an illegal instruction exception, it is a bug.
This kind of thing is a big part of why x86 and its derivatives make me sick. It’s good that RISC-V is taking off.
To be fair, these aren’t special to x86: the undocumented/illegal instructions in NMOS 6510 CPUs are (relatively) well-known, and they aren’t alone in this, at all.
And RISC-V doesn’t seem to be free from this either: while anyone can implement the spec freely, and some implementations are also freely available, it’s still incredibly hard to figure out what the actual hardware could be doing.
I strongly bet most RISC-V implementations will end up being undocumented and put into deep embedded microcontroller scenarios. That’s where the money is (including the money that doesn’t want to give ARM money), and where almost all RISC-V interest is coming from.
Yes, these bugs aren’t limited to x86, but x86 is a currently widely deployed architecture that suffers from this problem. x86 is extremely complex and full of historical baggage.
RISC-V implementations can indeed have undocumented instructions, but due to the extensible instruction space, this is less likely.
I also do honestly expect most hardware to use the royalty-free OSHW cores, as they are very good. These are unlikely to have undocumented instructions, due to their open nature.
Out of control or supply/demand? If no one’s making them anymore, $100 for a 30 yr old piece of electronics in good working condition seems totally plausible.
It still doesn’t strike me as a completely bonkers price for not just classic but iconic computing gear, even if it needs some repair.
Once the domain of basement dwellers, retro computing and retro gaming really have gone mainstream over the last decade. It wasn’t all that long ago that you could pick up a dusty but working Apple II or classic Mac at a garage sale for $5, or free as long as you agreed to also take an armful of toddler clothes on your way out. Now, most of the surviving ones are in collections, and you’re not going to get a working system with all of its peripherals for under the cost of a brand-new computer.
(I have direct experience with the retro gaming craze after having recently sold off my SNES games that I bought new when I was a teen. I never dreamed several of those games would be worth hundreds of bucks apiece on eBay!)
Agreed. In my view, the price of C64 equipment reflects the availability of working machines and the community of buyers. Compare recent C64 auctions with recent Apple2 auctions on eBay. I find that the prices of Apple products sold in North America are consistently higher than any other “retro” brand.
I feel like retro gaming and retro computing are now their own style. Bought purely for the aesthetics of the period and false nostalgia. Like steampunk aficionados buying up old electronic analog meters and brass plumbing fixtures for several times their actual value on eBay, they don’t care if it actually works as long as it looks cool.
Tbf there are still people buying them who just want to hack on them, without the nostalgia filter, such as me. (I’m much younger than the C64, so I can’t exactly be doing it for nostalgia reasons.)
I am sure that this is true for some folks but there are many, many people who are looking for working gear. Also, I see nothing wrong with buying a broken computer because you enjoy the aesthetics. It’s better than seeing these materials tossed in the trash.
As the first figure under “Finding the inverse” is a relatively simple closed-loop system, wouldn’t a typical control theory technique have worked here, without having to manually mess with that one image?
(EDIT: yes, sorry, I’m too lazy to properly work something out, I know.)
14Kb for a 32 bit Windows executable that uses no C library functions and doesn’t need to parse any arguments? This is a real peeve of mine - 90% of the code in that program isn’t doing anything useful.
I added this to the end:
VOID __cdecl mainCRTStartup() { ExitProcess(main()); }
And compiled without the CRT (using Visual Studio 2008 compiler):
cl /Os sss.c /nodefaultlib kernel32.lib user32.lib
The resulting binary is 2048 bytes. Most of that is overhead too - it contains 134 bytes of code, and 210 bytes of read only data, which is mainly to specify DLL imports. That makes 344 bytes of 2048 that’s related to the program. This happens because there’s 1Kb of executable headers, and the two sections are aligned at 512 bytes, so the result is 2048.
Normally I develop for embedded Linux, so 14 kb is a lot there. But for Windows, it hadn’t occurred to me that this is large. Which is to say, I never do Windows development, so the latter part of your comment doesn’t tell me anything. I understand that the size can be smaller, but why then is it so large by default from gcc?
It’s large because main() isn’t the entrypoint. The entrypoint is a hidden function, mainCRTStartup, which calls main(). That function does things like parse the command line string from GetCommandLine() into argc/argv. But it also ends up implementing a global exception handler, piles of initialization for CRT provided global variables, TLS initialization, NLS support, initializing synchronization for multithreaded access to CRT state you’re not using, etc. Those features in turn end up needing DLL imports of their own, so your program has more functions to link against. And, those features will end up needing memory.
I don’t think this is really any different on Linux. CRT overhead exists, although it obviously varies depending on configuration. Since I’m using VS2008 for these tests, I can use its CRT as a DLL and I get a 5.5Kb binary, and statically linking I get a 36.5Kb binary. The latter is interesting just because it contains all the goop that the CRT is doing, making its true cost very visible. For example, this binary has 6140 bytes of writable global variables (of which your program has none), 6862 bytes of constant data, and it imports both VirtualAlloc and HeapAlloc so it’s going to allocate more memory on top of that. Heck, it imports 57 functions from kernel32.dll, and those function names are going to be part of the 6862 bytes of constant data. Part of those are LoadLibrary/GetProcAddress, so it can load more than just what the import table specifies; grovelling through the binary shows it can load .NET (mscoree.dll) on demand.
After writing that first comment, I was thinking a bit more about memory. One other non-obvious thing is that a console program on Windows is really a superset of a GUI program, because it needs to keep a corresponding conhost process, which this program makes invisible, but it’s still there. A GUI program doesn’t need that, which also eliminates two functions from the import table and some code to invoke them. After making that change (/subsystem:windows), it’s now 100 bytes of code and 138 bytes of data, with exactly one import from kernel32 (Sleep) and one from user32 (SendInput.) This program ends up with a private working set of 344 Kb on my system (although that value is essentially all kernel32/user32 so it’s heavily OS version dependent.)
Also, with all of the random CRT functions removed, it becomes possible to determine how much stack space it needs and shrink the stack. On my system, I can run this program with a 4Kb stack, which is insane, but the result is a commit of 528Kb.
Thank you for the comprehensive explanation. I’m not sure if I can replicate it just yet, but I’m going to experiment a bit more on Windows.
Aside from this, defining WIN32_LEAN_AND_MEAN, WIN32_EXTRA_LEAN, VC_LEANMEAN and VC_EXTRALEAN might help, and if you want to go completely overkill, you can always use an executable packer.
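For example, a minimal sketch of where those defines have to go (before windows.h is pulled in; note they mostly trim what the headers declare, so the effect on the binary itself may be modest):

/* must come before the first #include <windows.h> */
#define WIN32_LEAN_AND_MEAN
#define WIN32_EXTRA_LEAN
#define VC_LEANMEAN
#define VC_EXTRALEAN
#include <windows.h>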
Hm, I’d expect there to be plenty of resources to read already (not from the EU, of course).
In my experience, having a beefy enough GPU to not have to wait ages to achieve some results seems to be more difficult.
@pcy@icosahedron.website, demoscene and related stuff (which often ends up being reverse-engineering)
Next on lobste.rs: beating C with a custom massively parallel word-counting FPGA. There is no instruction set. It just dumps three integers onto the bus after scanning all of memory until it gets an address fault.
Just pulled it out of my arse; I haven’t done Verilog in a long time, though (and I didn’t do much of it in the first place), plus I don’t have any FPGA or anything to test it on (and I’m too lazy to set up a simulator), so it most likely contains very bad bugs.
Then IBM’s scientists write their submission atom by atom, with their logo under it, the same way, and forward it to Guinness to grab some meaningless record.
Figuring out how interrupts work on a completely undocumented processor. I got the basic instruction stuff working as of this Friday.
One very minor nitpick: the actual bible of GBA programming is GBATek, to which Tonc itself refers.
Hmm…
stuff.asm:
section .text
global _start
_start:
inc rax ; SYS_write == 1
inc rdi ; stdout == 1
mov bl, al
mov edx, text.end - text
lea rsi, [rel text]
%include "stuff2.asm"
mov al, 60
xor edi, edi
syscall
text:
db "hello world!",10
.end:
stuff2.asm:
mov al, bl
syscall
; and so on...
$ time nasm -felf64 -o stuff.o stuff.asm && cc -static -nostdlib -nostartfiles -o stuff.elf stuff.o
real 0m18.635s
user 0m18.464s
sys 0m0.047s
$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz
(I know, the linking part is not counted in the time invocation, but it happens within one second, so I don’t really care.)
That’s longer than I would’ve expected. How well does GAS do it?
stuff.S:
.text
.global _start
_start:
inc %rax
inc %rdi
movb %al, %bl
movl $(text_end - text), %edx
lea text(%rip), %rsi
#include "stuff2.S"
mov $60, %al
xorl %edi, %edi
syscall
text:
.ascii "hello world!\n"
text_end:
stuff2.S:
mov %bl, %al
syscall
/* and so on... */
$ time cc -c -o stuffG.o stuff.S && cc -static -nostdlib -nostartfiles -o stuffG.elf stuffG.o
real 0m5.276s
user 0m4.876s
sys 0m0.383s
EDIT: the binary is about 4.58 megs in size.
The article that this one references as an example of what you should probably avoid hints at one “trick” that actually is a good idea: avoiding floating point.
Small microcontrollers typically don’t have floating point units. As a result, all floating point operations are done in software. The article mentions the small space savings, but then neglects to mention the real savings: not linking in the floating point routines. Add to this the fact that floating point arithmetic is much more subtle than it looks, and not using floating point values is a win.
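For instance, a hedged sketch of the usual alternative, Q16.16 fixed-point arithmetic (the names and format here are made up for illustration):

#include <stdint.h>

typedef int32_t q16_16;                   /* 16 integer bits, 16 fraction bits */

#define Q16_ONE (1 << 16)                 /* 1.0 in Q16.16 */

static inline q16_16 q16_from_int(int x) { return (q16_16)(x * Q16_ONE); }

static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* widen to 64 bits so the intermediate product doesn't overflow */
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* e.g. q16_mul(3 << 15, 9 << 14) computes 1.5 * 2.25 = 3.375 entirely in
   integer registers, with no soft-float library calls linked in. */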
Another reason is to be able to compare results across machines/setups/… without having to mess with an ‘epsilon’ value etc. If such comparisons aren’t needed, one can simply calculate a hash locally, send that over the network and compare it, instead of sending over all the data.
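A hedged sketch of what that could look like (plain 64-bit FNV-1a over a buffer of exact integer results; the function name is made up):

#include <stdint.h>
#include <stddef.h>

/* Both sides compute this locally over their (integer, hence bit-exact)
   results, and only the 8-byte hash goes over the wire. Hashing raw bytes
   assumes both ends agree on endianness; otherwise serialize first. */
static uint64_t hash_results(const int32_t *res, size_t n)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    const uint8_t *p = (const uint8_t *)res;
    for (size_t i = 0; i < n * sizeof *res; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV-1a prime */
    }
    return h;
}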
Re: figuring out code size: a really helpful tool in this regard (in my experience) is the linker map (-Wl,-Map=outfile.map); it’s very verbose, but also detailed.
If you don’t need the features, -fno-stack-protector -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -fno-exceptions -fno-threadsafe-statics can also yield a few gains. If you’re doing floating point stuff and have an FPU available, make sure it’s actually being used: -marm -mfloat-abi=hard -march=armv7-a+neon+vfpv3, otherwise the compiler will generate softfloat code.
Also, if you can afford it, rewrite the crt* code and reimplement a few stdlib functions that can be made much smaller (eg. the newlib strcmp seems to be made for performance and thus needs to do some unaligned access magic, but it can also be done with a simple ldrb/cmp loop).
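For illustration, a hedged sketch of the size-over-speed variant in C (the byte loads become ldrb on ARM; this is not the newlib code):

#include <stddef.h>

/* Tiny strcmp: one byte at a time, no word-sized/unaligned access tricks.
   Much smaller than an optimized implementation, but slower on long strings. */
int strcmp_small(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}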
(EDIT: markup fix)
Re: figuring out code size: a really helpful tool in this regard (in my experience) is the linker map (-Wl,-Map=outfile.map); it’s very verbose, but also detailed.
Cyril Fougeray wrote a post about the linker map file on Interrupt a few weeks ago FWIW: https://interrupt.memfault.com/blog/get-the-most-out-of-the-linker-map-file
If you don’t need the features, -fno-stack-protector -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -fno-exceptions -fno-threadsafe-statics
In our case most of those are disabled by default since we’re writing C code on an ARM MCU (fno-stack-protector, fomit-frame-pointer, fno-rtti, fno-exceptions, fno-threadsafe-statics). I didn’t think to check whether the unwind tables ended up in my bin file, I’ll look into it! In any case, your suggestions make a lot of sense in many contexts.
If you’re doing floating point stuff and have an FPU available, make sure it’s actually being used: -marm -mfloat-abi=hard -march=armv7-a+neon+vfpv3, otherwise the compiler will generate softfloat code.
I initially had a section about FPU code, but my test code targets the cortex-m0 which does not have an FPU. The next blog post in the series will talk about floating point code.
Also, if you can afford it, rewrite the crt* code and reimplement a few stdlib functions that can be made much smaller (eg. the newlib strcmp seems to be made for performance and thus needs to do some unaligned access magic, but it can also be done with a simple ldrb/cmp loop).
I file this under “desperate measures”. There are smaller implementations of libc than newlib-nano (note we’re not using standard newlib), but they’re much less robust.
In our case most of those are disabled by default since we’re writing C code on an ARM MCU
I’ve also done stuff for an FPU-less ARM chip (good ol’ ARM7), but the toolchain still enabled these by default, so ¯\_(ツ)_/¯.
I file this under “desperate measures”. There are smaller implementations of libc than newlib-nano (note we’re not using standard newlib), but they’re much less robust.
True, but I’ve been exposed to places where things like these (and even worse) were needed.
To be fair, the only special tricks I can see here are ‘falling through’ to the next function and overlapping opcodes to skip over a piece of code, and they’re used quite sparingly. The biggest “trick” is that it simply avoids doing a lot of stuff.
I’ve seen much more impressive stuff, like 3d graphics and image compression. (Source code included in their download archives. Also, the last link is NSFW.)
Nevertheless, it’s still awesome.
I actually think it’s more awesome for not using special tricks. Designing a good set of features is far superior to “tricks” like say jumping to different offsets on the same byte stream to interpret it as different instructions.
Depends on what you’re used to, I guess. Some of my work (well, “work”) requires a healthy^[debated] dose of tricks, so seeing more of those is always enlightening to me.
A couple of commonly used ways to improve the sine computation with power series:
I recently saw a trick that improves accuracy by using two floating point numbers to store the value of pi. The bigger one stores an approximation to pi, and the smaller one stores the error in the approximation. When x is near pi, (x - big) - small is a much more precise approximation than x - big due to the way floating point numbers work.
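Roughly, the idea (a sketch of my reading of that trick; the constants are the standard split of the double closest to pi plus its residual error, in the style of Cody-Waite range reduction):

/* PI_HI is the double nearest to pi; PI_LO approximates pi - PI_HI.
   For x near pi, (x - PI_HI) - PI_LO keeps low-order bits that a plain
   x - PI_HI subtraction would lose to rounding. */
static const double PI_HI = 3.141592653589793116;
static const double PI_LO = 1.2246467991473532e-16;

static double reduce_near_pi(double x)
{
    return (x - PI_HI) - PI_LO;
}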
Ah, found the reference (I think it was posted on lobsters as well): http://mooooo.ooo/chebyshev-sine-approximation/
Chebyshev polynomials (or any other kind of polynomial usable as a base for solving e.g. a least-squares or minimax fit over [-pi/2, pi/2]) are definitely a very good point, but Horner’s method isn’t always ideal for evaluation, as it tends to ignore the fact that the CPU is pipelined: having each term depend on the previous one makes the CPU unable to perform multiple steps in parallel(ish), slowing down the overall computation; see also. (SIMD might also help here.)
Good point. Thanks for the link, it’s a good read :) I’ve always considered pipelining to be a black art and largely ignored it.
Using the scheme explained in https://en.wikipedia.org/wiki/Estrin%27s_scheme speeds up the code from your reference significantly. (This link was posted in the comments, but it deserves more attention!)
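To make the difference concrete, a hedged sketch for a degree-7 polynomial (placeholder coefficients c[0..7], not the actual sine coefficients):

/* Horner: one long dependency chain, each step waits on the previous one. */
static double poly_horner(double x, const double c[8])
{
    double r = c[7];
    for (int i = 6; i >= 0; --i)
        r = r * x + c[i];
    return r;
}

/* Estrin: group terms so several multiply-adds are independent and can
   overlap on a pipelined/superscalar CPU. */
static double poly_estrin(double x, const double c[8])
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[0] + c[1] * x;   /* these four have no mutual dependencies */
    double p23 = c[2] + c[3] * x;
    double p45 = c[4] + c[5] * x;
    double p67 = c[6] + c[7] * x;
    double p03 = p01 + p23 * x2;
    double p47 = p45 + p67 * x2;
    return p03 + p47 * x4;
}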