Threads for pcy

  1. 4

    A couple of commonly used ways to improve the sine computation with power series:

    • Either use Chebyshev polynomials instead of Taylor polynomials, or use trigonometric identities to reduce the evaluation of sines and cosines to sin(x) with -pi/2 < x < pi/2. Taylor polynomials provide a better approximation around x = 0, while Chebyshev polynomials give a uniformly good approximation over the whole interval.
    • Use Horner’s method for evaluating polynomials

    I recently saw a trick that improves accuracy by using two floating point numbers to store the value of pi. The bigger one stores an approximation to pi, and the smaller one stores the error in that approximation. When x is near pi, (x - big) - small is a much more precise approximation of x - pi than x - big is, due to the way floating point rounding works.

    Ah, found the reference (I think it was posted on lobsters as well): http://mooooo.ooo/chebyshev-sine-approximation/
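
    A minimal sketch of that two-constant reduction (the exact constants and the valid range for x are the parts to double-check; this is just to show the shape of the idea):

    /* pi split across two doubles: PI_BIG is the double nearest to pi,
       and PI_SMALL is (roughly) the leftover error pi - PI_BIG */
    static const double PI_BIG   = 3.141592653589793116;
    static const double PI_SMALL = 1.2246467991473532e-16;

    /* shift an argument that is near pi down towards 0 before feeding it
       to a sin() polynomial: for x close enough to pi, x - PI_BIG is
       computed exactly, and subtracting PI_SMALL afterwards restores the
       low-order bits a single-constant subtraction would lose */
    static double reduce_near_pi(double x)
    {
        return (x - PI_BIG) - PI_SMALL;
    }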

    1. 3

      Chebyshev polynomials (or any other kind of polynomial usable as a basis for, e.g., a least-squares or minimax fit over [-pi/2, pi/2]) are definitely a very good point, but Horner’s method isn’t always ideal for evaluation, as it doesn’t take advantage of the fact that the CPU is pipelined: having each term depend on the previous one keeps the CPU from performing multiple steps in parallel(ish), slowing down the overall computation (see also). (SIMD might help here as well.)

      1. 2

        Good point. Thanks for the link, it’s a good read :) I’ve always considered pipelining to be a black art and largely ignored it.

        1. 2

          Using the scheme explained in https://en.wikipedia.org/wiki/Estrin%27s_scheme speeds up the code from your reference significantly. (This link was posted in the comments, but it deserves more attention!)
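
          For the curious, a rough C sketch of the difference between the two evaluation orders (the coefficients c[0..7] are placeholders, not the ones from the article):

          static double horner7(const double c[8], double x)
          {
              double r = c[7];
              for (int i = 6; i >= 0; i--)
                  r = r * x + c[i];   /* every step waits on the previous one */
              return r;
          }

          static double estrin7(const double c[8], double x)
          {
              double x2 = x * x;
              double x4 = x2 * x2;
              /* four independent first-level pairs */
              double p01 = c[0] + c[1] * x;
              double p23 = c[2] + c[3] * x;
              double p45 = c[4] + c[5] * x;
              double p67 = c[6] + c[7] * x;
              /* two independent second-level combinations */
              double p03 = p01 + p23 * x2;
              double p47 = p45 + p67 * x2;
              return p03 + p47 * x4;
          }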

      1. 0

        Undocumented opcodes? Please call them bugs. Anything outside the documented ISA should raise an illegal instruction exception. If it’s not documented and it doesn’t cause an illegal instruction exception, it is a bug.

        This kind of thing is a big part of why x86 and its derivatives make me sick. It’s good that RISC-V is taking off.

        1. 4

          To be fair, these aren’t special to x86 — the undocumented/illegal instructions in the NMOS 6510 are (relatively) well known, and that CPU isn’t alone in this, at all.

          And RISC-V doesn’t seem to be free from this either: while anyone can implement the spec freely, and some implementations are also freely available, it’s still incredibly hard to figure out what the actual hardware could be doing.

          1. 5

            I strongly bet most RISC-V implementations will end up being undocumented and put into deep embedded microcontroller scenarios. That’s where the money is (including the money that doesn’t want to give ARM money), and where almost all RISC-V interest is coming from.

            1. 2

              Yes, these bugs aren’t limited to x86, but x86 is a currently widely deployed architecture that suffers from this problem, and it is extremely complex and full of historical baggage.

              RISC-V implementations can indeed have undocumented instructions, but due to the extensible instruction space, this is less likely.

              I also do honestly expect most hardware to use the royalty-free OSHW cores, as they are very good. These are unlikely to have undocumented instructions, due to their open nature.

          1. 8

            Out of control or supply/demand? If no one’s making them anymore, $100 for a 30-year-old piece of electronics in good working condition seems totally plausible.

            1. 6

              Just a clickbaity article.

              iPod Classics had a much higher price spike for less meaningful reasons.

              1. 5

                That’s $100 “as-is” or broken.

                1. 1

                  It still doesn’t strike me as a completely bonkers price for not just classic but iconic computing gear, even if it needs some repair.

                  Once the domain of basement dwellers, retro computing and retro gaming have really gone mainstream over the last decade. It wasn’t all that long ago that you could pick up a dusty but working Apple II or classic Mac at a garage sale for $5, or free as long as you agreed to also take an armful of toddler clothes on your way out. Now most of the surviving ones are in collections, and you’re not going to get a working system with all of the peripherals for under the cost of a brand-new computer.

                  (I have direct experience with the retro gaming craze after having recently sold off my SNES games that I bought new when I was a teen. I never dreamed several of those games would be worth hundreds of bucks a piece on Ebay!)

                2. 2

                  Agreed. In my view, the price of C64 equipment reflects the availability of working machines and the community of buyers. Compare recent C64 auctions with recent Apple2 auctions on eBay. I find that the prices of Apple products sold in North America are consistently higher than any other “retro” brand.

                  1. 2

                    I feel like retro gaming and retro computing are now their own style. Bought purely for the aesthetics of the period and false nostalgia. Like steampunk aficionados buying up old electronic analog meters and brass plumbing fixtures for several times the actual value on eBay, they don’t care if it actually works as long as it looks cool.

                    1. 2

                      Tbf there are still people buying them who just want to hack on them, without the nostalgia filter, myself included. (I’m much younger than the C64, so I can’t exactly be doing it for nostalgia reasons.)

                      1. 1

                        I am sure that this is true for some folks but there are many, many people who are looking for working gear. Also, I see nothing wrong with buying a broken computer because you enjoy the aesthetics. It’s better than seeing these materials tossed in the trash.

                  1. 1

                    As the first figure under “Finding the inverse” is a relatively simple closed-loop system, wouldn’t a typical control theory technique have worked here, without having to manually mess with that one image?

                    (EDIT: yes, sorry, I’m too lazy to properly work something out, I know.)

                    1. 2

                      EB0000EA31C048FFC8 can be used to distinguish x86, x86_64 and ARM.

                      1. 13

                        14Kb for a 32-bit Windows executable that uses no C library functions and doesn’t need to parse any arguments? This is a real peeve of mine - 90% of the code in that program isn’t doing anything useful.

                        I added this to the end:

                        VOID __cdecl mainCRTStartup() { ExitProcess(main()); }

                        And compiled without the CRT (using Visual Studio 2008 compiler):

                        cl /Os sss.c /nodefaultlib kernel32.lib user32.lib

                        The resulting binary is 2048 bytes. Most of that is overhead too - it contains 134 bytes of code and 210 bytes of read-only data, which is mainly there to specify DLL imports. That makes 344 of the 2048 bytes related to the program. The total comes out to 2048 because there’s 1Kb of executable headers and the two sections are aligned at 512 bytes.

                        1. 2

                          Normally I develop for embedded Linux, so 14 kb is a lot there. But for Windows, it hadn’t occurred to me that this is large. Which is to say, I never do Windows development, so the latter part of your comment doesn’t tell me anything. I understand that the size can be smaller, but why then is it so large by default from gcc?

                          1. 13

                            It’s large because main() isn’t the entrypoint. The entrypoint is a hidden function, mainCRTStartup, which calls main(). That function does things like parse the command line string from GetCommandLine() into argc/argv. But it also ends up implementing a global exception handler, piles of initialization for CRT provided global variables, TLS initialization, NLS support, initializing synchronization for multithreaded access to CRT state you’re not using, etc. Those features in turn end up needing DLL imports of their own, so your program has more functions to link against. And, those features will end up needing memory.

                            I don’t think this is really any different on Linux. CRT overhead exists, although it obviously varies depending on configuration. Since I’m using VS2008 for these tests, I can use its CRT as a DLL and I get a 5.5Kb binary, and statically linking I get a 36.5Kb binary. The latter is interesting just because it contains all the goop that the CRT is doing, making its true cost very visible. For example, this binary has 6140 bytes of writable global variables (of which your program has none), 6862 bytes of constant data, and it imports both VirtualAlloc and HeapAlloc so it’s going to allocate more memory on top of that. Heck, it imports 57 functions from kernel32.dll, and those function names are going to be part of the 6862 bytes of constant data. Part of those are LoadLibrary/GetProcAddress, so it can load more than just what the import table specifies; grovelling through the binary shows it can load .NET (mscoree.dll) on demand.

                            After writing that first comment, I was thinking a bit more about memory. One other non-obvious thing is that a console program on Windows is really a superset of a GUI program, because it needs to keep a corresponding conhost process, which this program makes invisible, but it’s still there. A GUI program doesn’t need that, which also eliminates two functions from the import table and some code to invoke them. After making that change (/subsystem:windows), it’s now 100 bytes of code and 138 bytes of data, with exactly one import from kernel32 (Sleep) and one from user32 (SendInput.) This program ends up with a private working set of 344 Kb on my system (although that value is essentially all kernel32/user32 so it’s heavily OS version dependent.)

                            Also, with all of the random CRT functions removed, it becomes possible to determine how much stack space it needs and shrink the stack. On my system, I can run this program with a 4Kb stack, which is insane, but the result is a commit of 528Kb.
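
                            To make the shape of this concrete, here’s a hypothetical minimal CRT-free GUI-subsystem program along those lines (not the exact program from the article; built the same way as the cl command above, just with /subsystem:windows added):

                            #include <windows.h>

                            /* with /subsystem:windows and no CRT linked in, the MSVC linker's
                               default entry point name is WinMainCRTStartup */
                            void __cdecl WinMainCRTStartup(void)
                            {
                                Sleep(1000);       /* stand-in for the program's real work */
                                ExitProcess(0);
                            }

                            The stack shrink mentioned above would then just be a /stack:reserve,commit override passed to the linker.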

                            1. 1

                              Thank you for the comprehensive explanation. I’m not sure if I can replicate it just yet, but I’m going to experiment a bit more on Windows.

                          2. 1

                            Aside from this, defining WIN32_LEAN_AND_MEAN, WIN32_EXTRA_LEAN, VC_LEANMEAN and VC_EXTRALEAN might help, and if you want to go completely overkill, you can always use an executable packer.

                          1. 1

                            Hm, I’d expect there to be plenty of resources to read already (not from the EU, of course).

                            In my experience, having a beefy enough GPU to not have to wait ages to achieve some results seems to be more difficult.

                            1. 8

                              Honestly,

                              not.

                              1. 1

                                @pcy@icosahedron.website, demoscene and related stuff (which often ends up being reverse-engineering)

                                1. 17

                                  Next on lobste.rs: beating C with a custom massively parallel word-counting FPGA. There is no instruction set. It just dumps three integers onto the bus after scanning all of memory until it gets an address fault.

                                  1. 3

                                    I dare you or the other lobsters here.

                                    1. 3

                                      Well if you insist…

                                      I just pulled it out of my arse; I haven’t done Verilog in a long time (and I didn’t do much of it in the first place), plus I don’t have an FPGA or anything to test it on (and I’m too lazy to set up a simulator), so it most likely contains very bad bugs.

                                      1. 2

                                        I wish I could upvote you more than once. Cool!

                                    2. 2

                                      Then IBM’s scientists write their submission atom by atom, with their logo under it, the same way, and forward it to Guinness to grab some meaningless record.

                                    1. 2

                                      A dumpster-dived ThinkPad E560.

                                      1. 2

                                        okay, that’s cool.

                                      1. 1

                                        I hacked one together with pandoc and makefiles. ¯\_(ツ)_/¯

                                        1. 2

                                          Figuring out how interrupts work on a completely undocumented processor. I got the basic instruction stuff working as of this Friday.

                                          1. 1

                                            One very minor nitpick: the actual bible of GBA programming is GBATek, to which Tonc itself refers.

                                            1. 2

                                              Hmm…

                                              stuff.asm:

                                              section .text
                                              
                                              global _start
                                              _start:
                                              	inc rax ; SYS_write == 1
                                              	inc rdi ; stdout == 1
                                              	mov bl, al
                                              	mov edx, text.end - text
                                              	lea rsi, [rel text]
                                              
                                              %include "stuff2.asm"
                                              
                                              	mov al, 60
                                              	xor edi, edi
                                              	syscall
                                              
                                              text:
                                              	db "hello world!",10
                                              .end:
                                              

                                              stuff2.asm:

                                              	mov al, bl
                                              	syscall
                                              	; and so on...
                                              
                                              $ time nasm -felf64 -o stuff.o stuff.asm && cc -static -nostdlib -nostartfiles -o stuff.elf stuff.o
                                              real	0m18.635s
                                              user	0m18.464s
                                              sys	0m0.047s
                                              $ cat /proc/cpuinfo | grep "model name" | head -1
                                              model name	: Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz
                                              

                                              (I know, the linking part is not counted in the time invocation, but it happens within one second, so I don’t really care.)

                                              That’s longer than I would’ve expected. How well does GAS do it?

                                              stuff.S:

                                              .text
                                              
                                              .global _start
                                              _start:
                                              	inc %rax
                                              	inc %rdi
                                              	movb %al, %bl
                                              	movl $(text_end - text), %edx
                                              	lea text(%rip), %rsi
                                              
                                              #include "stuff2.S"
                                              
                                              	mov $60, %al
                                              	xorl %edi, %edi
                                              	syscall
                                              
                                              text:
                                              	.ascii "hello world!\n"
                                              text_end:
                                              

                                              stuff2.S:

                                              	mov %bl, %al
                                              	syscall
                                              	/* and so on... */
                                              
                                              $ time cc -c -o stuffG.o stuff.S && cc -static -nostdlib -nostartfiles -o stuffG.elf stuffG.o
                                              real	0m5.276s
                                              user	0m4.876s
                                              sys	0m0.383s
                                              

                                              EDIT: the binary is about 4.58 megs in size.

                                              1. 6

                                                The article that this one references as an example of what you should probably avoid hints at one “trick” that actually is a good idea: avoiding floating point.

                                                Small microcontrollers typically don’t have floating point units. As a result, all floating point operations are done in software. The article mentions the small space savings, but then neglects to mention the real savings: not linking in the floating point routines. Add to this the fact that floating point arithmetic is much more subtle than it looks, and not using floating point values is a win.
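
                                                As a tiny illustration of the fixed-point route (the Q16.16 format and the names here are made up for the example):

                                                #include <stdint.h>

                                                typedef int32_t q16_16;   /* 16 integer bits, 16 fractional bits */

                                                /* multiply two Q16.16 values; widen to 64 bits so the
                                                   intermediate product cannot overflow */
                                                static inline q16_16 q16_mul(q16_16 a, q16_16 b)
                                                {
                                                    return (q16_16)(((int64_t)a * b) >> 16);
                                                }

                                                /* e.g. scale a 12-bit ADC reading by 1.25 with no float code;
                                                   (5 << 16) / 4 is 1.25 in Q16.16, folded at compile time */
                                                static inline q16_16 scale_reading(uint16_t adc)
                                                {
                                                    return q16_mul((q16_16)adc << 16, (5 << 16) / 4);
                                                }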

                                                1. 2

                                                  Another reason is to be able to compare results across machines/setups/… without having to mess with an ‘epsilon’ value etc. If detailed per-value comparisons aren’t needed, one can simply calculate a hash locally, send that over the network, and compare hashes instead of sending over all the data.
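
                                                  A minimal sketch of that idea, with FNV-1a standing in for whatever hash you’d actually use (this only works because integer results are bit-identical across machines):

                                                  #include <stddef.h>
                                                  #include <stdint.h>

                                                  /* hash a buffer of (integer) results; each machine sends back
                                                     only these 8 bytes and the comparison is done on the hashes */
                                                  static uint64_t fnv1a64(const void *data, size_t len)
                                                  {
                                                      const uint8_t *p = (const uint8_t *)data;
                                                      uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
                                                      while (len--) {
                                                          h ^= *p++;
                                                          h *= 0x100000001b3ULL;            /* FNV prime */
                                                      }
                                                      return h;
                                                  }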

                                                1. 2

                                                  Re: figuring out code size: a really helpful tool in this regard (in my experience) is the linker map (-Wl,-Map=outfile.map); it’s very, very verbose, but also detailed.

                                                  If you don’t need the features, -fno-stack-protector -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -fno-exceptions -fno-threadsafe-statics can also yield a few gains. If you’re doing floating point stuff and have an FPU available, make sure it’s actually being used: -marm -mfloat-abi=hard -march=armv7-a+neon+vfpv3, otherwise the compiler will generate softfloat code.

                                                  Also, if you can afford it, rewrite the crt* code and reimplement a few stdlib functions that can be made much smaller (eg. the newlib strcmp seems to be made for performance and thus needs to do some unaligned access magic, but it can also be done with a simple ldrb/cmp loop).

                                                  (EDIT: markup fix)
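
                                                  For illustration, a size-over-speed strcmp along those lines (renamed here so it doesn’t collide with the libc symbol):

                                                  /* byte-at-a-time compare, no alignment tricks; on ARM this
                                                     compiles down to a short ldrb/cmp loop */
                                                  int strcmp_small(const char *a, const char *b)
                                                  {
                                                      unsigned char ca, cb;
                                                      do {
                                                          ca = (unsigned char)*a++;
                                                          cb = (unsigned char)*b++;
                                                      } while (ca != 0 && ca == cb);
                                                      return (int)ca - (int)cb;
                                                  }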

                                                  1. 2

                                                    Re: figuring out code size: a really helpful tool in this regard (in my experience) is the linker map (-Wl,-Map=outfile.map); it’s very, very verbose, but also detailed.

                                                    Cyril Fougeray wrote a post about the linker map file on Interrupt a few weeks ago FWIW: https://interrupt.memfault.com/blog/get-the-most-out-of-the-linker-map-file

                                                    If you don’t need the features, -fno-stack-protector -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -fno-exceptions -fno-threadsafe-statics

                                                    In our case most of those are disabled by default since we’re writing C code on an ARM MCU (-fno-stack-protector, -fomit-frame-pointer, -fno-rtti, -fno-exceptions, -fno-threadsafe-statics). I didn’t think to check whether the unwind tables ended up in my bin file, I’ll look into it! In any case, your suggestions make a lot of sense in many contexts.

                                                    If you’re doing floating point stuff and have an FPU available, make sure it’s actually being used: -marm -mfloat-abi=hard -march=armv7-a+neon+vfpv3, otherwise the compiler will generate softfloat code.

                                                    I initially had a section about FPU code, but my test code targets the cortex-m0 which does not have an FPU. The next blog post in the series will talk about floating point code.

                                                    Also, if you can afford it, rewrite the crt* code and reimplement a few stdlib functions that can be made much smaller (eg. the newlib strcmp seems to be made for performance and thus needs to do some unaligned access magic, but it can also be done with a simple ldrb/cmp loop).

                                                    I file this under “desperate measures”. There are smaller implementations of libc than newlib-nano (note we’re not using standard newlib), but they’re much less robust.

                                                    1. 1

                                                      In our case most of those are disabled by default since we’re writing C code on an ARM MCU

                                                      I’ve also done stuff for an FPU-less ARM chip (good ol’ ARM7), but the toolchain still enabled these by default, so ¯\_(ツ)_/¯.

                                                      I file this under “desperate measures”. There are smaller implementations of libc than newlib-nano (note we’re not using standard newlib), but they’re much less robust.

                                                      True, but I’ve been exposed to places where things like these (and even worse) were needed.

                                                  1. 3
                                                    1. Want to make something on an embedded device.
                                                    2. Realize there’s an extra undocumented processor on the SoC, and it’s faster than the main CPU!
                                                    3. Try to offload some work to that processor.
                                                    4. Make a compiler & toolchain for that processor, because there are none, eg. there isn’t a GCC target for it.
                                                    5. Make an assembler for that processor, to get at least some code running before having to port binutils+gcc.
                                                    6. Try to figure out what the instruction set of that processor is, because that’s — guess what — undocumented as well.
                                                    7. Try to turn that processor on in the first place. Did you really expect this to be documented? Well, have I got some bad news for you.
                                                    1. 7

                                                      To be fair, the only special tricks I can see here are ‘falling through’ to the next function and overlapping opcodes to skip over a piece of code, both used quite sparingly. The biggest “trick” is that it simply avoids doing a lot of stuff.

                                                      I’ve seen much more impressive stuff, like 3d graphics and image compression. (Source code included in their download archives. Also, the last link is NSFW.)

                                                      Nevertheless, it’s still awesome.

                                                      1. 1

                                                        I actually think it’s more awesome for not using special tricks. Designing a good set of features is far superior to “tricks” like, say, jumping to different offsets in the same byte stream to interpret it as different instructions.

                                                        1. 2

                                                          Depends on what you’re used to, I guess. Some of my work (well, “work”) requires a healthy^[debated] dose of tricks, so seeing more of those is always enlightening to me.

                                                      1. 3

                                                        If you want to see more, the archive at 16colo.rs contains a lot of artwork, both from the old days and contemporary pieces. (Yes, people are still making these.)

                                                        EDIT: the demoscene is still going as well. See demozoo or pouët.