1. 28
  1.  

  2. 16

    This is the first time a write a blogpost, let me know what you think :) I always feel like nothing I do is worth writing about but I wanted to share this one. Hopefuly the writing style is not too chaotic!

    1. 5

      I enjoyed the blog post, keep them coming! Not at all too chatoic.

      1. 4

        I really liked the tone of your post. I feel like a lot of blogs try to sound too “affected” (for lack of a better term). Your writing sounded very natural to me, so i really appreciated that.

        I love the game idea you are working on, can’t wait to see what you come up with

      2. 8

        I only skimmed, but:

        xchg rsp, QWORD [rcx]

        Please don’t do this. xchg imposes strong ordering requirements which you don’t want or need here.

        And:

        call next_inst

        next_inst:

        ; move the just save rip into the rip of the _co_data[coroutine_id]

        pop QWORD [rcx + 24]

        This is a pretty silly way to get the IP. You can grab the address in question using a rip-relative LEA. Or make a subroutine; take a look at the way libco does it.

        1. 6

          So I tested it without xchg (basically 3 movs) and it doesn’t give me any difference in apparent performance. (I’ve reduced the instruction limit of each coroutine to trigger more switches and increased the main loop count). Checking https://uops.info it seems that the latency in Zen architectures (my CPU) is lower than Intel architectures so maybe that’s the reason why?

          Checking with perf, with xchg the program has a lot more stalled cycles but it amount to aprox the same amount of clock time.

          ✦ 正 perf stat ./compiled_xcgh
                    8,042.55 msec task-clock:u
                 362,523,120      stalled-cycles-frontend:u        #    1.01% frontend cycles idle     (83.33%)
              25,187,208,178      stalled-cycles-backend:u         #   69.86% backend cycles idle      (83.33%)
          
          ✦ 正 perf stat ./compiled
                    8,555.72 msec task-clock:u
               1,247,901,814      stalled-cycles-frontend:u        #    3.50% frontend cycles idle     (83.32%)
                     354,218      stalled-cycles-backend:u         #    0.00% backend cycles idle      (83.34%)
          

          Any idea why the amount of stalled cycles is not affecting clock time that much?

          1. 1

            I suspect it will depend on the microarchitecture’s cache coherency implementation and the cache state of the affected memory page/line. Basically, if another CPU core also has the page/line in its L1 or L2, expect to pay a large latency penalty. In a single threaded program you won’t come across this of course.

          2. 4

            that’s good to know, I see now that it has an implicit lock, will change it tomorrow since I’m in bed already. I’ll also add the benchmark of xchg vs no xchg to show the impact.

            As I say in the beginning i’m very new with x86 asm so all recommendations are welcome :)

            Thanks!

            1. 4

              xchg imposes strong ordering requirements which you don’t want or need here.

              Are you sure? The only thing in the documentation is that this is atomic, which you get for free with any kind of vaguely modern cache coherency protocol because any store is in the exclusive state (modulo remote stores, which a few Intel microarchitectures do, but which aren’t useful in an exchange, irrespective of what instructions you use for it). Beyond that, the ordering guarantees are the same as the TSO guarantees that x86 has for all loads and stores and should not cause a performance impact. I’d expect a conformant implementation to do:

              1. Acquire cache line in exclusive state.
              2. Allocate rename register.
              3. Load value into rename register.
              4. Store value from source rename register.
              5. Release cache line into dirty state (may happen later).
              6. Release reference to source rename register.

              That’s more or less the same sequence that you’d see with a pair of mov instructions. If you need the result to end up in the source register then your going to be keeping more rename registers live and so the mov version might hurt in tight loops, where rename register pressure is often a killer, unless you explicitly zero the temporary register or have some other store to it before the end of the loop.

              1. 5

                Yes, it has stronger ordering guarantees than a plain write, which is why it’s used for seqcst stores under c/c++11. TSO doesn’t entail fencing; it still allows loads to be reordered.

              2. 2

                My own take on coroutines uses XCHG to swap the stack pointer (32 bit version and 64 bit version). What other sequence do you propose then?

                1. 3

                  You can use any free gpr as a temporary. Again: look at the way libco does it.

              3. 3

                I’ve read that a good rule for inserting yield-checks is to put one before every backward branch — because the only way for a routine to run for an unbounded time is to loop, and looping requires a backwards branch. If you support tail recursion you’d need to put a check before a tail-call too.

                You’re talking about using this in a game with programmable units, so I guess what you’re aiming for is a JIT. Is your plan to generate assembly and feed it through an assembler at runtime? Seems like it’d be awkward to do all that and extract the binary output. Or are you going to collect the machine instructions for every bytecode at build-time and string them together at runtime?

                (There was a post a few days ago about generating progressively more efficient interpreters; I guess you probably saw it?)

                1. 1

                  The backwards branch rule makes sense to me, depending on the instruction limit I probably also need to check for long chain of unconditional/non-branch instructions as well to avoid getting around the limit with a hacky loop unroll (the limits are very very small, so this could potentially happen I guess).

                  The game is programmable, but you program them ahead of time. The idea is you modify your units a little bit, play a match, refine them a little bit, match, and so on. Some ideas I have in mind for the game kind of break this model[1] but for now I’m going for this.

                  I also want to implement a way to expose behaviour variables for a unit that show in the UI, so you could create units that are flexible for many situations by clicking a couple of toggles that would modify the internal state of the unit.

                  Let me know if this is clear.

                  [1] One of the ideas I have in mind is to treat buildings as compilers/linkers, so you could potentially create units from composable modules you make yourself. By using the exposed variables you could potentially create a unit by coding together different presets and using the building code to stitch it / link it together.

                2. 2

                  How do we make sure that the unit vm only executes N instructions before it suspends itself? A pretty lightweight solution that requires very little setup is to insert a check that we haven’t exceeded the number of operations every K ops and after every label (in the unit VM you can only jump to labels, so we catch all loops, branches and sketchy stuff with this).

                  Love the post! very succinct & easy to read

                  Why not use interrupts? Seems like CHECK_LIMITS adds a lot of overhead if it’s required after every label — especially the small looping example. I’ve only used interrupts on AVRs (really easy to use!), so this may be a really naive question. But I’d imagine you may also be able to use it to estimate instructions ran :O

                  1. 1

                    A question to ask in return: How do you raise an interrupt (or a signal, for that matter, I will use the term exchangably) after a certain number of VM instructions have executed?

                    Assuming you get access to an interrupt, it will probably be raised after some time duration has passed or after the VM host has executed a certain number of native instructions. If I understand correctly, the author wants to check against a maximum of guest instructions executed. It will be difficult to find that a condition on that limit, except for an explicitly coded check. Thus think that CHECK_LIMITS fits well into the solution chosen.

                    A good thing is that the most of the times the check will pass, allowing branches to be predicted accurately for the dominating number of VM instructions executed.

                    Depending on the precision required, the limits could be checked less often, e.g. only every 5 VM instructions. Special attention needs to be paid to the basic blocks of the VM code, otherwise e.g. a short loop could completely bypass the check if it is comes to lie between two checks.

                    1. 1

                      I’m actually not completely sure how interrupts work in x86 at all so I cannot really answer that, but I believe that would require OS context switches since I don’t think you can have interrupt handlers in userspace. I guess your idea would be to still increase rax and then from time to time execute an interrupt that checks how we are doing with the counter right? I believe that works a lot better in microcontrollers since the amount of time that an execution takes is a lot better defined (due to not having concurrency and multiprocessing). So, and I may be very wrong here, we would have to run the interrupt every X time (with X being very very small, somewhat smaller than the smaller possible instruction limit?) so the amount of context switching may be more overhead than the cmp && jl combination of most checks. In AVR there is no context switch cost almost so that may be a better solution there.

                      Also checking with perf it seems like the branch predictor works pretty well with this program (<0.5% of branch misses) so most of the time the cost of the check is fairly small since it’s properly pipelined.

                      Let me know if my (very uninformed) answer makes sense, and if someone else with more context can add info even better.!

                      Thanks for the kind words!

                      1. 1

                        I don’t think you can have [hardware] interrupt handlers in userspace

                        That sounds right, after researching a bit, it seems like POSIX offers a way to software interrupt your program with alarm or setitimer, which I think the OS triggers from underlying hardware interrupts. And then you could pull out virtual time to estimate the cycles. Or see if software profilers are doing something more sophisticated to estimate time.

                        https://linux.die.net/man/2/alarm

                        sleep(3) may be implemented using SIGALRM;

                        https://pubs.opengroup.org/onlinepubs/007904875/functions/setitimer.html

                        Intuitively I think this would improve performance quite a bit. But measuring a real workload with & without instrumentation should tell how much performance could be gained.

                        the main idea is a Real-time strategy game (RTS)

                        Sounds very cool!

                        1. 1

                          Signals have a lot of troublesome baggage, though. It varies by OS, but it’s often very dangerous to do things in signal handlers because the thread might have been interrupted at any point, like in the middle of a malloc call so the heap isn’t in a valid state — on Mac or iOS you can’t do anything in a handler that might call malloc or free for this reason, which leaves very little you can do!

                          If your handler tries to do something like longjmp and switch contexts, it might leave a dangling locked mutex, causing a deadlock.

                          As far as I know, most interpreted thread implementations use periodic checks for preemption, for this reason.

                          1. 2

                            They’re also really slow. They need to find the signal stack for the thread, spill the entire register frame context there, and update a load of thread state. Taking a signal is often tens of thousands of cycles, which is fine for very infrequent things but absolutely not what you’d want on a fast path.

                    2. 1

                      Note that you could count downwards instead of upwards for rax, taking advantage of the zero flag (ZF) to avoid the cmp. Combine this with using the 32bit registers to avoid the REX-prefix byte needed when using 64bit registers.

                      mov eax, 15000 ; also zeroes out the upper 32bits of rax
                      ...
                      dec eax        ; 32bit decrement. Sets ZF if eax reaches 0
                      jz yield_label ; branch on ZF=1
                      
                      1. 1

                        that is a cool idea! but then I’d have to add the jz after each instruction instead of adding them sprinkled around like right now, because I’d have to catch the exact time eax is zero, right? I wonder if that would improve things or not. I guess it depends on how tight the loops and jumps are.

                        1. 1

                          Not after each one if you treat the count as signed and use JNE at certain points. That will catch 0 or less than (signed 32 bit quantity goes from -2 billion to +2 billion).

                          1. 1

                            Ah, I didn’t realize the check only happens on loop entry rather than every instruction