Threads for munro

  1. 1

    With control-flow futures, two synchronisations (e.g. get(get(nested_future))) are necessary to fetch the result: one waiting for the task that delegates and another for the task that resolves the future. With data-flow futures, a single synchronisation (e.g. get*(nested_future)) is necessary to wait for the resolution of both tasks.

    😳 Sounds like they’re trying to figure out async semantics for some WIP language called Encore [1]. async in this case meaning waiting for some concurrent computation (not IO). Seems kinda cool, though the syntax looks awful. Julia’s syntax looks nicer for this type of thing [2].
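    To make the quoted distinction concrete, here’s a rough sketch in Python’s asyncio, which only has control-flow futures (the function names are made up for illustration): a task that delegates by returning another future forces the caller to synchronise twice, which is exactly what a data-flow get*() would collapse into one step.

```python
import asyncio

async def resolver():
    # The task that actually resolves the value.
    return 42

async def delegator():
    # The task that delegates: it returns a future, so the caller
    # ends up holding a future-of-a-future (control-flow style).
    return asyncio.ensure_future(resolver())

async def main():
    nested = await delegator()  # synchronisation 1: the delegating task
    value = await nested        # synchronisation 2: the resolving task
    return value

assert asyncio.run(main()) == 42
```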

    I haven’t really played with these types of distributed semantics too much, but I’d imagine a majority of my effort would be messing around with data locality & movement.

    I also find the async computation semantics a bit odd, because the concurrent pieces are already encoded in the logic. For example, in [x * 2 | x <- [1..1000]] the x * 2 can already be seen as concurrent, so is it faster to distribute out the x data, compute x * 2 concurrently, and bring the results back? Think of all the fun you can have figuring that out!
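    For what it’s worth, here’s a minimal Python sketch of that trade-off (not from any linked article; the names are mine): shipping x * 2 out to workers only pays off when the per-element work outweighs the cost of moving the data around, and for something this cheap it usually doesn’t.

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    # The "x * 2" piece: each element is independent,
    # so it is already implicitly concurrent.
    return x * 2

data = list(range(1, 1001))

# Sequential version: no data movement at all.
sequential = [double(x) for x in data]

# "Distributed" version: ship x out to workers, compute, gather back.
# For work this cheap, coordination and data movement cost more than
# the computation itself (and CPython's GIL prevents real parallelism
# for CPU-bound work in threads anyway), so this is usually slower.
with ThreadPoolExecutor(max_workers=4) as pool:
    distributed = list(pool.map(double, data, chunksize=100))

assert distributed == sequential
```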

    The other piece sort of hanging out in the back of the room is incremental computation. Great, I wrote the most optimized way to concurrently compute some data—now do it again with just a bit more added. It also seems like a piece of the data locality issue, if later on a mutated copy of data can be computed concurrently, only moving the changes to do that seems more efficient (but maybe not!).

    Looking at stuff like this makes me feel like the gap between where I’d like programming and where we’re at is massive. So FWIW these days I just use a massive machine with 112 vCPUs and lots of ram, so I don’t have to deal with data locality.



    1. 1

      VSCode is amazing and scary at the same time, i’m sure i’ve already been hacked through some extension XD

      1. 1

        just shooting my mouth off, but yea hardware is looking so attractive compared to a $70k/mo cloud bill. the thing that depresses me is that it would mean going back to old school ops practices. all we got out of this cloud era is garbage tools, kubernetes & docker. nobody in their right mind would run kubernetes on their own HW (which is probably by design), it’s terribly designed. fortunately tools like Rust & even Python’s poetry have really started fixing code isolation, I would 100% feel safe running Rust apps on a server without docker (90% safe for Python XD). but man the orchestration… what happens if I need to upgrade the kernel or add a new HDD? etc etc etc

        1. 3

          Hmm, I feel like there’s too much to respond to here. I’ll choose the second/third sentence.

          the thing that depresses me … all we got out of this cloud era …

          I guess if you are asserting you don’t like the current era of tools, you would rewind history back to where you liked it. So if we roughly went (pardon the reduction in parens):

          1. Mainframes (dumb terminals or teletype)
          2. Distributed PCs (as in desktops, desktop as servers)
          3. Bare metal servers (first “servers” based off commodity parts)
          4. Virtual machines (saturating your servers with vmware etc)
          5. Containers (defining your servers for shipping)
          6. Abstracted services / mesh / k8s (your servers are YAML now)

          Then you can rewind time to an approach you like. But then you are asserting that we made a mistake somewhere. I see some people going all the way back to dumb terminals even now, in a way. There’s cloud gaming (screen painting) and rumors of Windows 12 being cloud-only. It’s not all or nothing, but the pendulum between control and flexibility/economies of scale is, I think, fairly pure. You want to centralize for control, but then your costs are very high (surprise! that’s what #1 -> #2 was).

          So if you rewind time to #3 and use ansible, you probably aren’t going to saturate your servers. This wasn’t entirely the point of era #4, but there was some sales pitch at the time: “Don’t ssh in to manage your servers! Use chef/puppet/ansible! No pets!”. So if you rewind past cattle, you have pets, and you’re saying pets are ok for you. And mixed in here are many other things I can’t fit into this layering, like: where does cloud fit in? Cloud sort of forces your hand to use more software definitions. The vendor probably has an API or some kind of definition tool. It’s interesting that this didn’t happen in the same way around #3, because back then you weren’t entering a vendor’s domain to rent their servers, where you are a guest that must conform because they have many customers and you are just one.

          This mainframe-vs-app-mesh debate is very current in the HPC world. I think many have settled on hybrid. This isn’t surprising to me. I personally err on hybrid, always. I guess I’m a hybrid absolutist (I guess I’m an absolutist?). Get your infrastructure to the point where you can make an infrastructure purchase decision in your own colo or the cloud. Have the connectivity/tools/skills/culture/money ready to do either at any time. Mix and match. Of course, there are always caveats/trade-offs.

          Idk how to unpack the bits about poetry, kernel, HDD without writing a ton more.

          1. 2

            There are several forks/flavors of Kubernetes, like minikube, KinD, and k3s, which can be run at home on commodity hardware. I used to run k3s at home, and probably will again soon.

          1. 3

            i played with io_uring through tokio in rust (… a year ago or more?), shit was so fast. i kept thinking something was broken in my benchmarking code. i still don’t believe it tbh

            1. 4

              voice control systems need less delay, I wish they were virtually instant. The pause between speaking & waiting to see if it does what you want just can’t beat how instantaneous the keyboard is for me

              1. 1

                Agreed, especially if it misinterprets what you say. You then have to wait for it to be wrong on top of everything else.

                I once asked a voice assistant to refer to me as Hawk, it replied: Did you say HOK?

              1. 1

                Animations in Julia. It’s so easy to make cool GIFs & and animated graphs

                1. 1

                  love me some VHDL, it looks so lovely. i always wanna do more hardware, and always think reconfigurable computing is a missed opportunity—GOTTA GO FAST 🦔

                  1. 2

                    i love the sentiment, i hate python, but i write a lot of it. so posts like this make me feel like people are working on improving my life UwU

                    1. -4

                      Switch languages. Python is poisoning your mind.

                      1. 4

                        For many of us, Python is an integral part of our career. I have yet to work somewhere that did not have a Django or Flask application.

                        1. 1

                          I’m sorry for your loss.

                    1. 2

                      How do we make sure that the unit vm only executes N instructions before it suspends itself? A pretty lightweight solution that requires very little setup is to insert a check that we haven’t exceeded the number of operations every K ops and after every label (in the unit VM you can only jump to labels, so we catch all loops, branches and sketchy stuff with this).

                      Love the post! Very succinct & easy to read.

                      Why not use interrupts? Seems like CHECK_LIMITS adds a lot of overhead if it’s required after every label — especially the small looping example. I’ve only used interrupts on AVRs (really easy to use!), so this may be a really naive question. But I’d imagine you may also be able to use it to estimate instructions ran :O

                      1. 1

                        A question to ask in return: how do you raise an interrupt (or a signal, for that matter; I will use the terms interchangeably) after a certain number of VM instructions have executed?

                        Assuming you get access to an interrupt, it will probably be raised after some time duration has passed or after the VM host has executed a certain number of native instructions. If I understand correctly, the author wants to check against a maximum of guest instructions executed. It will be difficult to detect that limit except with an explicitly coded check, so I think that CHECK_LIMITS fits well into the chosen solution.

                        A good thing is that most of the time the check will pass, allowing branches to be predicted accurately for the dominating number of VM instructions executed.

                        Depending on the precision required, the limits could be checked less often, e.g. only every 5 VM instructions. Special attention needs to be paid to the basic blocks of the VM code, though; otherwise e.g. a short loop could completely bypass the check if it comes to lie between two checks.
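                        As a rough illustration, here’s a toy Python interpreter loop using the label-check idea (a hypothetical unit-VM shape, not the author’s actual code): since jumps may only target labels, decrementing a fuel counter at every label is enough to catch all loops.

```python
class OutOfFuel(Exception):
    pass

def run(program, fuel):
    """Execute a toy VM program, suspending when the op budget runs out."""
    pc = 0
    acc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "label":
            # CHECK_LIMITS: jumps may only target labels, so every loop
            # iteration passes through at least one of these checks.
            fuel -= 1
            if fuel <= 0:
                raise OutOfFuel
        elif op == "set":
            acc = arg
        elif op == "add":
            acc += arg
        elif op == "jump_nonzero" and acc != 0:
            pc = arg
            continue
        pc += 1
    return acc

# Count down from 5: passes through the label 5 times.
countdown = [
    ("set", 5),
    ("label", None),       # index 1: the loop head
    ("add", -1),
    ("jump_nonzero", 1),   # loop back to the label while acc != 0
]

assert run(countdown, fuel=100) == 0
try:
    run(countdown, fuel=3)   # budget too small: suspends mid-loop
except OutOfFuel:
    pass
```

                        Checking only every K instructions just means decrementing by a larger amount at coarser-grained points, at the cost of overshooting the budget by up to K, plus the short-loop caveat mentioned above.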

                        1. 1

                          If I understand correctly, the author wants to check against a maximum of guest instructions executed.

                          I think instrumenting is the best way to do that, to calculate accurate statistics. But I assumed the author wanted the code to run AS FAST AS POSSIBLE, bc they were saying they wanted to run a lot of sims simultaneously, but then wanted to stop runaway bad code & report on it, so I believe interrupts would be the best way to do that

                          but then again, it’s an interesting technical question, but how does this fit into the game they’re developing XD

                        2. 1

                          I’m actually not completely sure how interrupts work in x86 at all, so I cannot really answer that, but I believe that would require OS context switches, since I don’t think you can have interrupt handlers in userspace. I guess your idea would be to still increase rax and then from time to time run an interrupt that checks how we are doing with the counter, right? I believe that works a lot better on microcontrollers, since the amount of time an execution takes is much better defined (due to not having concurrency and multiprocessing). So, and I may be very wrong here, we would have to run the interrupt every X time (with X being very, very small, somewhat smaller than the smallest possible instruction limit?), so the amount of context switching may be more overhead than the cmp && jl combination of most checks. In AVR there is almost no context-switch cost, so that may be a better solution there.

                          Also checking with perf it seems like the branch predictor works pretty well with this program (<0.5% of branch misses) so most of the time the cost of the check is fairly small since it’s properly pipelined.

                          Let me know if my (very uninformed) answer makes sense, and if someone else with more context can add info, even better!

                          Thanks for the kind words!

                          1. 1

                            I don’t think you can have [hardware] interrupt handlers in userspace

                            That sounds right. After researching a bit, it seems like POSIX offers a way to software-interrupt your program with alarm or setitimer, which I think the OS triggers from underlying hardware interrupts. And then you could pull out virtual time to estimate the cycles. Or see if software profilers are doing something more sophisticated to estimate time.


                            sleep(3) may be implemented using SIGALRM;


                            Intuitively I think this would improve performance quite a bit. But measuring a real workload with & without instrumentation should tell how much performance could be gained.
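                            For reference, a rough POSIX-only Python sketch of that setitimer approach (the 50 ms budget is arbitrary): the handler only sets a flag, and the interpreter loop polls it.

```python
import signal

expired = False

def on_alarm(signum, frame):
    global expired
    expired = True                 # only set a flag: async-signal-safe

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.05)   # one-shot alarm in 50 ms

steps = 0
while not expired:
    steps += 1                     # stand-in for "execute one VM op"

signal.setitimer(signal.ITIMER_REAL, 0)      # cancel any pending timer
```

                            Note the loop still polls a flag each iteration; the timer only replaces counting guest instructions with wall-clock time, so it can’t enforce an exact instruction limit.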

                            the main idea is a Real-time strategy game (RTS)

                            Sounds very cool!

                            1. 1

                              Signals have a lot of troublesome baggage, though. It varies by OS, but it’s often very dangerous to do things in signal handlers because the thread might have been interrupted at any point, like in the middle of a malloc call so the heap isn’t in a valid state — on Mac or iOS you can’t do anything in a handler that might call malloc or free for this reason, which leaves very little you can do!

                              If your handler tries to do something like longjmp and switch contexts, it might leave a dangling locked mutex, causing a deadlock.

                              As far as I know, most interpreted-language thread implementations use periodic checks for preemption, for this reason.

                              1. 2

                                They’re also really slow. They need to find the signal stack for the thread, spill the entire register frame context there, and update a load of thread state. Taking a signal is often tens of thousands of cycles, which is fine for very infrequent things but absolutely not what you’d want on a fast path.

                        1. 2

                          cat flo.txt | grep 'bw=' | awk '{ print $4 }' | sed -e 's/K.*//' -e 's/\.//'

                          An independent 3rd-party professional benchmarking firm was stripping the unit and deleting the decimal point.

                          1. 2


                            Paste this in your JavaScript console if you wanna quickly see all the images! (except the ones that go to HTML pages :P)

                            [...document.querySelectorAll('.comment_text a')]
                                .map(x => x.href)
                                .filter(x => x.match(/png|jpe?g/i))
                                .forEach(x => document.write(`
                                    <a href="${encodeURI(x)}">
                                        <img src="${encodeURI(x)}" style="max-width: 32%; float: left">
                                    </a>`))