1. 39

    1. 24

      Or they simply are not real-time applications. ¯\_(ツ)_/¯

      David Harel made the following distinction, which has been widely adopted:

      1. transformational systems
      2. interactive systems
      3. reactive systems

      (The term Reactive Programming comes from this third definition, though it has become almost completely mangled in the process).

      UI Applications are interactive systems, which means that their environment can wait for them. And it can. I don’t die if I see a SPOD, and in fact it can be exactly the correct information: the program is busy with a task I just gave it, and it will be responsive again once it is done.

      In many cases, this is vastly superior to alternative behaviours. For example, if I save a document and it takes a while, I would rather not be able to terminate the program while the save is in progress. Or a program remains theoretically active, but grays out all its UI elements. Technically no SPOD, but not useful either. And more difficult to get right, so you more often get issues like the program not reactivating its UI. Just block normally and let the OS figure it out! Don’t be deceptive by claiming you are active when there is little or nothing useful to do.

      1. 2

        Can you please point to the definitions of these three systems, per Harel?

        1. 1

          I first encountered the definitions in the paper describing Lustre.

          This references the following:

          • On the Development of Reactive Systems
          • Real time programming: special purpose or general purpose languages

          I do remember seeing a more compact definition in one of Harel’s other papers, but can’t find it now.

    2. 15

      So all we have to do is avoid file system IO functions from the main thread? Not a big deal. That doesn’t mean UI applications are fundamentally broken.

      I think it depends a bit on the kind of I/O. If opening or saving a document freezes the UI, most of the time users won’t notice. It’s only when you do it routinely that it matters (looking at you, Thunderbird on Windows).

      This type of failure has happened to me multiple times on Linux so I know it’s a problem there. Perhaps Windows and macOS engineers have already considered this issue but I doubt it.

      macOS actually handles this kind of thing in a couple of ways. A lot of applications opt into ‘sudden termination’. This makes them even less realtime. An app in a valid sudden termination state is assumed to have saved all relevant state and will be terminated without any opportunity to recover in low memory conditions. This immediately returns all of its memory to the kernel. The display server keeps a copy of the application’s current window state so that it can pretend that the app is still running. When you switch back, it will be restarted. This is similar to swapping the whole application out and back in but also defragments it and removes all stale objects.

      If this still doesn’t work, it will send SIGSTOP to other applications. You have to explicitly send SIGCONT to them to start them running again. This lets you avoid thrashing and gracefully recover (sometimes by force quitting the run-away process). This is exposed in the UI with the same pop-up you get from command-option-escape, with a ‘resume’ button.
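      The stop/resume mechanism can be seen in miniature with plain POSIX signals (this is just the signal-level sketch, not Apple’s actual memory-pressure logic):

      ```c
      #include <signal.h>
      #include <stdio.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void) {
          pid_t child = fork();
          if (child == 0)
              for (;;) pause();            /* stand-in for a runaway app */

          int status;
          kill(child, SIGSTOP);            /* freeze it, as the OS would */
          waitpid(child, &status, WUNTRACED);
          if (WIFSTOPPED(status)) puts("child stopped");

          kill(child, SIGCONT);            /* the 'resume' button */
          waitpid(child, &status, WCONTINUED);
          if (WIFCONTINUED(status)) puts("child resumed");

          kill(child, SIGKILL);            /* or force quit instead */
          waitpid(child, &status, 0);
          return 0;
      }
      ```

      A stopped process keeps its memory but consumes no CPU, which is exactly what you want while you decide whether to resume or force quit it.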

      Is there a way to fix this? At least on Linux there is the mlock() family of functions that tell the operating system to put and keep the process’s memory pages into RAM.

      Generally, only root is allowed to use these, or unprivileged processes get very low quotas, because locked pages can lead to resource exhaustion.
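      For the curious, here is a minimal sketch of what mlock() use and its quota look like on Linux (the buffer size and messages are mine, not from the comment above):

      ```c
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/resource.h>
      #include <unistd.h>

      int main(void) {
          struct rlimit rl;
          getrlimit(RLIMIT_MEMLOCK, &rl);   /* the quota in question */
          printf("RLIMIT_MEMLOCK soft limit: %lld bytes\n", (long long)rl.rlim_cur);

          long page = sysconf(_SC_PAGESIZE);
          void *buf = NULL;
          if (posix_memalign(&buf, page, page) != 0) return 1;
          memset(buf, 0, page);             /* touch it so it is resident */

          if (mlock(buf, page) == 0) {      /* pinned: no page-out, no fault */
              puts("page locked");
              munlock(buf, page);
          } else {
              perror("mlock");              /* EPERM/ENOMEM once over quota */
          }
          free(buf);
          return 0;
      }
      ```

      Locking a single page usually succeeds; try locking a few hundred megabytes as a normal user and you will run into the quota.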

      And this is really the issue: GUI applications are, at best, soft realtime tasks. It is better for them to occasionally miss their deadlines than it is for them to consume too many system resources. They are intended to gracefully degrade. If a system is low on memory and you can’t get work done, that’s much worse than a system being low on memory and you can get work done more slowly. There are a few exceptions. A game running at 15 FPS is probably worse than a game telling you to quit something else to run it, but PowerPoint lagging a bit in animations is much better than PowerPoint exiting until you quit your video conferencing app.

      1. 11

        They are intended to gracefully degrade

        I wrote this article because too often it seems like these applications degrade very poorly. If you use any run-of-the-mill Windows 10 or macOS computer, you will frequently run into situations where your clicks or button presses cease to have any effect and then you get the dreaded beach ball for minute(s)! It’s not unlikely your parents’ or kids’ computer is in a similar state. The situation has become unacceptable.

        1. 5

          Windows is just a mess, but that’s unavoidable by userspace programs. Third-party drivers do silly amounts of work in interrupt handlers (no other kernel has this problem: on other systems, the only thing you are allowed to do in an ISR is prod a thread to wake up), and antivirus hooks into filesystem events and blocks them for unbounded time.

          Until recently, my personal machine was a 2013 MacBook Pro and I very rarely saw a beachball on it, certainly not for minutes. The only times I saw it at all were under very high load (e.g. background build of LLVM).

    3. 12

      Yep! Welcome to video game programming!

      Or, you know, you could just shrug and say “it’ll usually be fine”. And, you know, it usually will be fine!

      But you better get used to that, ’cause whatever else happens, OS interrupt handlers can take an unbounded amount of time if the OS or driver author is not careful. And that is why a dead NFS server can lock up your entire Linux desktop for a minute at a time. Or a flaky PAM permission. Or an attempt to access a dying hard drive. Or…

    4. 9

      The talk about blocking operations, including virtual memory, reminds me of Midori. From the post Asynchronous Everything:

      Synchronous blocking was flat-out disallowed. This meant that literally everything was asynchronous: all file and network IO, all message passing, and any “synchronization” activities like rendezvousing with other asynchronous work. The resulting system was highly concurrent, responsive to user input, and scaled like the dickens. […] And when I say “no blocking,” I really mean it: Midori did not have demand paging which, in a classical system, means that touching a piece of memory may physically block to perform IO.

      1. 1

        Great reference, was not aware of that. It’s validating to see the same ideas emerging independently. Thanks!

    5. 6

      I suspect this is why on BeOS every graphical application had two threads automatically, an “application” thread and a “display” thread. I think you were expected to do nothing from the UI except pass messages to the application thread. But I was a very poor programmer when I read about all this, so I could be mistaken. I would expect that if you had a realtime guarantee on such message passes, you would get a realtime application as a result—although this assumes certain things about your message queue which I don’t remember anything about at all.
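      For illustration, here is a rough sketch of that two-thread split using POSIX threads; the message codes and function names are invented, not BeOS’s actual BLooper/BMessage API:

      ```c
      /* "Display" thread only ever posts messages; the "application"
       * thread dequeues them and does the actual (possibly slow) work. */
      #include <pthread.h>
      #include <stdio.h>

      #define QUEUE_LEN 16

      typedef struct { int what; } Message;   /* hypothetical message code */

      static Message queue[QUEUE_LEN];
      static int head, tail;
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

      static void post_message(Message m) {   /* display-thread side */
          pthread_mutex_lock(&lock);
          queue[tail++ % QUEUE_LEN] = m;      /* real code must handle overflow */
          pthread_cond_signal(&nonempty);
          pthread_mutex_unlock(&lock);
      }

      static Message wait_message(void) {     /* application-thread side */
          pthread_mutex_lock(&lock);
          while (head == tail)
              pthread_cond_wait(&nonempty, &lock);
          Message m = queue[head++ % QUEUE_LEN];
          pthread_mutex_unlock(&lock);
          return m;
      }

      static void *app_thread(void *arg) {
          (void)arg;
          for (;;) {
              Message m = wait_message();
              if (m.what == 0) break;         /* quit message */
              printf("handling message %d\n", m.what);  /* slow work lives here */
          }
          return NULL;
      }

      int main(void) {
          pthread_t app;
          pthread_create(&app, NULL, app_thread, NULL);
          post_message((Message){42});        /* UI side only posts and returns */
          post_message((Message){0});
          pthread_join(app, NULL);
          return 0;
      }
      ```

      Since post_message only takes a mutex briefly and never blocks on the work itself, the display side stays responsive as long as the queue has room, which matches the guarantee described above.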

    6. 5

      Strongly agree with this article. The computers I’ve used for basically all of my life have suffered from this class of problem - it makes sense that a lower-level cause of this is a fundamentally incorrect architecture that every mainstream PC and smartphone OS has imposed on software written for them.

      What if there was a way to encode in the type system of a program whether or not a function would block on an external event, and then have the OS or commonly-used standard libraries force UI handler functions to only use this class of functions? Making this happen in enough environments where it meaningfully affects my day to day computer use is certainly a large ecosystem-level change that requires wide awareness of the problem. A change on par with the adoption of Rust (and note that even Rust’s superior type system compared to the mainstream status quo does not have the flexibility to mandate meaningful blockingness-checking).

      1. 4

        In the CHERIoT RTOS, I’ve introduced a timeout structure that contains the number of ticks that you are allowed to yield for and the number that you have yielded already. Anything that yields (waiting for interrupts or other threads) takes a pointer to this structure and forwards it through any other yielding calls to gradually accumulate the amount of waiting time that it’s allowed in total. It’s probably the right thing to do for control systems but I’d absolutely hate to do it in application code, for two reasons:

        • What do I set the timeout to? If I want interactive frame rates for a GUI, probably 100ms, but most of the time it’s fine for a GUI to drop a few frames. If it’s playing video, the timeout needs to be less.
        • What do my error handling paths look like if things don’t complete on time?

        The CHERIoT model also doesn’t track time spent actually doing work, which you probably would want to do, but then you need to handle exiting from compute loops in the middle because you’ve run out of allowed time. If you’re building real real-time systems, you always use bounded loops and wait for the next call to do the next bit of work (for example, our allocator always pops a small number of objects from quarantine on malloc calls, so that it eventually catches up). You wouldn’t want to do that in a GUI environment.
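        A stripped-down illustration of the timeout-forwarding idea (a simplification written for this comment, not the actual CHERIoT structure or API):

        ```c
        #include <stdbool.h>
        #include <stdio.h>

        typedef struct {
            unsigned remaining;   /* ticks we are still allowed to yield for */
            unsigned elapsed;     /* ticks accumulated across nested waits   */
        } Timeout;

        /* A leaf operation that would block for `cost` ticks. */
        static bool wait_for_event(Timeout *t, unsigned cost) {
            if (cost > t->remaining)
                return false;             /* budget exhausted: fail early */
            t->remaining -= cost;
            t->elapsed += cost;
            return true;
        }

        /* A higher-level call forwards the same Timeout through both waits,
         * so one budget is shared across the whole call tree. */
        static bool read_packet(Timeout *t) {
            return wait_for_event(t, 3)   /* wait for the interrupt */
                && wait_for_event(t, 2);  /* wait for another thread */
        }

        int main(void) {
            Timeout t = { .remaining = 10, .elapsed = 0 };
            printf("ok=%d elapsed=%u remaining=%u\n",
                   read_packet(&t), t.elapsed, t.remaining);

            Timeout tight = { .remaining = 4, .elapsed = 0 };  /* too small */
            printf("ok=%d elapsed=%u\n", read_packet(&tight), tight.elapsed);
            return 0;
        }
        ```

        The second call fails partway through, which is exactly the awkward case: the caller now has to decide what to do with a half-completed operation.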

      2. 1

        Simpler to make all system calls non-blocking, and force application/toolkit developers to handle blocking explicitly. That way when they get it wrong a kernel programmer can point somewhere in their code and say “this should be doing…”

        1. 6

          That’s what seL4 does (the kernel has provable response times because it’s always allowed to spuriously fail). The end result is an environment that everyone hates using, and most people just wrap calls in loops that retry until success, punting the blocking behaviour into their wrappers, where it’s less efficient.
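          The retry-wrapper shape looks roughly like this (a hand-written sketch using a non-blocking pipe, not seL4 code):

          ```c
          #include <errno.h>
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          /* Wrap a non-blocking fd in a loop that retries until data arrives.
           * This is the inefficient "punt the blocking into userspace" shape. */
          static ssize_t read_retrying(int fd, void *buf, size_t len) {
              for (;;) {
                  ssize_t n = read(fd, buf, len);
                  if (n >= 0 || errno != EAGAIN)
                      return n;
                  usleep(1000);   /* poll; a real kernel would just sleep us */
              }
          }

          int main(void) {
              int fds[2];
              pipe(fds);
              fcntl(fds[0], F_SETFL, O_NONBLOCK);

              write(fds[1], "hi", 2);     /* data is ready, so one try wins */
              char buf[8] = {0};
              ssize_t n = read_retrying(fds[0], buf, sizeof buf - 1);
              printf("read %zd bytes: %s\n", n, buf);
              return 0;
          }
          ```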

          1. 1

            Yeah, I very carefully said “simpler” instead of “easier”. :P

      3. 1

        What if there was a way to encode in the type system of a program whether or not a function would block on an external event,

          I like this idea. Ideally this is something that the compiler would check for you. I can even imagine there being a function specification that allows a static analyzer to compute worst-case runtimes for you. It would have to use some sort of virtual time unit, since the runtime of code sequences varies from CPU to CPU. I read a comment on HN that said the problem is much deeper than even the programming interfaces provided by our OSes: CPU performance itself varies widely depending on whether a memory access can be satisfied by the cache or whether the CPU mispredicts a branch. Maybe it’s still always possible to provide a worst-case time for every CPU operation, i.e. CPUs cannot take an indefinite amount of time to complete any operation.

      4. 1

        I wonder why it calls our handlers on the UI thread in the first place. Why doesn’t the OS send events to the app thread for us?

        1. 3

          On most platforms, the OS doesn’t send events to any thread, a thread blocks waiting for events. Typically, you build an event-driven model by having an outer loop that waits on something like kqueue or epoll and then fires callbacks when events come in. You don’t have to wait for all events in the same queue, you can wait for things like keyboard and mouse events in one thread and other I/O in another.
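          A minimal sketch of such an outer loop, using epoll and a pipe as a stand-in event source (Linux-specific; the names are mine):

          ```c
          #include <stdio.h>
          #include <sys/epoll.h>
          #include <unistd.h>

          static void on_readable(int fd) {            /* the "event handler" */
              char buf[64] = {0};
              ssize_t n = read(fd, buf, sizeof buf - 1);
              printf("event: %zd bytes: %s\n", n, buf);
          }

          int main(void) {
              int fds[2];
              pipe(fds);                               /* stands in for a socket, input device, etc. */

              int ep = epoll_create1(0);
              struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
              epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

              write(fds[1], "click", 5);               /* pretend an event arrived */

              struct epoll_event ready[8];
              int n = epoll_wait(ep, ready, 8, 1000);  /* the thread blocks here */
              for (int i = 0; i < n; i++)
                  on_readable(ready[i].data.fd);       /* dispatch to callbacks */

              close(ep);
              return 0;
          }
          ```

          Wrap the epoll_wait in a for(;;) and you have the classic event loop; nothing says it has to run on any particular thread.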

          Similarly, GUI toolkits often don’t have a notion of a UI thread; they have a requirement that the GUI APIs are called from a single thread at a time (i.e. you are responsible for locking), except in APIs designed for concurrent update (Cocoa, for example, lets you mark individual views as updated from another thread, which adds some locking; BeOS took this to an extreme). If you want to spin up another thread and run the GUI loop (and nothing else) there, you often can.

        2. 2

          It’s up to applications (or their frameworks) to decide which thread it wants them delivered to, not the OS.

          1. 2

            Would there be harm in just always executing callbacks on an app main thread? The only drawback I can really think of is latency, but it doesn’t seem super compelling, so there are probably some things I’m missing.

            1. 3

              Complexity for application authors. Your callback can’t directly access any of the components, since they’re owned by the UI thread, so you can’t do the simple “on button press, set focus to this textbox” without a mutex.

          2. 1

            As per an HN comment on the same article, there is apparently an opt-in way on Android to disable file/network IO on the main thread altogether.

    7. 3

      This is IMHO really overblown. Having an occasional bit of jank is not a big deal for me or, I think, for most users. In my experience on macOS (with 16GB RAM), I see beachballs pretty rarely, most often in Xcode which is a monster huge app that’s always been a resource hog. Having the system slow down due to paging is annoying, but less annoying than being told “you’re out of memory, quit some apps to make room,” or just having apps crash with OOM errors, like in the bad old days.

      (macOS has always been designed to support near-real-time performance for high-priority threads — that’s one reason it’s such a good platform for audio apps.)

      Having your app never, ever pause is a nice feature. But having the app ship on time, and not crash due to nasty thread-safety bugs, is better.

      1. 4

        Having the system slow down due to paging is annoying, but less annoying than being told “you’re out of memory, quit some apps to make room,” or just having apps crash with OOM errors, like in the bad old days.

        It’s worth noting that both macOS and iOS actually do the latter, but they do it far more sensibly: processes opt into being killed (on iOS, they must opt into this when moved to the background) and so they can be killed when the user isn’t looking. This has a lot of the same benefits as running a copying GC: after the app has restarted, all of the live objects are reallocated in a nicely defragmented space and any that should have been garbage are gone.

        macOS / iOS and Android also have cache infrastructure that lets applications nominate bits of memory that they know how to recreate that the kernel can discard when there’s memory pressure.

        Having your app never, ever pause is a nice feature. But having the app ship on time, and not crash due to nasty thread-safety bugs, is better.

        I think that’s really the key point. A few things (Thunderbird, I’m still looking at you) freeze because they’re doing a lot of I/O on the main thread, but most apps don’t. Going from ‘freezes all of the time’ to ‘rarely freezes’ requires not doing overtly stupid things. Going from ‘rarely freezes’ to ‘never freezes’ is a huge engineering investment because it requires propagating timeouts throughout the entire codebase and having graceful fallback for the places where you time out.

        It’s also not really useful because it means the UI doesn’t freeze, it doesn’t mean that the user isn’t waiting. If the freeze is because of disk or network I/O, you can prevent the UI from freezing but you can’t make the UI ready for what the user wants to do next.

        As with everything else, it’s a tradeoff. The cost is a lot of engineering complexity, time to market, and an increase in average-case resource usage. If people were willing to pay more for late delivery of applications that consume more memory then shipping apps that never freeze would make commercial sense.

    8. 3

      You might be right, but making my desktop stack real-time is going to take many years of work, for very little benefit. It works well 99% of the time, and there are more valuable features to build.

    9. 3

      I’m pretty sure Jaron Lanier brings up exactly this problem in one of his books – Unix has insufficient abstractions for audio. Not just Unix but basically all desktop OSes. (Although from what I understand, Android is significantly worse for audio than iOS)

      Not sure which book it was, but it might be this one


      Unfortunately Google and Amazon no longer seem to have full-text search of books! What a regression

      Does this ring a bell for anyone?

      I like Unix for many things, but he definitely has a point about audio. Also it is indeed disturbing how much abstract ideas shape your world view … if you don’t have the “word” for something, then you can’t think the thought

      And if you don’t have the API, then you might not build the application

      1. 5

        Audio is one of those things where just throwing faster compute and more memory made a lot of the problems go away. When I learned about audio processing, sound card buffers could hold less than one scheduling quantum’s worth of audio. DirectX introduced ring buffers that let the kernel handle the ‘buffer empty’ interrupt and push more data in, without needing a system call (if you didn’t keep the ring full, it would replay older samples, which is where the stuck-record bugs came from). A few years later, it was easy to write 100ms of audio to /dev/dsp at a time and all of these problems went away (for games, maybe you wrote only less at a time, but you’d be scheduled again before you needed to write more and if you could keep up with drawing each frame you could easily keep up with writing that frame’s audio samples, for music players you might write 1s, for video you might write a few frames worth of audio).
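        The arithmetic behind “easy to keep up” is simple; here is a sketch (the sample rate and format are assumptions for illustration, not from the comment):

        ```c
        #include <stdio.h>

        int main(void) {
            int rate = 48000;              /* samples per second per channel */
            int channels = 2;
            int bytes_per_sample = 2;      /* 16-bit PCM */

            /* 100 ms of audio is a tiny amount of data by modern standards. */
            int ms = 100;
            int buf_bytes = rate * channels * bytes_per_sample * ms / 1000;
            printf("100 ms of 48kHz stereo 16-bit audio: %d bytes\n", buf_bytes);

            /* One 60 FPS video frame's worth of audio is smaller still. */
            int frame_bytes = rate * channels * bytes_per_sample / 60;
            printf("one 60 FPS frame of audio: %d bytes\n", frame_bytes);
            return 0;
        }
        ```

        At those sizes, refilling the buffer once per frame is trivial compared to the cost of actually rendering the frame.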

        1. 3

          Throughput is but one small (though not insignificant) piece of the puzzle; the non-strawman case is more generally synchronisation. One variable of interest is given by SNDCTL_DSP_GETOSPACE etc. (in the case of the ring buffer, the obviously superior approach, just examining cursors), but that is incomplete, because there is basically always at least one layer of queueing between you and the sound card, and you are not perfectly synchronised with the audio server. (The video situation is much worse because the queues go much deeper; ISTR there was a VK_GOOGLE_display_timing extension or so to putatively help.)

          There is also the issue of synchronising audio with video; an actual realtime application must handle both separately, but something like a video player can afford to be more relaxed with a longer queue—but only if the a/v synchronisation can be taken care of in a centralised fashion.

    10. 2

      Actually, most UI applications are broken spreadsheet applications. It’s really bad.

      (To be fair, the author does gesture at an actual argument for why this point of view could be useful, but I feel that the stated rationale barely goes past “Man I hate waiting.”)

      1. 2

        “Man I hate waiting” is right! I hate waiting even more when it’s already been 10 minutes and I don’t know how much longer I’ll be waiting and I’m not sure when the last time I saved was.

        In all seriousness, could you elaborate on what you think an actual argument would be?

        1. 2

          Probably whatever you would say in response to mpweiher’s comment would cover it.

          To be fair though, it sounds like you and I work in very different problem spaces. I haven’t experienced the kind of unresponsiveness that is measured in minutes in decades. (And even then it was because our home directories were NFS-mounted.) It’s been years (decades??) since I saw Firefox or GIMP take even 1 second to respond to my user input, despite doing long-running background work.

          (I’ve certainly had command-line programs become unresponsive to ^C, due to being stuck in a system call, but even then they will usually still give right-of-way to ^Z. However, I expect that all that is generally outside of what you meant by “UI applications”.)

          1. 2

            I haven’t experienced the kind of unresponsiveness that is measured in minutes in decades.

            Catastrophic failures like the one I mentioned don’t happen often but they do happen and when they happen it’s not great, especially when it’s something that could have been avoided if the engineers who built the UI stack upon which user applications rely had surfaced these types of issues instead of obscuring them.

            Probably whatever your response to mpweiher’s comment would cover it.

            It boils down to values and arguing values is futile. It seems like they have no problem with the occasional UI stall or even longer term stalls. It’s a feature to them. There was a time when that was okay with me too but not anymore. I don’t want to be left guessing if my application will come back from the dead or not when I’m doing important work. I don’t want to use jittery apps that occasionally miss input events and force me to stay conscious about whether or not the app is stalled before I enter input events. It just seems like the stack is not fully thought through, half-baked, and mediocre. If this is because app developers or app stack developers simply don’t care or are too lazy, that’s really not good enough to me. Of course it may be good enough to you or others and that’s fine, it’s a free country of course.

            BTW Thank you for the “A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux” article. I went through that tutorial at a young age and it motivated me to write an ELF program loader after I read it. I learned a lot and it helped me, very fond memories.

          2. 2

            Huh, I force-killed Krita less than a month ago because I accidentally ran an operation that left it completely frozen for a minute (with no sign of stopping). I was cropping down a map of a provincial park to the relevant area so I could print it for a canoe trip. The original image size is 26077 × 15750.

            Re-ordering my work a bit to crop the map down first, before formatting it and selecting the exact rectangle I wanted to print, solved the problem, but some of the operations (scaling in particular, if I remember correctly) still took seconds to complete.

          3. 2

            I haven’t experienced the kind of unresponsiveness that is measured in minutes in decades.

            Try using Thunderbird on Windows. It does a lot of file I/O on the main thread. This interacts very poorly with the on-demand scanning on Windows: you open a file, your process is blocked, an AV process wakes up, reads the file, does some processing, and eventually approves your access to the file. Scanning the file can easily take a couple of seconds and Thunderbird will open multiple files in a row. This was much worse with mbox because any write to the mailbox would cause the entire thing to need rescanning, so moving to a new folder after an email had been added to it could result in a multi-gigabyte file being rescanned, which easily took tens of seconds.

            Worse, much of the UI is (was?) single threaded and so a background refresh of the folder view in one window would cause an entirely separate compose window to freeze while you’re typing.

    11. 2

      This article made me think of queueing theory / Jackson networks.

    12. 2

      I’m tempted to abandon using Windows, macOS and Linux as the main platforms with which I interact.

      That’s bold. Do you have any alternative in mind, or does it need to be built?

      1. 2

        I didn’t have an alternative ready in mind when I wrote that. Someone on HN posted that QNX with Photon covers exactly all of the concerns I mentioned in the post. I also imagine that someone has already written a hobby OS that has independently designed a system with these properties in mind or wanted to make an open source clone of QNX. Long-term I do intend to migrate, I believe the industry took a misstep in the 90s or so by collectively deciding to base their OSes on a UNIX-like model instead of a RTOS-like model and eventually it will need to be rectified.

    13. 1

      Well that’s a depressing thought.

    14. 1

      That results in a tolerable amount of latency most of the time on standard disk drives but what if the file is stored on a network drive?

      Yesterday I added a file share to “quick access” in Windows 11. But to access it I needed to login and I didn’t check “remember credentials” when I logged into the share. This essentially broke file explorer (well it started working after, I don’t know, 30-60 seconds, but that’s long enough for me to think “oh, it’s broken” and close the thing).

    15. 1

      How much time is this bounded amount of time? 100ms or maybe 250ms.

      imho if it’s likely to take more than a fraction of a frame (say, a generous 1/16th, which on my 240Hz monitor is really not a lot of time… 260 microseconds) it should be spun off to a thread pool and done async.

      At least when working in arts-related fields, missing a frame is never ok. All the video artists I’ve worked with notice it instantly. So please do your job :)

      That said, if you are careful, even mainline Linux makes for a pretty good RTOS, as it allows you to control scheduling fairly precisely, down to where each interrupt is handled (on non-braindead systems; not possible on a lot of older ARM boards, for instance). This makes a lot of timing issues go away, or leaves them limited only by the hardware’s internal jerks and lags.