1. 26
  1. 8

    This is a really good idea that I wish was carried over to Linux somehow. I find it frustrating that Linux regularly has issues with the GPU drivers, and once they crash, there’s no way of recovering the system (with KMS at least) or falling back to VESA drivers. I wonder how difficult it would be to implement…

    1. 5

      “How difficult” and “no way”: well, it depends on how far down the rabbit hole you care to go. https://arcan-fe.com/2017/12/24/crash-resilient-wayland-compositing/

      I have gone quite a bit further since then, all of the following is implemented in Arcan, albeit the last steps being experimental.

      Short summary of steps (article + current state).

      1. WM level - easy, all WM to scene-graph operations as line-format to external process or, better still, a virtual machine that can act as both (Lua is a great fit due to how the interpreter integrates as it latches on to other steps). Database store for ‘wm-side’ logic with last safe dynamic state (size, workspace) and per-client allocation GUIDs to re-pair. Bonus: doing this gets you ‘hot reload’ for coding the WM while using the WM. Added points for having a fallback WM for when you break your WM from your WM in the repairing step.

      2. Server crashes - needs client side support; detection mechanism for parent crashes and a way to re-discover how to connect to the server. Bonus: this opens up some interesting features since the WM can now tell clients to go elsewhere and make it look like a crash. Combine that with a network proxy and you get https://www.youtube.com/watch?v=_RSvk7mmiSE

      3. WM or server live-locks - split off into a privileged watchdog process (needed anyhow unless you think logind/dbus is a good idea) and update a shared memory timestamp when you enter critical sections (WM interpreter, render loop, scanout). If the watchdog timeout triggers, try a soft recovery (i.e. WM level hook from step 1) and, failing that, kill the process and have step 2 activate. This fixes some GPU crashes as well.

      4. Cooperative GPU support - there is a mechanism for GPUs to say that they have lost state. It is, in my experience, not super reliable (see GL_KHR_robustness) but your browser typically needs to do this often when you browse ‘shadertoy’ or anything else WebGL really. Chances are the workload that triggered the first crash will re-trigger it, so better still is to have an extra GPU. Wire the display server to be able to completely rebuild server-side resources (including dynamic assets like render-to-texture-like passes and shaders), which is the proper way to implement VT switching anyhow, and just swap the GPU used. This can never be fully reliable, as the capabilities of GPU A might not match those of GPU B. That’s where it helps to have 2. as a fallback. Then any clients still using the failed GPU need to be able to understand that they should stop, so it requires the display server protocol/API to be designed for that eventuality.

      5. Uncooperative GPU support - chances are that your GPU will be useless until reboot since kernel-side part of GPU drivers are just the worst. If it brings down your kernel, well, good luck, if it doesn’t - this is where the ‘network proxy’ part of 2. comes into play. Designate an external fallback machine, hook it up to the network proxy, have the clients migrate there. Bonus points if everything is designed in such a way that state can be backed up and serialized over this connection.
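
A minimal sketch of the watchdog idea in step 3, in Python for brevity. This is not how Arcan implements it; `Heartbeat`, `watchdog_check`, `soft_recover` and `hard_kill` are hypothetical names for illustration. The monitored process stamps a shared-memory slot when it enters a critical section, and the watchdog escalates from soft recovery to a kill when the stamp goes stale.

```python
import mmap
import struct


class Heartbeat:
    """One float64 timestamp in anonymous shared memory.

    An anonymous mmap is inherited across fork(), so a watchdog parent
    and a forked display server would see the same slot.
    """

    def __init__(self):
        self._mem = mmap.mmap(-1, 8)

    def enter(self, now):
        # Called by the monitored process when entering a critical
        # section (WM interpreter, render loop, scanout).
        struct.pack_into("d", self._mem, 0, now)

    def last(self):
        return struct.unpack_from("d", self._mem, 0)[0]


def watchdog_check(hb, now, timeout, soft_recover, hard_kill):
    """Run one watchdog poll and report what was done.

    soft_recover and hard_kill are assumed callbacks: the first stands
    in for the WM-level hook from step 1 and returns True on success,
    the second kills the process so client-side recovery (step 2) kicks in.
    """
    if now - hb.last() <= timeout:
        return "ok"
    if soft_recover():
        # Grant a fresh grace period after a successful soft recovery.
        hb.enter(now)
        return "soft"
    hard_kill()
    return "killed"
```

In a real setup `now` would come from something like `time.monotonic()` in both processes, and `hard_kill` would wrap `os.kill(pid, signal.SIGKILL)`; both are left out here so the escalation logic can be exercised on its own.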

      1. 1

        Uncooperative GPU support - chances are that your GPU will be useless until reboot since kernel-side part of GPU drivers are just the worst

        Somehow Windows is really good at this. Most of the time even when pushing wildly unstable overclocks for maxing out benchmark scores, the driver can successfully restart without restarting the whole OS.

        1. 2

          So there is a culture of this mindset on the Windows side that, for the graphics subsystem, dates back to the early DirectX days. You were sort of forced to think of the ‘V’ in VRAM as ‘more volatile than regular RAM’ - you could lose your contents at any time, and reliably/repeatably so by switching back and forth from fullscreen to desktop, which also meant it got some massage.

          In the realm of wild speculation, of course, but the relationship between ‘blob drivers’ and Windows kernel devs has also been more symmetric in terms of power/influence during negotiations than on the FOSS side, where you have a much more polarized (‘we are doing them a favour’ || ‘sod off, we don’t need you’) dynamic.
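
To make the ‘VRAM is volatile’ mindset concrete, here is a small Python sketch under stated assumptions: `ResourceCache` and its `upload` callback are hypothetical and do not correspond to any real DirectX, GL or Arcan API. The point is only that the CPU-side source of every resource is kept around, so device-side handles can be thrown away and rebuilt at any time, possibly on a different GPU.

```python
class ResourceCache:
    """Keeps CPU-side sources so device handles can always be rebuilt."""

    def __init__(self, upload):
        # upload is an assumed callable: source -> device-side handle.
        self._upload = upload
        self._sources = {}
        self._handles = {}

    def create(self, name, source):
        # Remember the source first; the handle is treated as disposable.
        self._sources[name] = source
        self._handles[name] = self._upload(source)
        return self._handles[name]

    def handle(self, name):
        return self._handles[name]

    def device_lost(self):
        # All device-side state is gone; only the sources survive.
        self._handles.clear()

    def rebuild(self, upload=None):
        # Optionally switch to another device's upload function,
        # then recreate every handle from its retained source.
        if upload is not None:
            self._upload = upload
        for name, source in self._sources.items():
            self._handles[name] = self._upload(source)
```

A caller would invoke `device_lost()` when the driver reports lost state and then `rebuild()`, passing a different `upload` if the work is being moved to a second GPU. Capability mismatches between the two devices are not handled in this sketch.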

    2. 6

      I’d love to hear @crazyloglad ’s take on this, relative to the Arcan security model and its Linux and OpenBSD implementations.

      1. 2

        So first see the other lengthy reply I wrote - as they sort of relate and some of the things claimed ‘impossible’ I have demonstrated, but of course not in the context of Genode. That is more in terms of the ‘Integrity’ and ‘Availability’ of the CIA model.

        The other bit is that it is harder to comment on the security aspects without going much deeper into threat modelling, incidentally I have an article written on most of this, but I passionately hate dissemination and the internet attention that goes with it; 2020 has been an absolute nightmare and the walls are creeping in.

        Personally, I much favour ‘per device’ security domains, as rowhammer is still alive and well, and I have a strong hunch it will come back with a vengeance; there is no realistic way of securing the ‘browser’, and for anything else you want to protect there seem to be endless side channels. In the same vein, soundproofing does not do that much to stop your neighbours from hearing you having loud and violent sex, especially not if they attach microphones to the wall or use your wifi as a cheap radar.

        What I have higher up in the threat model is ‘information parasites’ that piggyback on some resource that clients need ephemeral access to (webcam, display, microphone). This relates to the framebuffer protection in the article - but it does not have to. The thing is that the browser/app/whatever ‘do you want to allow xyz to access’ model of security HCI is just so easily circumvented by being annoying as hell while being socially/professionally necessary. I allude to what I believe is a better solution in the “chain of trust” part of https://arcan-fe.com/2020/02/10/leveraging-the-display-server-to-improve-debugging/

        1. 1

          Oh man. Thanks for coming out of hiding this much at least. I guess on a more positive note re per-device security domains, RasPi-class machines are so cheap and prevalent now that self-hosted micro-clusters are within reach for most of us paranoids as hypervisor replacements. Of course then there’s still the issue of securing client access. And external networking.

          Looking forward to seeing the latest on Arcan / Durden, even if it’s after the apocalypse. Very inspiring work.

          1. 2

            No problem - I hope news will come way before the apocalypse, posts are written and videos almost recorded. There is something quite interesting coming as a ‘side project but not really’ with all of this; not to spill the beans just yet but imagine, if you will, how the ideas of ‘ultimate plumber + userland.org + eagle mode’ could be combined.

            As for SBC-class hardware, look into the sopine clusterboard - that’s what I use here. Also peek at src/a12/net in the main repo…