1. 51
  1.  

  2. 16

    It’s always good to hear people rant about latency, as there’s sources of latency everywhere and nowhere near enough ranting.

    1. 3

      My vague recollection is that @crazyloglad’s Arcan “Desktop Engine” also eliminates most/all compositor latency.

      1. 9

        (Sorry in advance for the long winded comment…)

        Well it doesn’t (yet) but the explanation is rather straining, there is a 7k+ word article on the matter under the name “chasing Moby Blit” in the draft bin that I will quite likely never finish – but I do like the title. It is a really devious topic, practically the hardest problem in system ‘graphics’ as it is really about the entire system. There is good reason as to why Nvidia fought so hard to get extreme vertical integration from G-Sync in the display all the way to every userspace layer.

        The important thing to keep in the back of your head is that tuning knobs only in the video layer by adding delays here and there doesn’t get you very far in the near term, and shoots you in the foot long term. Pretty much all measurements on the matter, including Luu’s terminal ones, are childishly naive - system load matters. Application domain matters. The introductory part to the topic is really to study queueing theory and cognitive psychology.

        The most reliable practical source for study and experience here is still small subsets of emulation communities (if you can tolerate quite a lot of eccentric behaviour) as they have an intractable problem to smooth over, literally. Tons of weird display systems have analog properties that are exploited to great effect, the nominal synchronization rate you need to emulate is rarely evenly divisible on the system the emulator is running on. Combined with ‘speedrunners’ and ‘tool assisted speedrunning’ you actually get pretty harsh test cases. ‘We’ (to the extent I still belong to the collective noun there) have been fighting this dragon since the nineties, it just got a higher priority after CRTs started to die out.

        Anyhow, what is in Arcan for a while is miles beyond what Raph is bringing up here (in terms of tactics), but the conditions are so variable that the results aren’t visible yet. I don’t care much about the (video) compositor side of the thing as it is the more trivial part. Input, Audio, Client processing, Window Manager, Memory Pressure, System call jitter and so on matter more in my world.

        Practically speaking, on just the touch to display case I’ve probably clocked in at a few hundred hours of staring at things like this https://www.youtube.com/watch?v=bW07-iPqaEk&t=7s alone, not including the rest that entails (incidentally that is Arcan observing a robot touching Arcan). I so wish I had access to https://github.com/wolfpld/tracy at the time.

        1. 2

          That profiling video is neat. I am glad people are starting to pay attention to this again.

      2. 2

        when the user is resizing the window […] the app, which gets a notification of the size change […] the window manager, which is tasked with rendering the “chrome” around the window, drop shadows, and so on

        This is why GTK/GNOME were 200% right to go all-in on client-side decorations and relegate server-side decorations to “legacy compatibility junk” status.

        the compositor could be run to race the beam

        Wouldn’t that just be Weston’s repaint scheduling algorithm that was even linked to in the article?

        I’m not sure how literal “race the beam” would make sense in a compositor context. It’s not like the GPU gives feedback (and opportunity to stop everything) per scanline?

        1. 6

          This is why GTK/GNOME were 200% right to go all-in on client-side decorations and relegate server-side decorations to “legacy compatibility junk” status.

          In short, just no. Regardless of their motivation, that specific case doesn’t factor in when you make that design decision. There is minor justification for the topbar contents, though it could still be solved better with a hybrid approach. Border etc. can be made trivially and predictably faster on the server side. The aversion from the GNOME case comes because they have no practical experience with actual SSDs, just PTS (post-traumatic or presentation-time? stress) from the insanity that you have to do in X to achieve a somewhat similar effect.

          At higher refresh rates and bitdepths, SSDs win out as shadow + border can be done entirely in shader on GPU during the final composition while the shadow stage alone can cost milliseconds elsewhere in the chain. Heck, if the client stops decorating you can even compensate for the client not keeping up by keeping the decorations reacting smoothly and estimating the contents in the crop or expand area.

          From the Wayland angle, the dance you have to do with synchronized subsurfaces to achieve the drag resize effect with CSDs (which applies to clients with mixed-origin contents where the contents are already costly) is expensive, complicated, error prone and nobody has gotten it right and efficient for all invariants.

          Wouldn’t that just be Weston’s repaint scheduling algorithm that was even linked to in the article?

          That is specifically what I referred to in the post above as shooting yourself in the foot, and I have the same bullet holes from long ago. Look closely at the “for lols” case here for instance (where also, there are no resize synch issues with SSDs, it worked on a Raspberry Pi and that was 2013..) https://youtu.be/3O40cPUqLbU?t=94 - that overlay in the talespin nes emulation window…

          1. The extra sleep to give the client some time to react only works for cheap clients; with the wrong fonts and today’s higher-end densities, even terminal emulators fail to pass as ‘cheap’.
          2. The tactic blows up if both consumer and producer utilises the same strategy as now you converge on jittering around the deadline.
          3. Swap-with-tear-control sort-of avoids 2, but now frames are even less perfect.
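
          For readers who haven’t seen it, the “sleep before repaint” tactic that the three points above argue against can be sketched roughly as follows. This is a simplified model and not Weston’s actual code; the function names and the millisecond figures are hypothetical, chosen only to illustrate the deadline arithmetic.

```python
def repaint_start(last_vblank_ms, refresh_hz, compositor_budget_ms):
    """Latest time (ms) the compositor can start repainting and still hit the next vblank.

    Assumption: the compositor always finishes within compositor_budget_ms.
    """
    period_ms = 1000.0 / refresh_hz
    next_vblank_ms = last_vblank_ms + period_ms
    return next_vblank_ms - compositor_budget_ms


def client_makes_deadline(frame_ready_ms, last_vblank_ms, refresh_hz, compositor_budget_ms):
    """True if a client frame is ready before the compositor wakes up to repaint.

    A frame that misses this point waits a full extra refresh period.
    """
    deadline_ms = repaint_start(last_vblank_ms, refresh_hz, compositor_budget_ms)
    return frame_ready_ms <= deadline_ms
```

          With a 100 Hz display and a 2 ms compositor budget, the client has to deliver within 8 ms of the previous vblank. A client that adopts the same trick and aims for that 8 ms mark itself is what point 2 describes: both sides end up hovering around the same deadline.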

          Edit (missed this):

          I’m not sure how literal “race the beam” would make sense in a compositor context. It’s not like the GPU gives feedback (and opportunity to stop everything) per scanline?

          There is a fun book on Atari graphics called, specifically, Racing the Beam that is recommended reading. Also per-scanline effects were common in a bunch of graphics systems, see how the Amiga implemented its “drag statusbar to reveal client” effect for instance. Or even see it happen: https://www.youtube.com/watch?v=RagLKuQBlsw

          Regardless, the timing constraints to actually update per scanline are quite lenient as long as you start at the right times (lines / refresh-rate), the contents need to just be more deterministic, not asynchronous lies as anything GPU synthesised is. If it is just about composition, X can do it still. It’s when multiple producers work on the same buffer that you get a ticket to the tear fest. There are still prototype branches in Arcan somewhere that do it for shm-like contents, and it’ll come back for text-only clients specifically.
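
          To make the “start at the right times” arithmetic concrete, here is a rough sketch. It assumes scanout reads the buffer linearly at a constant rate and that `total_lines` includes the vertical blanking lines; the function names and the `margin_lines` safety margin are hypothetical and not taken from Arcan or X.

```python
def line_time_us(refresh_hz, total_lines):
    """Time (microseconds) the scanout engine spends on one line."""
    return 1_000_000.0 / (refresh_hz * total_lines)


def current_scanline(t_since_vsync_us, refresh_hz, total_lines):
    """Line being scanned out t_since_vsync_us after the start of the frame."""
    per_line = line_time_us(refresh_hz, total_lines)
    return int(t_since_vsync_us // per_line) % total_lines


def safe_to_write(target_line, t_since_vsync_us, refresh_hz, total_lines, margin_lines=8):
    """True if target_line is still at least margin_lines ahead of scanout."""
    beam = current_scanline(t_since_vsync_us, refresh_hz, total_lines)
    return target_line >= beam + margin_lines
```

          At 50 Hz with 1000 total lines each line takes 20 µs, so 100 µs into the frame the scanout is at line 5: writing line 20 is still safe, writing line 10 is not with the default margin.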

          1. 2

            SSDs wins out as shadow + border can be done entirely in shader on GPU during the final composition

            Well it could also be done on GPU in the client, together with the rest of the window contents, as GTK4 does.

            Yeah, yeah, current gtk3 apps use CPU and would benefit from compositor side shadows, but the big GTK4 transition is coming :D

            clients with mixed-origin contents where the contents are already costly

            Hmm, I don’t know any situation where CSD would be the only thing forcing the usage of subsurfaces.

            Even in a simple video player, where you would be happy to just present a VAAPI-decoded surface and let the compositor decorate it… whoops, you wouldn’t want to leave the player without any UI controls, right? So you either have to overlay the UI with subsurfaces (and sync its resizing!) or composite it with the video yourself in GL. And then you handle CSD the same way.

            The extra sleep to give the client some time to react only works for cheap clients

            Depends on how much time in the frame budget you reserve for the compositor. A simple one like Sway should always render within 1-2ms – then “cheap” is “not pushing against the limit of the frame budget”. This is much harder for us in Wayfire since we like to wobble our windows and whatnot :)

            High refresh rate monitors make this harder – 1ms out of 8 is a more significant chunk than out of 16 – but they also kinda reduce the whole need for this kind of trick since the latency between the frames themselves is lower.

            not asynchronous lies as anything GPU synthesised is

            But that’s what I’m talking about – the fact that everything is async GPU stuff these days!

            1. 3

              Well it could also be done on GPU in the client, together with the rest of the window contents, as GTK4 does. Yeah, yeah, current gtk3 apps use CPU and would benefit from compositor side shadows, but the big GTK4 transition is coming :D

              Except you are still wasting memory bandwidth with the extra buffer space, you are drawing shadows that can be occluded or clipped at composition, and should take care that the size of the shadow + border does not push the contents outside of tile boundaries. Not to mention (since we can talk wayland) the wl_region annotation for marking the region as translucent, forcing the compositor side to slice it up into smaller quads so you don’t draw the entire region with alpha blending state enabled. Then we come to the impact of this for compression. Lastly, since visual flair boundaries should be pushed, the dream of raytraced 2D radiosity lighting dies…

              Regardless, I will keep on doing this to gtk: https://videos.files.wordpress.com/vSNj5b5R/impostors_dvd.mp4 - though practically the only toolkit application I still use is binary ninja/ida and that’s thankfully Qt. Though having that also lets me do this https://gfycat.com/angelicjointamericanrobin so hey..

              Hmm, I don’t know any situation where CSD would be the only thing forcing the usage of subsurfaces. Even in a simple video player, where you would be happy to just present a VAAPI-decoded surface and let the compositor decorate it… whoops, you wouldn’t want to leave the player without any UI controls, right? So you either have to overlay the UI with subsurfaces (and sync its resizing!) or composite it with the video yourself in GL. And then you handle CSD the same way.

              No, sure, you have things like the GTK3 abuse of that for their pseudo-popups as well. I have a beef with subsurfaces as the way out for CSDs specifically (well, and how much complexity it adds to the wayland implementation itself), not for clipped embedding (though hey, subsurfaces aren’t supposed to be clipped according to spec..).

              Depends on how much time in the frame budget you reserve for the compositor. A simple one like Sway should always render within 1-2ms – then “cheap” is “not pushing against the limit of the frame budget”. This is much harder for us in Wayfire since we like to wobble our windows and whatnot :)

              I see your https://github.com/WayfireWM/wayfire/blob/master/plugins/wobbly/wobbly.cpp and raise with https://github.com/letoram/durden/blob/master/durden/tools/flair/cloth.lua#L17 - clothy windows are obviously superior to wobbly. https://videos.files.wordpress.com/zmiBKUyQ/snow_dvd.mp4 - also note how they behave differently based on decorated or not because hey, SSDs. But that is years ago, wait until you see the next thing I have in the pipeline (which relies even more on SSDs to even be functional)..

              High refresh rate monitors make this harder – 1ms out of 8 is a more significant chunk than out of 16 – but they also kinda reduce the whole need for this kind of trick since the latency between the frames themselves is lower.

              No, what I mean is specifically for the quality of these operations, as the jitter becomes even more noticeable as animations go from smooth to blergh. It is more likely to happen in the drag-resize case rather than steady state, from the storm of reallocation, cache/mipmap invalidation, … There’s a lot more to that scenario when you factor in wayland but wall of text is already there.

              But that’s what I’m talking about – the fact that everything is async GPU stuff these days!

              SVG and text rendering want a word, NV_Path_Trace won’t happen, remember? That the processing is async doesn’t mean that the rest of your display system is allowed to just throw its hands into the air, you need to get much more clever with scheduling to not lag further behind and implicit synch has poisoned so much here. Fences should’ve been the norm 7+ years ago and they still aren’t generally possible.

            2. 1

              Does “racing the beam” work on non-CRT displays? My understanding of “racing the beam” is that you’d race a single, physical electron gun that was quickly firing at the screen back and forth in a series of lines (the “scanlines”). But LED and OLED displays, for example, don’t have the same notion of scanlines. Is there an equivalent deterministic race you can play with them? From my (limited) understanding of LED refresh tech it seems hard to pull off, especially since LED screens have all sorts of magic happening behind the scenes that varies by manufacturer, e.g. frame interpolation or black frame insertion, that could potentially mess with the timing… But I’m not a graphics engineer so I don’t know a whole lot about this topic.

              1. 2

                Not in its traditional form, no. You are racing the buffer that is being scanned out rather than the display itself, so it is more ‘drawing to the front buffer’, hoping that whatever scanout engine is in use reads it linearly (and some don’t, certain Samsung phones come to mind).

                1. 2

                  Sort of, yes. Instead of racing the electron beam, you’re racing the HDMI/DisplayPort stream, rendering pixels just in time to send them down the wire.