Close to ten years ago I happened to be at a conference where I met this very enthusiastic fellow who did “signal analysis” on data streams on behalf of some “telco”. He explained and demonstrated how they reconstructed plain speech from encrypted Skype streams with (insert adaptive yolo statistics ML of the time) fed with phonetic models, metadata from src/dst geo-fingerprints, packet sizes and timing, etc. I recall suggesting that this seemed like an awesome compression opportunity. The reply was something akin to “that’s not an application domain we are considering”, with a very Indiana Jones-y vibe. His German accent did not do him any favours.
This sort of creepiness comes up every now and then. It is not unlikely that video, or even video+audio, will be plausibly synthesized from material people have already given to facebook-as-instagram et al. in exchange for virtual peanuts. Once there are more palatable application areas for the very same techniques, the slide from deniable to merely plausibly-deniable will force open the flood gates.
So, instead of decrying privacy for the nth time, now, if not yesterday, is the time to research, design and develop techniques and social internet spaces that train people to see the inherent uncanny-valley bullshit and call it out. Evolution marches on and all that.
I don’t think it will catch on as compression per se. Video calls are limited mainly by latency and connection instability, rather than by the lack of a kilobyte of bandwidth. I assume a GAN isn’t going to be cheaper to process than a classic video codec, so this won’t enable 100+ participant video calls either.
However, I can imagine someone making a Black Mirror-esque startup for customer support video calls. Imagine a GAN that makes your customer support reps all look like always-smiling clones of blonde Fox News anchors, regardless of who they actually are and how miserable they look in reality.
This is just a link to a promotional video by NVIDIA about the technology. It would be much better if it was accompanied by a mindful writeup.
That’s fair. I’ll take the time to write something up about it and post it as a comment.
This seems like it would, over time, desensitize users to deepfakes. The image doesn’t look great (see: too many teeth and dead eyes), but I’m more worried about the implications of remote workers seeing their coworkers through a neural net more often than in person.
Well, I think that eventually we’ll all be chatting in virtual spaces with our avatars. Meanwhile, enjoy the preview.
Also, obviously, VRChat.
While pretty cool, my major worry is that faces will be biased toward whatever the sample set is – we’re all gonna start looking like whatever Nvidia thinks we should look like. Admittedly this happens on a smaller scale whenever engineers pick coefficients for lossy compression algorithms, but usually that’s done based on psychoacoustic or psychovisual (?) terms as opposed to “this matched our sample set.”
Bias is ‘loss’ in lossy compression; skin is smoother and blemishes fewer at lower bitrates in other encoders as well, and so on. Predictable motion will have favourable compression, arrive faster and appear smoother in streamed video. The point is, biases trickle in regardless of what you do. It is just that the author of the algorithm doesn’t have some magical conspiracy-certified control switch. If the end results are not plausible it won’t be adopted, and you can verify the output by streaming to yourself.
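To make the “bias is just loss” point concrete, here is a minimal toy sketch (not any real codec; a hypothetical 1-D signal, NumPy/SciPy assumed): JPEG-style quantization of DCT coefficients discards low-energy high-frequency texture first, which is exactly the smoother-skin-at-lower-bitrate effect.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
n = 64
smooth = np.sin(np.linspace(0, 2 * np.pi, n))   # broad structure (the "face")
detail = 0.2 * rng.standard_normal(n)           # fine texture (the "blemishes")
signal = smooth + detail

def lossy(x, step):
    """JPEG-style round trip: quantize DCT coefficients with the given step size."""
    c = dct(x, norm="ortho")
    return idct(np.round(c / step) * step, norm="ortho")

def hf_energy(x):
    """Energy in the high-frequency half of the spectrum."""
    return float(np.sum(dct(x, norm="ortho")[n // 2:] ** 2))

for step in (0.05, 0.5, 2.0):                   # bigger step ~ lower bitrate
    print(f"step={step}: high-frequency energy {hf_energy(lossy(signal, step)):.4f}")
```

At coarse quantization the small high-frequency coefficients all round to zero, so the reconstruction is systematically smoother than the input. Nobody flipped a switch; the bias falls out of the rate-distortion trade-off.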
If you look at examples of the same set of techniques being applied to upsampling of old video games (search for ESRGAN in gaming and emulation forums) and to current realtime ray tracing (the other side of the same-ish coin), you’ll see that it is more of a quality filter than the more questionable ‘style transfer’ (different but related).