It’s interesting that this article describes spatial and temporal redundancy as different things. The best description of video encoding that I ever heard suggested that you think of a video stream as a 3D volumetric image. Adjacent voxels are adjacent voxels, whether they’re adjacent in the x, y, or z axis. It would be weird to treat x and y differently, so why treat z differently? A video CODEC is just a 3D image CODEC that’s optimised for extracting planar samples. Modern video CODECs are much more explicit about this, especially those that are not optimised for realtime live encoding and so can use the z dimension in both directions to optimise the encoding of any slice along that axis.

That’s a really interesting point of view! As I understand it, JPEG is basically a 2D discrete cosine transform (DCT) applied to 8x8 blocks. So would a 3D DCT on 8x8x8 blocks make a reasonable video codec?
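To make the question concrete, here is a rough sketch of what a 3D DCT on an 8x8x8 block could look like, assuming it is built the same separable way as the 2D case: a 1D DCT-II along x, then y, then z. The names `dct_1d`, `dct_3d` and `static_clip` are made up for illustration, and there is no quantisation or entropy coding here, so this is only the transform step and not a codec.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_3d(block):
    """Separable 3D DCT of block[z][y][x]: 1D DCT along x, then y, then z."""
    nz, ny, nx = len(block), len(block[0]), len(block[0][0])
    # Pass 1: transform every row (x axis).
    c = [[dct_1d(row) for row in plane] for plane in block]
    # Pass 2: transform every column (y axis).
    for z in range(nz):
        for x in range(nx):
            col = dct_1d([c[z][y][x] for y in range(ny)])
            for y in range(ny):
                c[z][y][x] = col[y]
    # Pass 3: transform every pixel's history through time (z axis).
    for y in range(ny):
        for x in range(nx):
            tube = dct_1d([c[z][y][x] for z in range(nz)])
            for z in range(nz):
                c[z][y][x] = tube[z]
    return c

# Hypothetical test clip: 8 identical frames of a horizontal gradient.
frame = [[float(x * 16) for x in range(8)] for _ in range(8)]
static_clip = [[row[:] for row in frame] for _ in range(8)]
coeffs = dct_3d(static_clip)

# Largest coefficient magnitude outside the kz = 0 plane.
temporal_energy = max(abs(coeffs[z][y][x])
                      for z in range(1, 8) for y in range(8) for x in range(8))
print(temporal_energy)  # very close to 0: nothing changes over time
```

For a clip that does not change over time, everything lands in the kz = 0 plane, which is the temporal redundancy showing up as zeros in exactly the way spatial redundancy does in a flat JPEG block.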

[ Disclaimer: This is from the perspective of someone whose knowledge of signal processing is very weak and who prefers to live in a world containing only integers ]

For some value of reasonable. JPEG does very badly on sharp discontinuities (a square wave needs an infinite number of cosines to represent it exactly, so once the high-frequency terms are quantised away you get ringing). In the z dimension, any scene change would show up as a sharp discontinuity, so you’d end up with the same sort of artefacts on scene changes that you get around sharp lines in JPEG, assuming DCT works in three dimensions the way I imagine it does.
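A small experiment along one axis illustrates the discontinuity point. The sketch below compares the DCT of a smooth ramp with that of a hard step (think of 8 samples of one pixel across a scene cut). The helper `high_freq_fraction` is a made-up metric for this comparison, not something taken from JPEG. Note that with only 8 samples the 8 coefficients always represent the signal exactly; the trouble only starts when a lossy codec quantises the higher ones.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def high_freq_fraction(samples):
    """Share of the non-DC energy that sits above the first AC coefficient."""
    c = dct_1d(samples)
    ac_energy = sum(x * x for x in c[1:])
    return sum(x * x for x in c[2:]) / ac_energy

ramp = [i * 32 for i in range(8)]          # smooth change over time
step = [0, 0, 0, 0, 255, 255, 255, 255]    # scene cut in the middle

print(high_freq_fraction(ramp))
print(high_freq_fraction(step))  # noticeably larger than for the ramp
```

The step spreads far more of its energy into the higher coefficients, which are the ones a lossy codec tends to quantise hardest.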

Funny how well that LOCO-I predictor seems to work despite its crudeness. Also, it’s pretty cool how Rice coding degenerates into unary encoding when k=1.
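For anyone who wants to poke at both ideas, here is a minimal sketch. It assumes the predictor in question is the median edge detector used by LOCO-I / JPEG-LS, and that the Rice parameter k is the number of remainder bits, so the divisor is 2^k. Under that convention the pure-unary case is k = 0; if the parameter is instead taken to be the divisor itself, the same case is a divisor of 1. The function names and the bit-string output are illustrative choices only.

```python
def med_predict(left, above, above_left):
    """Median edge detector: guess a pixel from three already-decoded neighbours."""
    if above_left >= max(left, above):
        return min(left, above)
    if above_left <= min(left, above):
        return max(left, above)
    return left + above - above_left  # smooth area: planar extrapolation

def zigzag(residual):
    """Map signed residuals 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    return 2 * residual if residual >= 0 else -2 * residual - 1

def rice_encode(n, k):
    """Rice code of n >= 0 as a bit string: unary quotient, then k remainder bits."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    bits = "1" * (n >> k) + "0"            # quotient in unary, 0-terminated
    if k:
        bits += format(n & ((1 << k) - 1), "0{}b".format(k))
    return bits

# Predict a pixel, then code the prediction error.
actual = 18
predicted = med_predict(left=10, above=20, above_left=15)  # 10 + 20 - 15 = 15
code = rice_encode(zigzag(actual - predicted), k=2)

print(rice_encode(5, 0))  # 111110 (five ones and a terminator: plain unary)
print(rice_encode(5, 2))  # 1001   (quotient 1 in unary, remainder 01)
```

With no remainder bits the code is just the quotient in unary, which is the degenerate case mentioned above.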