So we have 11 bits of precision, which is log10(2**11) decimal digits of precision, which is roughly 3? What kind of computations need only 3 digits of precision? I’m genuinely curious, since it must have a use (otherwise it wouldn’t have been added).
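For anyone who wants to check the digit count, here is a quick sketch using only the Python standard library. The helper name `to_half` is made up for this example; it relies on the `struct` module's `e` format, which packs IEEE 754 half-precision values.

```python
import math
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# 10 stored fraction bits plus the implicit leading 1.
significand_bits = 11
decimal_digits = significand_bits * math.log10(2)

# Round-trip a sample value and measure how much was lost.
x = 3.14159
rel_error = abs(to_half(x) - x) / x

print(f"{decimal_digits:.2f} decimal digits")
print(f"{x} stored as {to_half(x)}, relative error {rel_error:.1e}")
```

The relative error of a single rounding should never exceed 2**-11 for values in the normal range, which is where the "roughly 3 digits" figure comes from.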
Deep learning models have been shown to not lose significant performance by using bfloat16/Half. Furthermore, you can build models that are twice as large, sometimes giving a performance increase.
IIRC half float has seen a bunch of use in graphics for HDR rendering, where the extra range is very useful. 11 bits is fine for that since 10 bit per channel monitors are rare.
Why not just use a 16-bit uint and then scale it down to the device depth at the end? (In other words, fixed-point with a range 0-1.) That gives you around 200x the color resolution of a display, which seems like more than enough, without the overhead of dealing with FP.
The goal when doing HDR rendering is to be able to represent sunlight (~100k lux) and candlelight (~10 lux) and maybe even moonlight (~0.1 lux or less) in a single image.
Half float gives you 5 bits of exponent, so the smallest and largest representable finite nonzero numbers are about a factor of a billion apart (about a trillion if you count subnormals). 16-bit linear gives you only a factor of 65535 between the smallest and largest representable nonzero numbers.
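The range comparison can be checked by decoding the extreme bit patterns directly. This is a rough sketch, assuming IEEE 754 binary16 as exposed by Python's `struct` module; `half_from_bits` is a hypothetical helper written for this example.

```python
import struct

def half_from_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

largest_finite = half_from_bits(0x7BFF)      # 65504.0
smallest_normal = half_from_bits(0x0400)     # 2**-14
smallest_subnormal = half_from_bits(0x0001)  # 2**-24

# Dynamic range: largest representable value over smallest nonzero value.
ratio_normal = largest_finite / smallest_normal
ratio_subnormal = largest_finite / smallest_subnormal
ratio_uint16 = 65535 / 1  # 16-bit linear: 1 up to 65535

print(f"half, normal numbers only: {ratio_normal:.3g}")
print(f"half, with subnormals:     {ratio_subnormal:.3g}")
print(f"16-bit linear:             {ratio_uint16:.3g}")
```

The subnormal end of the range has fewer significant bits, so how much of it is usable depends on how much precision the dark end of the image needs.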
HDR doesn’t demand more precision, it demands more range.
After rendering the scene to an HDR buffer, post-processing steps (bloom, exposure control) are used to transform the high dynamic range image into something convincing that fits within your monitor’s narrow dynamic range.
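To make the last step concrete, here is a toy sketch of squeezing an HDR value into an 8-bit display value. Nothing here comes from a specific engine: the function name `tone_map`, the `exposure` parameter, the simple Reinhard curve x / (1 + x), and the 2.2 gamma are all assumptions picked for illustration.

```python
def tone_map(luminance: float, exposure: float = 1.0) -> int:
    """Map a non-negative HDR luminance to an 8-bit display value (0..255)."""
    scaled = luminance * exposure           # exposure control
    compressed = scaled / (1.0 + scaled)    # Reinhard curve: [0, inf) -> [0, 1)
    encoded = compressed ** (1.0 / 2.2)     # approximate display gamma
    return round(encoded * 255)

# Moonlight, candlelight, and sunlight from the comment above, one exposure.
for lux in (0.1, 10.0, 100_000.0):
    print(lux, "->", tone_map(lux, exposure=0.01))
```

All three inputs land inside 0..255 and keep their ordering, even though the inputs span six orders of magnitude. Real pipelines use more elaborate curves and pick the exposure adaptively.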
My encounters with f16 types have mainly been neural net stuff, where using twice as many f16’s often gets one better performance than the same number of bytes of f32’s. Something something more degrees of freedom at lower resolution mumble mumble. 11 bits gets you basically 0.05% precision, so the overall precision isn’t that bad, errors just add up quickly.
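A small experiment shows how quickly the errors add up. This is only a sketch: it emulates half and single precision by rounding through Python's `struct` module after every addition, and the helper names `to_half` and `to_single` are invented for the example.

```python
import struct

def to_half(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_single(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

step = to_half(0.1)  # the inputs themselves are stored as half floats
n = 10_000
exact = step * n     # reference result computed in double precision

half_sum = 0.0
single_sum = 0.0
for _ in range(n):
    half_sum = to_half(half_sum + step)        # accumulate in half
    single_sum = to_single(single_sum + step)  # accumulate in single

print(f"reference:           {exact}")
print(f"half accumulator:    {half_sum}")
print(f"single accumulator:  {single_sum}")
```

The half accumulator stops growing once the step is smaller than half the gap between neighbouring half values, which for 0.1 happens at 256, where the gap is 0.25. Keeping the inputs in half precision but accumulating in a wider type avoids that, which is presumably why mixed-precision setups do exactly this.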
This is neat - I always wondered why several languages have Float32 but nothing smaller