We present BLAKE3, an evolution of the BLAKE2 cryptographic hash that is both faster and also more consistently fast across different platforms and input sizes. BLAKE3 supports an unbounded degree of parallelism, using a tree structure that scales up to any number of SIMD lanes and CPU cores. On Intel Cascade Lake-SP, peak single-threaded throughput is 4× that of BLAKE2b, 8× that of SHA-512, and 12× that of SHA-256, and it can scale further using multiple threads. BLAKE3 is also efficient on smaller architectures: throughput on a 32-bit ARM1176 core is 1.3× that of SHA-256 and 3× that of BLAKE2b and SHA-512. Unlike BLAKE2 and SHA-2, with different variants better suited for different platforms, BLAKE3 is a single algorithm with no variants. It provides a simplified API supporting all the use cases of BLAKE2, including keying and extendable output. The tree structure also supports new use cases, such as verified streaming and incremental updates.
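For anyone wondering what the tree structure buys you, here is a rough sketch of the general idea. It is not BLAKE3: it uses `hashlib.blake2s` from the Python standard library as a stand-in, the domain-separation prefixes and the helper names (`leaf_hash`, `parent_hash`, `tree_hash`) are made up for illustration, and the digests will not match real BLAKE3 output. The 1024-byte chunk size is the one BLAKE3 uses; the rest is an assumption.

```python
import hashlib

CHUNK_SIZE = 1024  # BLAKE3's chunk size; everything else here is illustrative


def leaf_hash(chunk: bytes, index: int) -> bytes:
    # Hypothetical leaf rule: the 0x00 prefix separates leaves from parents,
    # and the chunk index ties each leaf to its position in the input.
    return hashlib.blake2s(b"\x00" + index.to_bytes(8, "little") + chunk).digest()


def parent_hash(left: bytes, right: bytes) -> bytes:
    # Hypothetical parent rule: the 0x01 prefix marks an interior node.
    return hashlib.blake2s(b"\x01" + left + right).digest()


def tree_hash(data: bytes) -> bytes:
    # Split the input into fixed-size chunks (empty input is one empty chunk).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    # Each leaf depends only on its own chunk, so this step could be spread
    # across SIMD lanes or threads; here it simply runs sequentially.
    level = [leaf_hash(chunk, i) for i, chunk in enumerate(chunks)]
    # Combine neighbours pairwise until a single root remains.
    while len(level) > 1:
        merged = [parent_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # an unpaired node moves up unchanged
        level = merged
    return level[0]


if __name__ == "__main__":
    print(tree_hash(b"a" * 5000).hex())
```

Because the leaves are independent, they can be hashed in any order or in parallel. The same layout is what makes incremental updates and verified streaming possible in principle: changing one chunk only requires recomputing the nodes on the path from that leaf to the root.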
Can recommend. Used this in bitbottle (posted here last year).
Section 7.5:
heh, yeah.