This article misses one of the core differences between ByteBuffer and ByteArrayInputStream/ByteArrayOutputStream: the former is not thread safe while the latter is. Granted, when you are deserializing, you are usually reading from the buffer from only one thread.
It is also a bit disappointing that there is no ByteBuffer-backed structure (also not thread safe and also fast) that dynamically allocates new buffers to support writing data whose final size you don’t yet know. Sure, I could roll my own (or use one of the many half-baked library implementations), but given that the JDK supports ConcurrentSkipListSet, surely it has space for this simple addition?
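For what it's worth, rolling your own can be fairly small. Here is a minimal sketch, assuming a copy-and-double growth strategy rather than a chain of separate buffers; `GrowableByteBuffer` and its methods are hypothetical names, not anything in the JDK, and like ByteBuffer itself it is not thread safe.

```java
import java.nio.ByteBuffer;

/** Hypothetical growable writer backed by a heap ByteBuffer. Not thread safe. */
final class GrowableByteBuffer {
    private ByteBuffer buf;

    GrowableByteBuffer(int initialCapacity) {
        buf = ByteBuffer.allocate(Math.max(1, initialCapacity));
    }

    /** Grows the backing buffer (at least doubling it) if fewer than `needed` bytes remain. */
    private void ensureRemaining(int needed) {
        if (buf.remaining() >= needed) {
            return;
        }
        int required = buf.position() + needed;
        // Assumption: sizes stay far below Integer.MAX_VALUE, so no overflow handling here.
        int newCapacity = Math.max(buf.capacity() * 2, required);
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
        buf.flip();          // switch the old buffer to reading what was written
        bigger.put(buf);     // copy it; bigger's position ends where the old one was
        buf = bigger;
    }

    GrowableByteBuffer putInt(int value) {
        ensureRemaining(Integer.BYTES);
        buf.putInt(value);
        return this;
    }

    GrowableByteBuffer putLong(long value) {
        ensureRemaining(Long.BYTES);
        buf.putLong(value);
        return this;
    }

    GrowableByteBuffer put(byte[] src) {
        ensureRemaining(src.length);
        buf.put(src);
        return this;
    }

    /** Read-only view over the bytes written so far; shares storage with the writer. */
    ByteBuffer toReadBuffer() {
        ByteBuffer view = buf.duplicate();
        view.flip();
        return view.asReadOnlyBuffer();
    }
}
```

Something like `new GrowableByteBuffer(16).putInt(7).putLong(42L).toReadBuffer()` then gives you a buffer positioned at the start of the written data. Copying on growth keeps reads simple at the cost of the occasional large copy; chaining fixed-size buffers would avoid the copy but complicates reads that straddle a boundary.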
Oh! That could explain a lot. Taking a mutex costs about 15ns on contemporary CPUs, even in the completely happy zero contention path (since that is about how long an uncontended lock cmpxchg takes, last time I benchmarked it on a single-socket dual core Intel laptop from about 2017ish).
Honestly, “thread safe byte stream” sounds like a really silly API to me. If two different threads are pulling bytes out of the same file descriptor, it’s almost completely implausible that they won’t tread on each other’s toes. Even with a mutex guarding their read() calls so that each attempt to get a given number of bytes forms a critical section (even if it happens to involve multiple underlying read(2) syscalls), the only plausible data formats where you wouldn’t cause confusion would have to be ones where each record is fixed-size (so you never get one thread causing another to read halfway through a record) and the ordering of the records doesn’t matter at all.
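To make that narrow case concrete, here is a rough sketch of about the only arrangement where sharing works: fixed-size records, one lock around each whole-record read, and consumers that don't care about order. `SharedRecordReader`, `RECORD_SIZE`, and `drainWithTwoThreads` are made-up names for illustration, and the 8-byte record size is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical reader that hands out whole fixed-size records to multiple threads. */
final class SharedRecordReader {
    static final int RECORD_SIZE = 8; // assumed record size: one big-endian long

    private final DataInputStream in;
    private final Object lock = new Object();

    SharedRecordReader(InputStream source) {
        this.in = new DataInputStream(source);
    }

    /** Returns the next whole record, or null at end of stream (a partial tail is dropped). */
    byte[] nextRecord() throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        synchronized (lock) {
            try {
                // readFully may issue several underlying reads; the lock makes
                // the whole record one critical section.
                in.readFully(record);
            } catch (EOFException e) {
                return null;
            }
        }
        return record;
    }

    /** Drains the data with two threads; the result is a set because order is not preserved. */
    static Set<Long> drainWithTwoThreads(byte[] data) {
        SharedRecordReader reader = new SharedRecordReader(new ByteArrayInputStream(data));
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Runnable worker = () -> {
            try {
                byte[] rec;
                while ((rec = reader.nextRecord()) != null) {
                    seen.add(ByteBuffer.wrap(rec).getLong());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        }
        return seen;
    }
}
```

Note that the lock lives in the wrapper, not in the stream: the synchronization built into the stream's own read() calls would not be enough to keep a multi-call record read atomic. And with variable-length records, or any dependence on ordering, this falls apart as described above.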
I find it a little bit freaky how different high level abstractions, with seemingly similar semantics, can have wildly different performance properties. How can we make these easier to reason about?
Are performance properties ever easy to reason about? How fast your program runs is sort of the ultimate leaky abstraction.
Higher level languages do more to hide this from you, but making those languages fast in all cases is generally harder (Julia, Haskell, etc.).
Normally you can find a hot spot and optimize around it. But when the thing is part of the language, one typically brushes past the details and assumes the problem is in one's own code.