I wanted to generate a Z-curve (in order) a while ago. The fastest approach I found (iirc, faster than using pdep and pexp) was just to have a recursive function that used only increment and multiply-by-two. So you have 32 or 64 stack frames and you spend most of your time in the bottom few frames. The caching and data dependency behavior of this code is very good, so you get excellent performance on superscalar OoO processors. The downside is you must go in order - you can’t efficiently seek the nth point on the Z-curve.
I’m not sure how one would conveniently structure this in most languages. In Haskell the signature was something like
data Z = Z Word64 Word64
zcurve :: Monad m => (Z -> m ()) -> m ()
And the compiler was able to turn this into a tight assembly loop. The sort of inversion of control here, where your business logic has to migrate to the bottom of a bunch of stack frames, seems like it might be hard to structure in a language like C++. You could use an explicit stack data structure and a loop, but I wonder if that would hurt performance.