As far as I can tell, the speedup with learned indexes had little to do with the learning, and a good deal to do with the ease of implementing them in parallel. With massive parallelism comes the ability to speed up the algorithm by using GPUs.
Classical data structures with parallel lookups should work just as well (if not far better) than learned indexes in this case.
The execution time of 50-ish nanoseconds is already in the range where memory latency may be dominant. If the mythical TPUs that execute a sufficiently large NN in a single cycle were glued on CPUs, perhaps the learned indexes could shave off a few cycles. AIUI using GPUs as they are now would make little sense because of massive latency.
Now could we have useful hash functions that execute in a single cycle?
There are already many fast hash functions (clhash and falkhash) some of which take advantage of specialized hardware instructions.
It doesn’t seem worth it to add dedicated instructions when there are existing algorithms that hash at a fraction of a cycle per byte.
The things I’ve seen before are usually measured in cycles per byte with a low number of cycles. So, what do you mean execute in a single cycle? Are you saying you want it a byte a cycle, a specific number of bytes done in a cycle on a CPU, or that on a parallelized machine like a GPU? And how good does that hash function have to be in terms of collisions? Can it be like the textbook examples people use in day to day programming or a unique as cryptographic kinds provide?
I wasn’t looking for an actual answer to that. The particular numbers selected for the examples are somewhat arbitrary.
My line of thought was that if we’re doing hardware acceleration, then maybe instead of throwing a big neural net and a 120 teraflops TPU at the problem [and be still restrained by memory latency], we could use cuckoo hashes (or whatever) and accelerate some other link in the chain with far less hardware. A fast and secure hardware-accelerated hash function could actually prove useful, but that’s just one possibility.
A fancy version could take a base address and a key and compute some number of hashes, fetch all the resulting addresses from memory, and then compare all the keys in parallel. Call it crazy for a CPU instruction. Such a massive TPU is pretty crazy too, and I’m not yet 100% convinced we’re going to get them on the CPU die :-)