This talk and the outline of the Zen2 microarchitecture, both frontend and backend, are exactly what I had been looking for. It’s quite interesting that all direct unconditional branch instructions experience SLS on Zen1 and Zen2. I would’ve thought the mitigations would be akin to using lfence to serialize everything up until the branch, or overwriting the RSB or BTB with the usual add rsp, 8 plus jmp/call [repeat_block_N] sequence. I haven’t looked into the latest AMD processors, but I’d hope the majority expose IBPB/IBRS/STIBP controls that software can enable.
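For anyone who wants to check which of those controls their own machine reports, here is a minimal sketch. It assumes a Linux system, where the kernel lists the features it detected as ibpb, ibrs and stibp on the flags line of /proc/cpuinfo. The function name spec_ctrl_flags is just something I made up for illustration.

```python
from pathlib import Path

# Flag names as they appear on the "flags" line of /proc/cpuinfo on Linux.
SPEC_CTRL_FLAGS = ("ibpb", "ibrs", "stibp")


def spec_ctrl_flags(cpuinfo_text):
    """Return {flag: bool} based on the first 'flags' line in cpuinfo_text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {flag: flag in present for flag in SPEC_CTRL_FLAGS}
    # No flags line found (non-x86 or unexpected format): report nothing.
    return {flag: False for flag in SPEC_CTRL_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: running on Linux
    if path.exists():
        print(spec_ctrl_flags(path.read_text()))
    else:
        print("no /proc/cpuinfo here; this sketch assumes Linux")
```

Note that a flag showing up only means the kernel sees the feature, not that the mitigation is switched on. On Linux, /sys/devices/system/cpu/vulnerabilities/spectre_v2 reports what the kernel actually enabled.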
I suppose another mitigation, likely not mentioned in the interest of time/relevance, would be converting all indirect branches into serializing instruction sequences… though this would likely be done using lfence after a load and prior to the next dispatch, and lfence isn’t exactly cheap.
Great talk, thanks for sharing the slides. If anyone is interested in the BPU there are loads of resources, but I’ve found the most useful material on SemanticScholar and some interesting info on the RAS from StuffedCow: