Indeed a fascinating read. Am I correct in assuming that the prefetch circuitry will always continue to load up to 6 bytes ahead, regardless of memory area? If so, a system designer had to be careful not to put memory-mapped I/O with side effects right after a memory area containing code. The window for this to happen is pretty small, and then there’s the fact that the 8086 has an I/O space separate from memory, so maybe this was not seen as an issue.
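To make the concern concrete, here is a rough Python sketch that checks whether a device register falls inside the window the prefetcher might touch past the end of a code region. The 6-byte depth comes from the question above; the function name `prefetch_hazard`, the example addresses, and the one extra byte of slack for word-wide fetches are assumptions for illustration, not a statement about real 8086 bus timing.

```python
QUEUE_DEPTH = 6  # bytes the prefetcher may run ahead (figure from the question above)


def prefetch_hazard(code_end, mmio_start, mmio_len, slack=1):
    """Return True if an MMIO range overlaps the bytes just past the code.

    code_end   -- first address after the last instruction byte
    mmio_start -- first address of the memory-mapped device range
    mmio_len   -- length of the device range in bytes
    slack      -- assumed extra byte to allow for a word-wide fetch
    """
    window_start = code_end
    window_end = code_end + QUEUE_DEPTH + slack  # exclusive upper bound
    mmio_end = mmio_start + mmio_len             # exclusive upper bound
    # Standard half-open interval overlap test.
    return mmio_start < window_end and window_start < mmio_end


# Hypothetical layouts: a register 4 bytes past the code, and one 16 bytes past it.
print(prefetch_hazard(0x8000, 0x8004, 2))
print(prefetch_hazard(0x8000, 0x8010, 2))
```

Under these assumptions, a small guard gap after the code would be enough to keep read-sensitive registers out of reach.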
That’s fascinating. If I understood correctly, writes went via the same unit that managed the queue, so self-modifying code could store into an address that had already been prefetched and still work. That causes quite a headache for later implementations: they had to handle any store that overwrites something in the instruction cache. With SMP, this included stores from other cores. I believe modern implementations make this a very slow path by detecting it late, flushing all speculative state, and resuming from the committed store. Later architectures gained a lot of simplicity by requiring explicit fences to synchronise instruction fetch with data writes (everyone except RISC-V learned that this becomes a huge perf hit if you don’t make it a broadcast cache invalidate rather than a core-local operation).
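The difference between the two designs can be sketched with a toy model in Python. Everything here is hypothetical (the class `ToyFetchUnit`, the `snoop_stores` flag, the `fence` method); it only contrasts a fetch unit that watches stores with one that relies on an explicit fence, and says nothing about which behaviour the 8086 itself had.

```python
class ToyFetchUnit:
    """Toy model: a byte memory plus a small queue of prefetched bytes."""

    def __init__(self, memory, depth=6, snoop_stores=False):
        self.memory = bytearray(memory)
        self.depth = depth
        self.snoop_stores = snoop_stores
        self.queue = {}  # address -> byte captured at prefetch time

    def prefetch(self, start):
        # Capture `depth` bytes starting at `start`, as they are right now.
        self.queue = {a: self.memory[a] for a in range(start, start + self.depth)}

    def store(self, addr, value):
        self.memory[addr] = value
        if self.snoop_stores and addr in self.queue:
            # Hardware-coherent design: a hit on prefetched bytes drops the queue.
            self.queue.clear()

    def fence(self):
        # Explicit synchronisation requested by software.
        self.queue.clear()

    def fetch(self, addr):
        # Prefer the queued byte; fall back to memory if nothing is queued.
        return self.queue.get(addr, self.memory[addr])


# Without snooping, a store into the prefetched range is not seen until a fence.
plain = ToyFetchUnit(bytes(16))
plain.prefetch(0)
plain.store(2, 0x90)
stale = plain.fetch(2)
plain.fence()
fresh = plain.fetch(2)

# With snooping, the same store is visible to the next fetch immediately.
snoop = ToyFetchUnit(bytes(16), snoop_stores=True)
snoop.prefetch(0)
snoop.store(2, 0x90)
seen = snoop.fetch(2)
```

In the snooping variant the cost is hidden in `store`, which has to check every write against what has been fetched; in the fence variant that cost moves to software, which must remember to call `fence`.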