This may be a dumb question, but does adding serialization/deserialization greatly increase the latency of the RAM? Won’t there always be a benefit in keeping RAM directly attached?
10-15ns penalty on Power10 because of the externally attached DRAM controller. It’s just a footnote.
10-15ns hardly constitutes a footnote for main memory latency.
On the machine that I write this on, latency to DRAM is 170ns. It’s a high-end multi-socket capable CPU from 2018.
It matters far less there than on most customer workloads.
Hmmm. Could you share your machine specs and measurements in more detail? On my less high end machine from 2017 main memory latency is ~60ns. And from personal experience I’d be shocked if any recent CPU had >100ns main memory latency.
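For scale, the 10-15ns penalty mentioned above can be set against the two main memory latencies quoted so far (~60ns and 170ns). This is only back-of-the-envelope arithmetic on the numbers in this thread; the helper name `relative_penalty` is made up for illustration.

```python
def relative_penalty(base_ns: float, extra_ns: float) -> float:
    """Extra latency expressed as a fraction of the baseline latency."""
    return extra_ns / base_ns


# Figures quoted in the thread: ~60 ns (2017 client machine) and
# 170 ns (2018 multi-socket server CPU); 10-15 ns is the Power10 penalty.
for base_ns in (60.0, 170.0):
    low = relative_penalty(base_ns, 10.0)
    high = relative_penalty(base_ns, 15.0)
    print(f"{base_ns:.0f} ns baseline: +{low:.0%} to +{high:.0%}")
```

On these numbers the same penalty is roughly a sixth to a quarter of the client latency, but well under a tenth of the server latency.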
Client processors have far lower memory latency than server ones.
It’s nearly impossible to find a server CPU with memory latency below 100ns. But at least there’s plenty of bandwidth.
A random example from a machine (not my daily, which isn’t using AMD CPUs): https://media.discordapp.net/attachments/682674504878522386/807586332883812352/unknown.png
Wow! That’s the same as if the memory was on the other side of the room.
What does directly attached mean? Suppose main memory is connected to the CPU cores via a cache, another cache and a third cache, is that directly attached? Suppose main memory is distant enough, latent enough, that a blocking read takes up as much time as executing 100 instructions, is that directly attached?
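A rough sketch of the "100 instructions" comparison: convert a memory latency into core clock cycles. The 3 GHz clock and the one-instruction-per-cycle rate are assumptions picked for illustration, not figures from this thread; only the latencies (60ns, 100ns, 170ns) are.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Core clock cycles that elapse during one blocking memory read."""
    return latency_ns * clock_ghz  # ns * (cycles per ns)


CLOCK_GHZ = 3.0  # assumed clock speed, not taken from the thread
IPC = 1.0        # assumed instructions retired per cycle (conservative)

for latency_ns in (60.0, 100.0, 170.0):
    cycles = stall_cycles(latency_ns, CLOCK_GHZ)
    print(f"{latency_ns:.0f} ns -> {cycles:.0f} cycles, ~{cycles * IPC:.0f} instructions")
```

Under those assumptions even the 60ns case costs more than 100 instructions' worth of time, so by this measure main memory is already quite "distant".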
It’s a question of physics, really: How quickly can you send signals 5cm there and 5cm back? Or 10cm, or 15cm. Modern RAM requires sending many signals along slightly different paths and having them arrive at the same time, and “same time” means on the time scale that light takes to travel a few millimeters. Very tight time constraints.
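To put rough numbers on that, here is a small calculation of signal flight time. The speed of light is exact; the velocity factor of 0.5 for a board trace is an assumption (it depends on the board material), and "a few millimeters" is taken as 3 mm.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (exact by definition)
VELOCITY_FACTOR = 0.5      # assumed fraction of c for a PCB trace; varies with the material


def propagation_ps(distance_m: float, velocity_factor: float = VELOCITY_FACTOR) -> float:
    """Time in picoseconds for a signal to cover distance_m."""
    return distance_m / (C_M_PER_S * velocity_factor) * 1e12


for one_way_cm in (5, 10, 15):
    round_trip_ps = propagation_ps(2 * one_way_cm / 100)
    print(f"{one_way_cm} cm each way: ~{round_trip_ps:.0f} ps round trip")

# "A few millimeters" of light travel, taken here as 3 mm in vacuum.
print(f"3 mm at c: ~{propagation_ps(0.003, 1.0):.1f} ps")
```

With these assumptions the round trip itself takes on the order of a nanosecond, while the alignment window between parallel signals is on the order of ten picoseconds.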
(Almost two decades ago we shifted from parallel interfaces to serial ones for hard drives, AIUI largely to get rid of that synchronisation problem, although I’m sure the narrower SATA cables were more convenient in an everyday sense too.)
Even DIMMs have page and row selection mechanisms that introduce variable latency depending on what you want to access. Add 3 levels of caching between that and the CPU and it’s rather likely that with some slightly larger caches somewhere and much higher bandwidth (as is the promise of independent lanes of highly tuned serial connections) you more than compensate for any latency cost incurred by serialization.
Also, memory accesses are per cache line (usually 64 bytes) these days, so there’s already some kind of serialization going on when you try to push 512 bits (+ control + ECC) over a 288-pin DIMM, of which only 64 pins (plus ECC bits) carry data.
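The serialization of a cache line can be counted directly. This sketch assumes a 64-bit DIMM data bus, and the DDR4-3200 transfer rate is just an example speed grade, not something stated in the thread.

```python
def beats_per_cache_line(line_bytes: int = 64, data_bus_bits: int = 64) -> int:
    """Number of bus transfers ("beats") needed to move one cache line."""
    line_bits = line_bytes * 8
    # Round up in case the line is not a multiple of the bus width.
    return -(-line_bits // data_bus_bits)


def burst_time_ns(beats: int, transfers_per_s: float) -> float:
    """Time on the data bus for one burst at the given transfer rate."""
    return beats / transfers_per_s * 1e9


beats = beats_per_cache_line()  # 512 bits over an assumed 64-bit data bus
rate = 3200e6                   # assumed DDR4-3200, i.e. 3200 MT/s
print(beats, "beats,", burst_time_ns(beats, rate), "ns on the bus")
```

So one 64-byte line already goes out as 8 back-to-back transfers, which matches the burst length of 8 that DDR4 uses.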
It does! It’s not in the mainline source tree, but is tagged. Here is what seems to be the most recent version.
“Note that a full binary cannot be generated from this source.”
Seems like that future is now as well; as noted in the footnotes, the imx8m requires some SPL to load this trainer into the external LPDDR4 memory, and you have to execute some self-extracting binary that shows some EULA to even get the trainer.
Modern PCs/workstations are just scaled-down mainframes, or on the way there. I’m still waiting for optical cables connecting the CPU to drives, extension cards, network cards, maybe even RAM.