Remember that we can’t get a direct comparison between architectures based on this change to the MacBook lineup, as there was also a huge change in process node. I believe the Intel chips were on 14nm, while the M1 is on 5nm. The process shrink is likely responsible for most of the performance and efficiency gains Apple is reporting, rather than any differences in architecture.
I don’t think this is the exact paper you’re looking for, but a paper on a similar theme is “Scalability! But at what COST?”, a perennial favorite with my team. It is a fairly snarky exploration of how the pursuit of “scalability” for its own sake has meant that many distributed processing systems are in fact slower than a reasonably fast naive implementation running on a single thread.
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
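For anyone who hasn't read it: the paper's metric, COST, is the hardware configuration a system needs before it outperforms a competent single-threaded implementation. Here is a rough sketch of how you might compute it from your own measurements. The function name and the timings are made up for illustration and are not taken from the paper.

```python
from typing import Optional


def cost(single_thread_seconds: float, scaled_runs: dict[int, float]) -> Optional[int]:
    """Return the smallest core count whose runtime beats the single-threaded
    baseline, or None if no measured configuration does (unbounded COST)."""
    for cores in sorted(scaled_runs):
        if scaled_runs[cores] < single_thread_seconds:
            return cores
    return None


# Hypothetical timings (seconds) for the same job; not real measurements.
baseline = 300.0  # competent single-threaded implementation
runs = {1: 1500.0, 8: 600.0, 16: 350.0, 64: 200.0, 128: 150.0}  # cores -> runtime

print(cost(baseline, runs))  # prints 64
```

With these invented numbers the system scales nicely from 1 to 128 cores, yet it needs 64 cores just to match what one thread already does.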
Yes, this is a good paper!