
  2. 12

    I agree with the author on the facts but not the conclusion: pulling on the thread of disabling all performance-variable optimizations will ultimately slow down processors, but I think it’s time we bite that bullet. We should be planning for architectures with very many dumb cores and GPUs, and using programming languages and compilers that can play well with such architectures.

    I’m thinking of designs where very small-footprint “actors” or coroutines (bundles of code and small, cacheable state) are convenient to write, where math is generally expressed as matrices, and where hot sections can be marked for automatic (JIT) execution on the GPU when that’s likely to be faster.
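
    As a rough sketch of the shape I mean, here’s a minimal Go version (illustrative only: the names are made up, goroutines are heavier than the actors I’d want, and there’s no matrix math or GPU JIT here). Each actor is just a bundle of code plus a few bytes of state:

    ```go
    package main

    import (
        "fmt"
        "sync"
    )

    // counter is one tiny "actor": a bundle of code plus a small,
    // cacheable piece of state, driven entirely by its inbox.
    type counter struct {
        inbox chan int // messages: amounts to add
        state int      // small, cache-friendly state
    }

    // spawn starts the actor and hands back its inbox.
    func spawn(wg *sync.WaitGroup, results chan<- int) chan<- int {
        a := &counter{inbox: make(chan int)}
        wg.Add(1)
        go func() { // one cheap goroutine per actor
            defer wg.Done()
            for delta := range a.inbox {
                a.state += delta
            }
            results <- a.state // report final state once the inbox closes
        }()
        return a.inbox
    }

    func main() {
        const n = 1000 // thousands of actors are cheap
        var wg sync.WaitGroup
        results := make(chan int, n)
        inboxes := make([]chan<- int, 0, n)
        for i := 0; i < n; i++ {
            inboxes = append(inboxes, spawn(&wg, results))
        }
        for _, in := range inboxes {
            in <- 1
            close(in)
        }
        wg.Wait()
        close(results)
        sum := 0
        for s := range results {
            sum += s
        }
        fmt.Println(sum) // 1000
    }
    ```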

    I’m not aware of any programming language or compiler that would be perfect for this, but I think several existing ones have many of these features.

    Yes, a bunch of existing code is going to be slower, but it will be anyway as more and more processor optimizations are disabled.

    1. 11

      With a growing percentage of the world’s energy output going into computing of some form or another, that’s hardly an eco-friendly approach. Optimization doesn’t just mean “faster” in the sense of satisfying someone’s impatience, it means doing less work. It means running fewer machines to serve 10,000 RPS. It means putting fewer charge cycles on someone’s phone battery so it lasts a little longer before it becomes trash. Et cetera. There have to be better approaches than inflicting deliberate brain damage.

      1. 4

        No-one’s saying we should do more work. I’m saying we should use architectures that are efficient when used differently from current ones. And I say that because current architectures can only achieve speed (wall clock, latency) at the cost of security holes.

        1. 5

          Remember, we tried this with Itanium: rather than having a smart processor pipeline figure out dependencies and execute instructions in parallel, the compiler would emit bundles of instructions that could be executed in parallel. The unfortunate catch was that this required brand-new compilers to be better at expressing parallelism than mature processors already were, with the result that performance was uncompetitive out of the gate and the whole architecture died.

          There’s a big difference between optimal eventual theoretical outcomes and practical paths to achieve them.

          1. 2

            Right, but as all the fancy optimizations get disabled because of side-channel attacks, that performance gap will close.

            1. 1

              The other impression I got about Itanium was that deferring that much optimisation to the code created an existential problem: it would be difficult or impossible for future architectural redesigns, or clean-room designs by competitors, to be meaningfully better at running the same code, because the optimisations baked into the code may not hold for a drastically different design. Not a CPU architect, though.

          2. 4

            > Optimization doesn’t just mean “faster” in the sense of satisfying someone’s impatience, it means doing less work

            Optimisations at the CPU level are almost always a power / throughput / latency tradeoff. Speculative execution, by definition, consumes more power because it is doing some work that will be thrown away. A pipeline without speculative execution will always consume less energy than one that does speculative execution[1] because it is only executing the instructions that are actually required, whereas the speculative one will execute a superset of these instructions. Out-of-order execution requires register rename logic, which consumes power in exchange for higher throughput and lower latency.

            To @Student’s point, with Verona we think we have a programming model that will make it natural to express very high degrees of parallelism and let us saturate cores designed like the UltraSPARC T1 (Niagara), which had simple in-order pipelines and high-order SMT so that if it hit a branch or a cache miss it could always run another thread. It didn’t perform very well with most C/C++ codebases because these languages generally don’t expose very high orders of concurrency. It was not even great for Java because, while Java programs generally have a load of threads, they generally have a much smaller number of runnable threads.

            [1] Which doesn’t necessarily mean that the entire CPU will, because the speculative one may be able to complete a burst workload faster and then turn off caches and go to sleep.

            1. 2

              The when construct in Verona looks like goroutines in Go, but disciplined. Very interesting.

              1. 2

                It’s similar, in that it dispatches some work to complete asynchronously, but different in that it provides strong causal ordering guarantees and enforces dataflow. when (a, b) {...} in Verona will cause the closure in the braces to defer execution until it has exclusive access to a and b (with a guarantee that both a and b can be acquired at some point in the future, without deadlock).

                The model is actually closer to Erlang or Pony in some ways. If a when clause has a single cown that it requires then it is as if that cown is an actor. when (a) {...} is equivalent to sending a message to actor a that contains the closure in the braces. The generalisation of the actor model for Verona is that you can when over multiple cowns, whereas there’s no concept in actor-model programming of ‘send a message to two actors that allows you to access the state of both actors when it is received’.

                The idea is that your code looks like sequential code with straight-line data flow, but ends up being executed in parallel. We’ll see how that actually works once the compiler is finished…
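
                As a rough Go analogy (my own sketch, not the Verona runtime; when2 and Cown are made-up names): taking the two cowns in a fixed global order before running the closure gives the same no-deadlock guarantee, though unlike Verona it is neither causally ordered nor non-blocking:

                ```go
                package main

                import (
                    "fmt"
                    "sync"
                    "unsafe"
                )

                // Cown stands in for Verona's "concurrent owner": a mutex
                // guarding some state. (Made-up names, not Verona's API.)
                type Cown struct {
                    mu    sync.Mutex
                    state int
                }

                // when2 mimics `when (a, b) {...}`: it schedules the body to
                // run once it has exclusive access to both cowns. Locking in
                // a fixed global order (here, by address) guarantees both can
                // always be acquired eventually, with no deadlock. Assumes
                // a and b are distinct cowns.
                func when2(a, b *Cown, wg *sync.WaitGroup, body func(a, b *Cown)) {
                    wg.Add(1)
                    go func() {
                        defer wg.Done()
                        first, second := a, b
                        if uintptr(unsafe.Pointer(b)) < uintptr(unsafe.Pointer(a)) {
                            first, second = b, a
                        }
                        first.mu.Lock()
                        defer first.mu.Unlock()
                        second.mu.Lock()
                        defer second.mu.Unlock()
                        body(a, b)
                    }()
                }

                func main() {
                    var wg sync.WaitGroup
                    x, y := &Cown{}, &Cown{}
                    // Two clauses over overlapping cowns never run at the same
                    // time; Verona additionally guarantees causal order, which
                    // this sketch does not.
                    when2(x, y, &wg, func(a, b *Cown) { a.state++; b.state++ })
                    when2(y, x, &wg, func(a, b *Cown) { a.state += 10; b.state += 10 })
                    wg.Wait()
                    fmt.Println(x.state, y.state) // 11 11
                }
                ```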

        2. 3

          Is there any evidence that this is actually exploitable? The signal should be very weak: you can spot one possible value out of 2^512 for each 512-bit cache line. The only case we were able to think of was public-key crypto, using big-number (e.g. 2048-bit) arithmetic. With 2048-bit arithmetic, each value is 4 cache lines (2048 bits = 256 bytes = four 64-byte lines), and if you can contrive a plaintext and detect that one of the operations on the plaintext value combined with the key gives a zero value for 512 of the bits that they use, then you might have some possible attacks. Some neural-network workloads with dense matrix representations may allow partial model-stealing attacks if you can find the runs of zeroes easily (though the bit that you probably care about is the pattern of zeroes and ones in the not-completely-zero parts). For anything else, it doesn’t seem like it leaks a meaningful amount of data (you probably have a lot of hardware and software side channels that provide a stronger signal). I’d love to see counterexamples, if anyone has them.
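
          To make the weakness of the signal concrete, a small Go sketch (zeroLines is a made-up name): all the hypothesised channel tells you is which whole 64-byte lines of a value are zero, so a 2048-bit bignum yields at most 4 bits of information:

          ```go
          package main

          import "fmt"

          const lineSize = 64 // bytes per cache line

          // zeroLines reports which cache lines of a value are entirely
          // zero -- the only thing this side channel could observe.
          func zeroLines(value []byte) []bool {
              lines := make([]bool, (len(value)+lineSize-1)/lineSize)
              for i := range lines {
                  end := (i + 1) * lineSize
                  if end > len(value) {
                      end = len(value)
                  }
                  allZero := true
                  for _, b := range value[i*lineSize : end] {
                      if b != 0 {
                          allZero = false
                          break
                      }
                  }
                  lines[i] = allZero
              }
              return lines
          }

          func main() {
              key := make([]byte, 256)    // a 2048-bit value
              key[70] = 1                 // one non-zero byte in the second line
              fmt.Println(zeroLines(key)) // [true false true true]
          }
          ```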

          1. 2

            One thing that bothers me about fixes like this is that they take a one-size-fits-all approach. If I am operating a multiuser server that allows shell access by customers I don’t know, then I definitely want to patch this kind of vulnerability to protect my users from each other. But if I am running Ubuntu on my personal desktop machine, slowing all my code down to protect against another logged-in user messing with my processes is pointless: if there is ever another user logged in at all, I already have bigger problems to worry about. In that case I want all the optimizations Intel or AMD can dream up, even if they leak internal CPU state like a sieve.

            1. 6

              The problem is that your personal desktop machine presumably executes JavaScript on websites, which is enough to trigger Rowhammer. These sorts of low-level attacks can bypass a lot of the security model that modern computing relies on.

              1. 2

                It protects you against malware running as you, as well. The thing about a timing attack is that it may not require much in the way of privileges.

                That said, you should be free to choose speed and your own set of risk mitigations; on Linux, for instance, booting with mitigations=off makes exactly that trade.

              2. 1

                Has anyone got a link to a running scoreboard of Intel vs. AMD and the others on the performance cost of their CPU-vulnerability mitigations?