1. 43
  1. 36

    Better title, “don’t just check performance on the highest-end hardware.” Applies to other stuff too, like native apps — developers tend to get the fastest machines, which means the code they’re building always feels fast to them.

    During the development cycle of Mac OS X (10.0), most of the engineers weren’t allowed to have more than 64MB of RAM, which was the expected average end-user config — that way they’d feel the pain of page-swapping and there’d be more incentive to reduce footprint. I think that got backpedaled after a while because compiles took forever, but it was basically a good idea (as was dog-fooding the OS itself, of course.)

    1. 4

      Given that the easy solution is often the most poorly performing and that people with high-end hardware have more money and thus will be the majority of your revenue, it would seem that optimising for performance is throwing good money after bad.

      You are not gonna convince websites driven by profit with sad stories about poor people having to wait 3 extra seconds to load the megabytes of JS.

      1. 6

        Depends on who your target audience is. If you are selling premium products, maybe. But even then, there are people outside of tech who are willing to spend money, just not on tech. So I would be very careful with that assumption.

        1. 2

          It’s still based on your users and the product you sell. Obviously Gucci, Versace and Ferrari have different audiences but the page should still load quickly. That’s why looking at your CrUX reports and RUM data helps with figuring out who you think your users are and who’s actually visiting your web site.

          I don’t own a Ferrari but I still like to window shop. Maybe one day I will. Why make the page load slow because you didn’t bother to optimize your JavaScript?

        2. 5

          These days your page performance (e.g. Core Web Vitals) is an SEO factor. For public sites that operate as a revenue funnel, a stakeholder will listen to that.

          1. 3

            I don’t work on websites, but my understanding is that generally money comes from ad views, not money spent by the user, so revenue isn’t based on their wealth. I’m sure Facebook’s user / viewer base isn’t mostly rich people.

            Most of my experience comes from working on the OS (and its bundled apps, like iChat). It was important that the OS run well on the typical machines out in the world, or people wouldn’t upgrade, or buy a new Mac.

            1. 2

              Even if you were targeting only the richest, relying on high-end hardware to save you would be a bad strategy.

              • Mobile connections can have crappy speeds, on any hardware.
              • All non-iPhone phones are relatively slow, even the top-tier luxury ones (e.g. foldables). Apple has a huge lead in hardware performance, and other manufacturers just can’t get equally fast chips for any price.
              • It may also backfire if your product is for well-off people, but not tech-savvy people. There are people who could easily afford a better phone, but they don’t want to change it. They see tech upgrades as a disruption and a risk.
            2. 3

              I’ve heard similar (I believe from Raymond Chen) about Windows 95 - you could only have the recommended spec as stated on the box unless you could justify otherwise.

              1. 2

                It would be very useful if the computer could run at full speed while compiling, throttling down to medium speed while running your program.

                1. 1

                  Or you use a distributed build environment.

                  1. 1

                    If you use Linux, then I believe this can be accomplished with cgroups.
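
                    For what it’s worth, a rough cgroup v2 sketch of that idea (paths and limits are illustrative, needs root): cap the program under test while the compiler stays unconstrained.

                    ```shell
                    # Create a cgroup for the program under test
                    mkdir /sys/fs/cgroup/slowbox
                    # Allow 50ms of CPU per 100ms period (roughly half of one core)
                    echo "50000 100000" > /sys/fs/cgroup/slowbox/cpu.max
                    # Simulate a 64MB machine for this group
                    echo 64M > /sys/fs/cgroup/slowbox/memory.max
                    # Move the current shell into the group, then launch the program
                    echo $$ > /sys/fs/cgroup/slowbox/cgroup.procs
                    ./myprogram
                    ```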

                  2. 2

                    They might have loved a distributed build system at the time. :) Compiling on fast boxes and running the IDE on slow boxes would’ve been a reasonable compromise I think.

                    1. 1

                      most of the engineers weren’t allowed to have more than 64MB of RAM,

                      Can OS X even run on that amount of RAM?

                      1. 15

                        OS X 10.0 was an update to OPENSTEP, which ran pretty happily with 8 MiB of RAM. There were some big redesigns of core APIs between OPENSTEP and iOS to optimise for power / performance rather than memory use.

                        OPENSTEP was really aggressive about not keeping state for UI widgets. If you have an NSTableView instance on OPENSTEP, you have one NSCell object (<100 bytes) per column and this is used to draw every cell in the table. If it’s rendering text, then there’s a single global NSTextView (multiple KiB, including all other associated state) instance that handles the text rendering and is placed over the cell that the user is currently editing, to give the impression that there’s a real text view backing every cell. When a part of the window is exposed and needs redrawing, the NSCell instances redraw it. Most of the objects that are allocated on the drawing path are in a custom NSZone that does bump allocation and bulk free, so the allocation is cheap and the objects are thrown away at the end of the drawing operation.
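
                        That NSCell design is essentially the flyweight pattern. As a rough sketch in JavaScript (illustrative names, not real AppKit API), one renderer per column is reconfigured and reused for every row, so a table with thousands of rows allocates a handful of renderers rather than a view per cell:

                        ```javascript
                        // One stateless renderer is reused for every cell in its column.
                        class CellRenderer {
                          setValue(v) { this.value = v; return this; }
                          draw(row, col) { return `cell(${row},${col})=${this.value}`; }
                        }

                        function drawTable(rows, columns) {
                          // One renderer per column, shared by every row in that column
                          const renderers = columns.map(() => new CellRenderer());
                          const out = [];
                          rows.forEach((rowData, r) => {
                            columns.forEach((col, c) => {
                              // Reconfigure the shared renderer, then draw this cell
                              out.push(renderers[c].setValue(rowData[col]).draw(r, c));
                            });
                          });
                          return out;
                        }
                        ```

                        The memory cost is O(columns) regardless of row count, at the price of the serial drawing the later paragraphs describe.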

                        With OS X, the display server was replaced with one that did compositing by default. Drawing happened the same way, but each window’s full contents were stored. This was one of the big reasons that OS X needed more RAM than OPENSTEP. The full frame buffer for a 24-bit colour 1024x768 display is a little over 2 MiB. With OPENSTEP, that’s all you needed. When a window was occluded, you threw away the contents and drew over it with the contents of the other window[1]. With OS X, you kept the contents of all windows in memory[2] . If you’ve got 10 full-screen windows, now you need over 20 MiB just for the display. In exchange for this, you get faster UI interaction because you’re not having to redraw on expose events.

                        Fast forward to the iPhone era and now you’ve got enough dedicated video memory that storing a texture for every single window had a fairly negligible impact on GPU space, and spending 1-2 MiB of system memory per window to have a separate NSView instance (even something big like NSTextView) for every visible cell in a table was pretty negligible; the extra developer effort required to use the NSCell infrastructure was not buying anything important.

                        To make matters worse, the NSCell mechanisms were intrinsically serial. Because every cell was drawn with the same NSCell instance, you couldn’t parallelise this. In contrast, an NSView is stateful and, as long as the controller / model support concurrent reads (including the case that they’re separate objects), you can draw them in parallel. This made it possible to have each NSView draw in a separate thread (or on a thread pool with libdispatch), spreading the drawing work across cores (improving power, because the cores could run in a lower power state and still be faster than one core doing all of the work in a higher power state, with the same power envelope).

                        It also meant that the result of drawing an NSView could be stored in a separate texture (a Core Animation layer) and, if the view hadn’t changed, be composited very cheaply on the GPU without needing the CPU to do anything other than drop a couple of commands into a ring buffer. All of this improves performance and power consumption on a modern system, but would have been completely infeasible on the kind of hardware that OPENSTEP or OS X 10.0 ran on.

                        [1] More or less. The redraws actually drew a bit more than was needed, and the extra was stored in a small cache: doing a redraw for every row or column of pixels that was exposed was too slow, so asking views to draw a little bit more and caching it made a window appear smooth as it was gradually revealed. Each window would (if you moved the mouse in a predictable path) draw the bit that was most likely to be exposed next, and the display server would simply copy that into the frame buffer as the mouse moved.

                        [2] Well, not quite all - if memory was constrained and you had some fully occluded windows, the system would discard them and force redraws on expose.

                        1. 1

                          Thanks for this excellent comment. You should turn it into a mini post of its own!

                        2. 3

                          Looks as if the minimum requirement for OS X 10.0 (Cheetah) was 128 MB (unofficially 64 MB minimum).

                          1. 2

                            Huh. You know, I totally forgot that OS X first came out 20 years ago. This 64M number makes a lot more sense now :)

                          2. 1

                            10.0 could, but not very well; it really needed 128MB. But then, 10.0 was pretty slow in general. (It was released somewhat prematurely; there were a ton of low-hanging-fruit performance fixes in 10.1 that made it much more useable.)

                        3. 8

                          Interesting read! I had no idea there was such a wide gulf between iPhone and non-iPhone performance. When I switched to an iPhone a year or two ago I definitely noticed that everything felt faster, but the magnitude of the difference is actually a little shocking.

                          1. 4

                            This is not far from why the M1 chip’s effective performance surprised many people, while for us iOS developers it was simply a bigger leap than we expected. The A-series chips have been improving quickly and steadily for a decade.

                          2. 3

                            I actually chuckled. This is seriously a self aware wolf moment. This guy is so very, very close to realizing how to fix the problem but is skipping probably the most important step.

                            He mentioned single-core performance at least 5 times in the article but completely left out multi-core performance. Even the Moto E, the low-end phone of 2020, has 8 cores to play with. Granted, some of them are going to be efficiency/low-performance cores, but 8 cores nonetheless. Utilize them. WebWorkers exist. Please use them. Here’s a library that makes it really easy to use them as well.
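
                            As a minimal sketch of the offload pattern (the hypothetical `heavyComputation` stands in for real work, and the function falls back to running inline where `Worker` doesn’t exist):

                            ```javascript
                            // A stand-in for expensive, pure work you don't want on the main thread.
                            function heavyComputation(n) {
                              let sum = 0;
                              for (let i = 1; i <= n; i++) sum += i;
                              return sum;
                            }

                            function runOffMainThread(n) {
                              if (typeof Worker === 'undefined') {
                                // No Worker available (e.g. outside the browser): run inline.
                                return Promise.resolve(heavyComputation(n));
                              }
                              return new Promise((resolve, reject) => {
                                // Build the worker from a Blob so the sketch needs no separate file.
                                const src = `${heavyComputation.toString()};
                                  onmessage = (e) => postMessage(heavyComputation(e.data));`;
                                const worker = new Worker(URL.createObjectURL(new Blob([src])));
                                worker.onmessage = (e) => { resolve(e.data); worker.terminate(); };
                                worker.onerror = reject;
                                worker.postMessage(n);
                              });
                            }
                            ```

                            On the main thread you’d call something like `runOffMainThread(bigN).then(render)` and stay free to handle input while the work runs elsewhere.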


                            Here’s a video that probably not enough people have watched.

                            The main thread is overworked and underpaid

                            1. 7

                              The article claims the main performance cost is in DOM manipulation and Workers do not have access to the DOM.

                              1. 1

                                If you’re referring to this:

                                Browsers and JavaScript engines continue to offload more work from the main thread but the main thread is still where the majority of the execution time is spent. React-based apps are tied to the DOM and only the main thread can touch the DOM therefore React-based apps are tied to single-core performance.

                                That’s pretty weak. Any JavaScript application that modifies the DOM is tied to the DOM. It doesn’t mean the logic is tied to the DOM. If it is, then at least in React’s case it means that developers thought rendering, then re-rendering, then rendering again was a good application of users’ computing resources.

                                I haven’t seen their code and I don’t know what kinds of constraints they’re being forced to program under, but React isn’t their bottleneck. Wasteful logic is.

                                1. 2

                                  The author’s point is that a top of the line iPhone can mask this “wasteful logic”. Unless developers test their websites on other, less expensive, devices they may not realize that they need to implement some of your suggested fixes to achieve acceptable performance.

                                  1. 1

                                    You’re right. I missed the point when I read into how he was framing the problem. Excuse me.

                              2. 3
                                1. iPhones also have many cores, so that’s not going to bridge the gap.

                                2. From TFA: “Browsers and JavaScript engines continue to offload more work from the main thread but the main thread is still where the majority of the execution time is spent.”

                                3. See also: Amdahl’s Law

                                1. 1

                                  Gonna fight you on all of these points because they’re a bunch of malarkey.

                                  iPhones also have many cores, so that’s not going to bridge the gap.

                                  If you shift the entire performance window up then everyone benefits.

                                  From TFA: “Browsers and JavaScript engines continue to offload more work from the main thread but the main thread is still where the majority of the execution time is spent.”

                                  This shouldn’t be the case. If it is, then people are screwing around and running computations in render() when everything should be handled before that. Async components should alleviate this and React Suspense should help a bit with this, but right now I use Redux Saga to move any significant computation to a WebWorker. React should only be hit when you’re hydrating and diffing. React is not your bottleneck. If anything it should have a near-constant overhead for each operation. You should also note that the exact quote you chose does not mention React, but all of JavaScript. Come on.

                                  See also: Amdahl’s Law

                                  I did. Did you see how much performance you gain by going to 8 identical cores? It’s 6x. Would you consider that to be better than only having 1x performance? I would.

                                  1. 1

                                    Hmm... if you’re going to call what I write “malarkey”, it would help if you actually had a point. You do not.

                                    If you shift the entire performance window up then everyone benefits.

                                    Yep, that’s what I said. If everyone benefits, it doesn’t close the gap. You seem to be arguing against something that nobody said.

                                    Amdahl’s law … 8 identical cores? 6x speedup

                                    Er, you seem to not understand Amdahl’s Law, because it is parameterised and does not yield a number without that parameter, which is the portion of the work that is parallelisable. So saying Amdahl’s Law says you get a speedup of 6x from 8 cores is not just wrong, it is nonsensical.
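
                                    To make the parameterisation concrete: the claimed 6x on 8 cores only falls out of the formula if you assume roughly 95% of the work parallelises.

                                    ```javascript
                                    // Amdahl's law: speedup = 1 / ((1 - p) + p / n)
                                    // p = parallelisable fraction of the work, n = number of cores
                                    function amdahlSpeedup(p, n) {
                                      return 1 / ((1 - p) + p / n);
                                    }

                                    // amdahlSpeedup(0.95, 8) ≈ 5.93  -- the "6x" claim assumes p ≈ 0.95
                                    // amdahlSpeedup(0.50, 8) ≈ 1.78  -- half-serial work barely doubles
                                    ```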

                                    Second, you now write “8 identical cores”. I think we already covered that phones do not have 8 high performance cores, but at most something like 4/4 high/efficiency cores.

                                    Finally, even with a rare, near-perfectly parallelisable task, that kind of speedup compared to a non-parallel implementation is exceedingly rare, because parallelising has overhead, and on a phone other resources such as memory bandwidth typically can’t handle many cores going full tilt.

                                    but the main thread is still where the majority of the execution time is spent

                                    This shouldn’t be the case … React …

                                    The article doesn’t talk about what you think should be the case, but about what is the case, and it’s not exclusively about React.

                              3. 1

                                I did a double take initially, because early iPhones are my responsive layout minimum spec for screen width.