Looking forward to the next installment. The value of the Arm partnership really can’t be overstated. The agreement requires that all members review any proposal for inclusion in the architecture for patent infringement and, if it infringes any of their patents, either disclose it or grant every other partner a license for use in compliant Arm implementations. This means that any implementer has access to a massive patent portfolio.
Some of the partnerships in this article very nearly set Arm up for failure. Fragmentation killed MIPS and almost killed Arm. For a time, there were three incompatible floating-point extensions on Arm (there were more on MIPS), which meant that everyone who cared about binary compatibility ended up using a soft-float ABI. This didn’t matter in the ‘90s because companies like Nokia compiled everything for a specific phone SoC, but it mattered once third-party application ecosystems started to emerge. It’s unclear whether it still matters: Apple requires developers to upload LLVM IR to the app store now, so they can easily compile different versions of apps for different hardware features (as they did for PAC, for example).
Apple only required Bitcode for Watch. But even on that it’s gone this year. If you try to submit a bitcode app today, App Store Connect just refuses and tells you to re-upload as a regular binary.
Apple Silicon machines are designed first and foremost to provide a secure environment for typical end-users running macOS as signed by Apple; they prioritize user security against third-party attackers, but also attempt to limit Apple’s own control over the machines in order to reduce their responsibility when faced with government requests, to some extent. In addition, the design preserves security even when a third-party OS is installed.
… these machines may possibly qualify as the most secure general purpose computers available to the public which support third-party OSes, in terms of resistance to attack by non-owners.
If you run a third-party OS on a Chromebook, doesn’t that severely compromise the security of the Chrome OS system? If I remember correctly, many Chromebooks required you to take out a screw to install another operating system and the process prevented secure boot from functioning on the primary Chrome OS installation.
What’s nice about Apple Silicon Macs (from my understanding) is that their secure boot settings are per-OS, not systemwide. You can still perform all of the signature checks on a macOS installation without doing so on a Linux system on the same disk.
Without some kind of physical intervention by users doesn’t that leave macs vulnerable to a persistent attack? Like an evil maid or trojan that installs something like a keylogging hypervisor that boots regular macOS. That would be indistinguishable from the perspective of the user and probably macOS yet could easily be malicious.
reboot again because you forgot which buttons you needed to press on the keyboard :D
press correct buttons during boot
Enter the recovery OS
Enter the administrator password
Change the security setting
That said, I had to work on a chromebook for a while and that didn’t require a screw or anything to get into the unsafe mode, it was also a key chord.
There are a few critical differences though:
Changing to the insecure mode on a Chromebook erases all local content
From the article it sounds like, beyond allowing you to launch an untrusted OS, the security features are available to multiple OSes (this is purely my reading of the article, I could very well be wrong). Whether Linux or whatever else actually supports/uses them, I don’t know.
You’ve now shifted the goalpost from your original question (original goalpost was “vulnerable to a persistent attack” due to not requiring something similar to Chromebooks’ screw removal, new goalpost is alleging flaws in the SEP). I’ll no longer be responding to you.
Up to the Apple A10 by the checkra1n jailbreak (to bypass the measurement by the SEP used to lock data access on access to DFU for more recent iOS releases).
On the Apple A13 onwards, the measurement of the current SEP firmware version (by the monitor) is a component of the encryption key, making such attacks no longer able to have user data access.
SolidRun MACCHIATObin ($349 with a useless 4GB DIMM included, $449 with a 16GB one, there’s a more expensive “Double Shot” version with more Ethernet ports and higher CPU clock.. out of the box, setting a jumper seems to achieve 2GHz on “Single Shot” just fine)
Supported in upstream TianoCore EDK2 + Arm TF-A. Firmware provided by your own build or mine, out of the box comes with U-Boot instead, can boot firmware from microSD or SPI flash or streamed over UART (for recovery).
PCIe controller does support ECAM, but has a bug. It doesn’t filter something about which device the packets are from (I forget the term) so some devices would appear as replicated into all slots (or just a couple slots). The upstream EDK2 workaround is the “ECAM offset” (making the OS only see the last device), but that’s bad because some devices (that use ARI (IIRC) or do their own filtering for some reason otherwise (modern Radeons)) don’t get duplicated, so a Radeon RX 480 would just be completely unseen with the offset, so it’s rolled back in my FW builds. But a Radeon HD 7950 did get duplicated into two slots. Had to patch the FreeBSD kernel to ignore the devices after the first one to test that card :)
There were plans to add setup toggles to upstream EDK2 for the offset, or to add a _HID that would let OSes use a quirk, but idk where that went.
Not the most attractive option now that the LX2160 is out there and with good firmware, you can get a 16-core instead of a 4-core.
and now some random fun stuff that we can’t get our hands on
Gigabyte MP30-AR1 (seems unobtainable by now, maybe keep monitoring ebay for years)
Huawei Kunpeng Desktop Board (not retail, only sold to businesses?)
Official website links to a “Get Pricing/Info” form with a “Budget” field where the smallest value is “less than $50,000”.
A non-standard protocol, Secure Launch, is used by Windows to be able to launch its hypervisor at EL2 on this platform. This protocol isn’t supported by Linux.
Huh, there is a way to run a hypervisor on Qualcomm’s weird firmware at all?! If someone reverse engineers this, would it eventually make KVM on Android phones possible? :D
It’s not the newest product around anymore, and its performance class today is in the same one as a much cheaper Raspberry Pi 4, although that has less I/O expansion.
Yup, but remember the “Secure” part, good luck for the signature. (the thing is that Qualcomm has already their hypervisor there for DRM stuff, and so if you just give access to EL2 to anyone their DRM scheme is toast)
This is very impressive computer science done by nvidia engineers. It just sucks that it’s hamstrung by a terrible distribution model. For a vast amount of software, adding a dependency on a proprietary compiler just isn’t going to fly. Especially when OpenCL, OpenGL compute shaders and Vulkan compute shaders all exist. Those obviously provide a worse development experience, but a better user experience (since more GPUs are supported), and – crucially for open-source projects – don’t require proprietary tools to build.
The license agreement contains fun tidbits, such as:
… NVIDIA grants you a … license … to Install and use the SOFTWARE only on computer system(s) running a specific operating system on which the SOFTWARE is designed to run and for which the SOFTWARE is intended to produce an executable image (“Target Systems”)
That just seems evil to me. It means that trying to get the SDK to run on unsupported versions of Linux, or on Windows, or on MacOS, or on any BSD, is a violation of copyright law (at least if that part of the license is actually enforceable).
It also contains:
You shall strictly prohibit the further distribution of the Run-Time Files by users of an End-User Application
So it seems you’re not allowed to compile software using the HPC SDK and then release the binaries under a license which allows a user to redistribute the binary.
You also “agree to notify NVIDIA in writing of any known or suspected distribution or use of the SOFTWARE not in compliance with the requirements of this SLA, and to enforce the terms of your agreements with respect to distributed SOFTWARE”. I don’t know how that would be interpreted exactly, but it sounds a lot like I would be forced to send a written notice to nvidia if I ever see the HPC SDK on pirate sites, or if I see a software project which uses the HPC SDK but doesn’t have the required attribution, etc etc etc.
For a vast amount of software, adding a dependency on a proprietary compiler just isn’t going to fly.
I don’t think that’s the target market, anyway. Considering that they used LULESH from Lawrence Livermore as an example, it seems they’re targeting large HPC installations, and those won’t have any qualms about switching to a proprietary compiler if it delivers a big enough speed up and they don’t distribute binaries so it’s a non-issue for them. According to this technical report they were fine using Intel’s compiler, and whatever proprietary system they needed to compile for the BlueGene super computer.
Yeah, obviously execs at nvidia aren’t stupid, they know what they’re doing. If they thought they could make more money through releasing the compiler under an open source (or just less hostile) license they would have. It’s just sad that this remarkable feat of computer science gets relegated to a few niche use cases when the technology behind it has the potential to significantly improve the world of computing.
For context: The NVIDIA HPC SDK is the continuation of the (quite expensive) PGI compiler. It only went from paid at huge prices to free and publicly accessible this August, and the EULA hasn’t quite been adapted to that fact yet.
As for changing that license, which is much less appropriate now that far more people can use it: I’ll ask NVIDIA about it, though I can’t provide any guarantees.
With the introduction of macOS on ARM, I was curious how well Windows applications have been updated to support Windows on ARM, which was first released a little over two years ago. Since then, well, it doesn’t look like much progress has been made. Of the popular applications I was able to come up with, only one, VLC, has support for ARM64. I think a large part of this difficulty is the fact that virtually no development platforms support ARM64. Many Windows apps are built with WPF, which doesn’t support ARM64. Even on the Microsoft side, only a handful of applications are compiled natively for ARM64.
I hope that by calling out the lack of support for ARM64 I can help push the platform forward and encourage more applications to release ARM64 versions!
Microsoft doesn’t have fat binaries. That makes a huge difference.
On macOS I press “Build” in Xcode and ship it. Assuming the code was portable, that’s all I need to do. Users don’t even need to know what CPU they have, and apps will continue to work—natively—even when the user copies them to a machine with a different CPU.
For Windows, I need to offer a separate download, and ask users to choose the version for the CPU they have, and then deal with support tickets for “what is CPU and why your exe is broken?” Or maybe build my own multi-arch installer that an ARM machine can run under emulation, but it can still detect and install for the non-emulated CPU. I don’t have time for either of these, so I don’t ship executables for ARM Windows, even though I could build them.
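For what it’s worth, a single installer stub can at least ask Windows what the real machine is, even when the stub itself runs under emulation. A minimal sketch, assuming a recent Windows SDK (IsWow64Process2 needs Windows 10 version 1511 or later); the payload file names are made up:

    /* arch_stub.c - pick the right payload for a multi-arch installer stub.
       IsWow64Process2 reports the *native* machine even from an emulated
       x86/x64 process on an ARM64 box. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        USHORT process_machine = 0, native_machine = 0;

        if (!IsWow64Process2(GetCurrentProcess(), &process_machine, &native_machine)) {
            fprintf(stderr, "IsWow64Process2 failed: %lu\n", GetLastError());
            return 1;
        }

        switch (native_machine) {
        case IMAGE_FILE_MACHINE_ARM64:
            puts("install payload-arm64.msi");   /* hypothetical payload names */
            break;
        case IMAGE_FILE_MACHINE_AMD64:
            puts("install payload-x64.msi");
            break;
        default:
            puts("install payload-x86.msi");
            break;
        }
        return 0;
    }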
I don’t mean this to be too snarky, but do you test it on both systems?
I already have a multi-arch installer, and my code compiles for ARM64 fine, but I wouldn’t want to update the installer to point to a binary that I’ve never executed. The lack of supported virtualization options is noteworthy here. Right now my only real option is to spend $500-$1000 on an ARM Windows machine for this specific purpose.
Without a Mac devkit it’s hard to be sure, but I’d swear I saw a demo where Xcode can just launch the x64 version under Rosetta, so it becomes possible to test both on one machine. Unfortunately developers need new hardware because there’s no reverse-Rosetta for running ARM code on x64, so porting will still take time.
Unfortunately developers need new hardware because there’s no reverse-Rosetta for running ARM code on x64
I’m not so sure that we really need reverse-Rosetta. The iOS simulator runs x86_64 binaries and is really accurate (except performance-wise). The Apple ecosystem already has an extensive experience of supporting both ARM and x86_64 binaries, and most Macs should be ARM in a few years in anyway. And there is already the ARM Mac mini thingy for developers.
I haven’t, actually. In case of PPC and x86->x64 switches I just bought the new machine and tested only there. I already knew my code worked on the old architecture, so testing on both didn’t seem critical. In Apple’s case these are transitions rather than additions of another platform.
I don’t know if anyone is actually shipping things like this, but it is possible to do this on Windows by building the application as a DLL and then using a tiny .NET assembly that queries the current architecture and then loads and P/Invokes the correct version of the DLL. I saw a proof-of-concept for this a very long time ago, but I don’t think there’s tooling for it.
I’m not really convinced by how Apple does fat binaries. It might be a space saving if the linker could deduplicate data segments, but (last I checked) ld64 didn’t and so you really end up with two binaries within a single file. The NeXT approach was a lot more elegant. Files specific to each OS / architecture (NeXT supported application bundles that ran on OpenStep for Windows or OpenStep for Solaris as well as OPENSTEP) were in separate directories within the bundle, along with directories for common files. You could put these on a file server and have apps and frameworks that worked on every client that mounted the share, or you could install them locally and trivially strip out the versions that you didn’t need by just deleting their directories.
The ditto tool on macOS was inherited from NeXT and supported thinning fat bundles and was extended to support thinning fat binaries when Apple started shipping them. That’s a bit awkward for intrusion detection things because it requires modifying the binary and so tooling needs to understand to check signatures within the binary, whereas the NeXT approach just deleted files.
Now that no one runs applications from a file share, the main benefit from fat binaries is during an upgrade. When you buy a new Mac, there’s a migration tool that will copy everything from the old machine to the new one, including applications. With a decent app store or repo infrastructure, such a tool would be able to just pull down the new versions. Honestly, I’d much rather that they just extended the metadata in application and library bundles to include a download location and hash of the versions for other architectures. Then you could go and grab them when you migrated to a different system but not waste bandwidth and disk space on versions that you don’t need.
Now that no one runs applications from a file share, the main benefit from fat binaries is during an upgrade. When you buy a new Mac, there’s a migration tool that will copy everything from the old machine to the new one, including applications. With a decent app store or repo infrastructure, such a tool would be able to just pull down the new versions. Honestly, I’d much rather that they just extended the metadata in application and library bundles to include a download location and hash of the versions for other architectures.
Obviously this was way back before code signatures became very load bearing on OS X… during the Intel transition I used to have a script that would spin over an app bundle and use lipo to create “thin” binaries so I could have enough room on my little SSD for all the things I used. I also pruned unnecessary localization files.
I forget what size that SSD was, but the difference was significant enough that learning lipo and scripting it out was worth my time.
I can definitely understand how confusing it would be to offer multiple architecture downloads. That being said I would strongly encourage you to at least provide a way to get to the ARM64 version if it’s trivial for you to build for it. That way seekers can run your app with the best performance on their machine.
Honestly I’m surprised that tools like MSIX don’t support multiple architectures.
Ended up submitting three PRs, because it seems that you didn’t notice that Rust and Firefox literally just work as native Arm, and that Visual Studio Code has it in the beta channel, with release happening within the month too. :-)
(Just as a note for anyone not following the GitHub discussion)
VS Code will be marked as available once the ARM64 version is in the stable channel.
Rust requires an additional step not required by the x86 version, so until the experience is transparent to the user I’m not going to mark it as available. That being said, I’ll be filing an issue with rustup to hopefully get it to install the ARM64 toolchain by default.
Firefox might get marked as available depending on if Firefox Installer.exe installs the ARM64 version.
I think that Apple replied in the original post (in the section “A Note On Web Applications Added to the Home Screen”):
As mentioned, the seven-day cap on script-writable storage is gated on “after seven days of Safari use without user interaction on the site.” That is the case in Safari. Web applications added to the home screen are not part of Safari and thus have their own counter of days of use. Their days of use will match actual use of the web application which resets the timer. We do not expect the first-party in such a web application to have its website data deleted.
If your web application does experience website data deletion, please let us know since we would consider it a serious bug. It is not the intention of Intelligent Tracking Prevention to delete website data for first parties in web applications.
PWAs are not affected if they are installed onto the home screen; if you keep using them inside Safari, they are still affected, and so are all other web sites. It is also a bit confusing because of this wording:
Web applications added to the home screen are not part of Safari and thus have their own counter of days of use.
Emphasis added by me to highlight that they are counting days for the installed PWA usage, which makes me wonder if they are deleting it as well or why they are counting the days of usage in such cases. I don’t know.
Safari: each day you use it counts as a day. If 7 days of usage have passed without you visiting a specific website, its data is erased.
Homescreened webpage: each day you use it counts as a day. If 7 usage days have passed without you visiting the website, its data is erased. But since you visit the website every time you click the icon on your home screen, the counter should never go above 1. (If you’re using some third party domain to store the data, if it gets erased depends on how the webpage works and what your user does.)
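In other words, the cap is counted in days of use, not calendar days. A toy model of the rule as I read it (the names and structure here are mine, not WebKit’s):

    /* Toy model of the "seven days of use" rule described above.
       Call once per day on which this browsing context is actually used. */
    #include <stdbool.h>

    #define CAP_DAYS 7

    struct site_state {
        int use_days_since_interaction;   /* days of use, not calendar days */
    };

    static bool should_delete_script_storage(struct site_state *s,
                                             bool user_interacted_today)
    {
        if (user_interacted_today) {
            s->use_days_since_interaction = 0;   /* a visit resets the counter */
            return false;
        }
        return ++s->use_days_since_interaction >= CAP_DAYS;
    }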
I would say that both Erlang and OpenCL are made for multicore. They are very different languages, because the problems they solve are very different, but they still target multicore systems. There are different aspects of parallelism that they target.
As for taking a serial program and making use of parallelism, this is actually possible with C and Fortran. With gcc you can get, if you’re careful, instruction-level parallelism through auto-vectorization. I believe that Intel’s compiler can even give you automatic multithreading.
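For instance, a plain loop like the one below is something GCC will usually vectorize on its own at -O3 (or with -O2 -ftree-vectorize), and -fopt-info-vec reports what it did; the file and function names are just for the example:

    /* saxpy.c - a loop GCC's auto-vectorizer handles without source changes.
       Try: gcc -O3 -fopt-info-vec -c saxpy.c */
    #include <stddef.h>

    void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];     /* independent iterations, unit stride */
    }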
Context: Intel decided to kill the HLE part of TSX on current CPUs via a microcode update… and on Linux it was chosen to kill the other part too instead of relying on mitigations.
For Windows, it’s a bit more complex because it doesn’t support interrupt controllers outside of a standard GIC for the ARMv8 port, so some patching will be required.
I’ve been working on getting a basic Linux port running for a while now, though.
For GrayKey and such, this allows them to image and then restore back the keys when the SEP thrashes them from NAND after the input attempts are exceeded, making it possible to continue bruteforcing.
Yes, they use a custom AIC interrupt controller instead of the ARM GIC unlike pretty much everyone else now.
Also, their CPUs since the A10 only implement EL1 and EL0, no EL2 or EL3 anywhere in them (plus a metric ton of custom registers, from KTRR through APRR, WKdm compression extensions, and more, plus AMX on A13 onwards).
Also about non-standard interrupt controllers and Windows, forgot to talk about the Raspberry Pi exception, which was a very special case that didn’t happen twice.
Did I mention that it can bypass iCloud locked devices? (To turn them on with a custom/stock OS, not to break into another person’s OS, see SEP comment in other comment branch “below”)
I can personally confirm that they are running on the main AP cores.
It’s however possible that Xcode public tools won’t have support for it, and apps would only use it through Accelerate.framework in that case. (support for AMX in Accelerate.framework is already done since a long time)
If every OSS maintainer had a nickel for every time a random person on the internet implicitly accused them of hating freedom because a program behavior does not align with their political beliefs, the problem of funding OSS maintenance would be solved.
Maybe make it so that you have to pay a few cents to make an issue without a patch in an issue tracker? It’s certainly not a perfect solution, but at least people would have to start thinking about the importance of their comments.
Making people pay money would just leave lots of bugs unreported until the golden master, which wouldn’t be good for testing coverage, especially as businesses do not tend to touch prerelease core OS libraries…
It also would make the process of reporting bugs more complex, with anonymity being harder to guarantee for some users who would rather like it.
This article starts talking about something interesting but it looks as if the author hit publish when it was in an early design stage.
The idea that it’s talking about is having a type-1 hypervisor that sits below any kernel and exposes functionality for managing VMs into either one privileged guest or multiple guests. This is the model that Xen used from the start, with either Linux or NetBSD typically filling the role of the privileged guest. There was some work to allow domU to launch VMs by delegating resources assigned to it, but I’m not sure how far this went.
Windows also follows this model. Hyper-V implements a public spec and, in theory, Windows can use any hypervisor that provides the same interface. This is important because Windows actually relies on the hypervisor for some security functionality. The Hyper-V design has a notion of a virtual trust level, effectively giving a set of orthogonal privilege modes within a VM such that a VM can drop privilege for most of the kernel and retain strong isolation guarantees for the rest of it. Things like the credential manager run at a higher VTL so that a compromise in the kernel doesn’t automatically leak kernel-held secrets (though it does expose the APIs for accessing them, so an attacker may still be able to privilege elevate). Various other monitoring things live at this level.
Android with Hafnium has a similar model, where there’s a small hypervisor that is designed to allow components such as the credential store to be isolated from compromises of the Linux kernel.
The big difference between Xen and the other two is that Xen ships a scheduler in the hypervisor. Hyper-V and Hafnium both place their trusted guest in the TCB for availability, even when in modes that isolate it for confidentiality and integrity. This makes sense on Android because if Linux crashes then it doesn’t really matter if any of the other services work. It also makes sense in the Windows model (on the client, at least) for similar reasons.
The Arm Realms model is somewhat different. Arm talks about Realms as if they’re building a hardware confidential computing solution but they’re actually doing something a lot more sensible: providing hardware acceleration for a privilege-separated hypervisor. The RMM is an absolutely minimal hypervisor that provides guarantees about realm (guest) isolation (pages are either private to a realm, or shared and the realm is aware that they’re shared). The RMM is not responsible for allocating memory or scheduling. A hypervisor in EL2 (outside of the Realm World) must find pages to allocate to a realm, pass them to the RMM (at which point it loses any ability to access them), ask the RMM to create a realm with access to those pages, and add VCPUs to that realm, and so on. The EL2 hypervisor is trusted to schedule realms (it also has access to the reset line for the system, so there’s no way of removing it from the TCB for availability). If a realm needs to be paged out or migrated, the RMM is responsible for providing encrypted and integrity-protected copies of the pages to EL2-owned memory, but EL2 is responsible for swapping them out or transferring them to a new machine that can start the realm again. The idea is that the EL2 hypervisor will be feature-rich and large, but only the code in the RMM is able to violate realm-isolation guarantees and the RMM is small enough that it could be formally verified (I don’t know of any plans to do that yet).
Probably assumed too much familiarity with the area when writing it…
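Concretely, from the EL2 hypervisor’s point of view the create-and-run path described above looks roughly like the sketch below. This is pseudocode in C: the rmi_* names are stand-ins loosely modelled on the RMM specification’s RMI interface (really SMC calls), not an actual API, and the addresses are arbitrary.

    #include <stdint.h>

    /* Illustrative stand-ins, declared only so the sketch compiles. */
    void *alloc_granule(void);                      /* EL2's own page allocator */
    void  rmi_granule_delegate(void *page);         /* hand a page to the RMM   */
    void  rmi_realm_create(void *rd);
    void  rmi_data_create(void *rd, void *page, uint64_t realm_ipa, const void *src);
    void  rmi_rec_create(void *rd, void *rec, uint64_t entry);
    void  rmi_realm_activate(void *rd);
    void  rmi_rec_enter(void *rec);                 /* run the realm's vCPU     */

    void start_realm_sketch(const void *initial_image)
    {
        /* 1. EL2 picks ordinary pages that it owns. */
        void *rd   = alloc_granule();   /* realm descriptor backing page */
        void *rec  = alloc_granule();   /* vCPU (REC) backing page       */
        void *data = alloc_granule();   /* page holding the guest image  */

        /* 2. Delegate them to the RMM; EL2 loses all access from here on. */
        rmi_granule_delegate(rd);
        rmi_granule_delegate(rec);
        rmi_granule_delegate(data);

        /* 3. Ask the RMM to assemble a realm out of the delegated pages. */
        rmi_realm_create(rd);
        rmi_data_create(rd, data, 0x80000000u, initial_image);
        rmi_rec_create(rd, rec, 0x80000000u);
        rmi_realm_activate(rd);

        /* 4. Scheduling stays with EL2: it decides when the vCPU runs, but it
              never sees realm memory or register state in the clear. */
        for (;;)
            rmi_rec_enter(rec);
    }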
For Hyper-V actually both modes are available (https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/manage/manage-hyper-v-scheduler-types). On client Hyper-V, scheduling is delegated to the root domain. On server Hyper-V, the hypervisor has its own scheduler.
The root domain is still always required to be functional however, if only for VMM tasks and management.
However, I omitted Hyper-V shielded VMs from the article because Microsoft didn’t implement them properly from the security perspective, resulting in relatively easy security guarantee breakage… :-/
The TLFS isn’t the most complete thing in the world sadly.
In practice too. :-) (at least for a certain subset)
VTL1 is an interesting design (with an equally cursed SecureKernel implementation that has its own downsides)… Apple has a hardware-assisted implementation of the concept with PPL (which uses the GXF lateral privilege level ISA extension) - https://blog.svenpeter.dev/posts/m1_sprr_gxf/.
In practice however, it fills the same role as an enclave in the way that Microsoft currently uses it.
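To make the use-versus-extract point concrete, here is a toy model in C. It has nothing to do with the real SecureKernel/VTL interfaces (all names are invented, and within one process the boundary is only notional); the point is just that the normal side can invoke the credential but has no call that returns the key:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* --- pretend this half lives in VTL1 --------------------------------- */
    static uint64_t secret_key = 0x243F6A8885A308D3ull;     /* never exported */

    static uint64_t vtl1_cred_mac(const uint8_t *msg, size_t len)
    {
        uint64_t h = secret_key;
        for (size_t i = 0; i < len; i++)    /* toy keyed hash, not real crypto */
            h = (h ^ msg[i]) * 0x100000001B3ull;
        return h;
    }
    /* deliberately no vtl1_cred_export() */

    /* --- pretend this half is the VTL0 kernel ----------------------------- */
    int main(void)
    {
        const uint8_t msg[] = "auth request";
        /* A compromised VTL0 can still *use* the credential (privilege
           escalation inside the box), but it cannot read the key behind it. */
        printf("tag = %016llx\n",
               (unsigned long long)vtl1_cred_mac(msg, sizeof msg - 1));
        return 0;
    }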
Gunyah on the Qualcomm side is a practical implementation of the design, and is what is shipped there today. The downside of not having a Realms-style lateral privilege level, of course, is that you lose EL2 access for Linux on those Qualcomm platforms.
Interesting:
I’m surprised to not see ChromeOS mentioned here, reading this analysis it seems it would stand up fairly well?
If you run a third-party OS on a Chromebook, doesn’t that severely compromise the security of the Chrome OS system? If I remember correctly, many Chromebooks required you to take out a screw to install another operating system and the process prevented secure boot from functioning on the primary Chrome OS installation.
What’s nice about Apple Silicon Macs (from my understanding) is that their secure boot settings are per-OS, not systemwide. You can still perform all of the signature checks on a macOS installation without doing so on a Linux system on the same disk.
Without some kind of physical intervention by users doesn’t that leave macs vulnerable to a persistent attack? Like an evil maid or trojan that installs something like a keylogging hypervisor that boots regular macOS. That would be indistinguishable from the perspective of the user and probably macOS yet could easily be malicious.
It does require physical actions. You have to press the correct buttons during boot, enter the recovery OS, enter the administrator password, and change the security setting.
That said, I had to work on a chromebook for a while and that didn’t require a screw or anything to get into the unsafe mode, it was also a key chord.
There are a few critical differences though:
The article answered this.
It relies on their SEP being trustworthy which doesn’t have a great track record…
You’ve now shifted the goalpost from your original question (original goalpost was “vulnerable to a persistent attack” due to not requiring something similar to Chromebooks’ screw removal, new goalpost is alleging flaws in the SEP). I’ll no longer be responding to you.
Wait, when was the SEP compromised?
Up to the Apple A10 by the checkra1n jailbreak (to bypass the measurement by the SEP used to lock data access on access to DFU for more recent iOS releases).
On the Apple A13 onwards, the measurement of the current SEP firmware version (by the monitor) is a component of the encryption key, making such attacks no longer able to have user data access.
How do these projects work, given that Nvidia has its own proprietary compiler?
Nvidia’s proprietary compiler (NVCC) is actually just a wrapper around the host’s compiler and LLVM and reverse-engineering it isn’t that hard (in fact Google did it to implement CUDA support in LLVM and I did it in order to add support for CUDA in GNAT, GCC’s Ada frontend).
Ironically, CUDA is incredibly well documented…
Yes, I found it really odd that you could get so much information about the ISA, the execution model and even the various steps the toolchain goes through but nothing about the transformations happening to the source code. Fortunately gcc -E and nvcc --verbose --keep help a lot there :).
The interactions between AWS and companies behind open source databases are always going to devolve into scuffles that hurt the community.
The article quotes a tweet that points out how Elastic is “in it for the money”. Well yeah. The same exact thing is also true of AWS though, the only difference being that AWS is already making most of the money (by far, and the same goes for every other OSS DB that they offer in DBaaS form), while at the same time Amazon is the company where employees have to pee in bottles and sales data from the main Amazon business is used to identify successful products and clone them.
In other words, Amazon wants the money and it also needs to control how its brand is perceived by the public, making their actions no less empty than those of Elastic, aside from the fact that Elastic is not as good at this game, apparently.
(disclaimer: am an engineer at AWS, but working on something totally unrelated to this)
About the “in it for the money” part:
When a company makes their product open-source, they explicitly renounce their monopoly on the commercial exploitation of that product, in exchange for greater adoption.
It’s not a light decision to take and comes with long-term consequences. Wanting to close the product afterwards is trying to have their cake and eat it too, after the market has already adopted it.
If they wanted others to not commercially use/offer the product, their choice was to keep it closed-source. It might not have been anywhere near as popular in that case however.
And I don’t see how your complaints about some things happening in the Amazon retail org affect AWS. :)
What, you think that they wouldn’t make you pee in a bottle too if it helped their bottom line?
This may be a dumb question, but does adding serialization/deserialization greatly increase the latency of the RAM? Won’t there always be a benefit in keeping RAM directly attached?
10-15ns penalty on Power10 because of the externally attached DRAM controller. It’s just a footnote.
10-15ns hardly constitutes a footnote for main memory latency.
On the machine that I write this on, latency to DRAM is 170ns. It’s a high-end multi-socket capable CPU from 2018.
It matters far less there than on most customer workloads.
Hmmm. Could you share your machine specs and measurements in more detail? On my less high end machine from 2017 main memory latency is ~60ns. And from personal experience I’d be shocked if any recent CPU had >100ns main memory latency.
Client processors have far lower memory latency than server ones.
It’s nearly impossible to find a server CPU with below 100ns of memory latency. But at least, there’s plenty of bandwidth.
A random example from a machine (not my daily, which isn’t using AMD CPUs): https://media.discordapp.net/attachments/682674504878522386/807586332883812352/unknown.png
Wow! That’s the same as if the memory was on the other side of the room.
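For anyone who wants to reproduce numbers like these, the usual trick is a dependent pointer chase over a working set much larger than the caches. A quick sketch (the 1 GiB size and hop count are arbitrary, and the result includes TLB-miss cost, so it slightly overstates raw DRAM latency):

    /* chase.c - rough main-memory latency probe via a dependent pointer chase.
       Build with something like: cc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1u << 27)      /* 2^27 entries * 8 bytes = 1 GiB working set */
    #define HOPS 20000000u

    int main(void)
    {
        size_t *next = malloc((size_t)N * sizeof *next);
        if (!next) return 1;

        /* Build one big random cycle (Sattolo's algorithm), so every hop
           is a cache miss with no useful prefetch pattern. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t p = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < HOPS; i++)
            p = next[p];                     /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per dependent load (p=%zu)\n", ns / HOPS, p);
        return 0;
    }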
What does directly attached mean? Suppose main memory is connected to the CPU cores via a cache, another cache and a third cache, is that directly attached? Suppose main memory is distant enough, latent enough, that a blocking read takes up as much time as executing 100 instructions, is that directly attached?
It’s a question of physics, really: How quickly can you send signals 5cm there and 5cm back? Or 10cm, or 15cm. Modern RAM requires sending many signals along slightly different paths and having them arrive at the same time, and “same time” means on the time scale that light takes to travel a few millimeters. Very tight time constraints.
(Almost two decades ago we shifted from parallel interfaces to serial ones for hard drives, AIUI largely to get rid of that synchronisation problem although I’m sure the narrower SATA cables were more convenient in an everyday sense too.)
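To put rough numbers on the physics point (my own back-of-the-envelope, taking signal propagation in a PCB trace at about half the speed of light):

    \[ t_{\text{round trip}} \approx \frac{0.10\,\text{m}}{0.5 \times 3\times10^{8}\,\text{m/s}} \approx 0.67\,\text{ns} \]

That is already a couple of bit times on a DDR4-3200 data pin (one bit time is 1/3.2 GHz ≈ 0.31 ns), which is why per-lane skew has to be tuned to a small fraction of that.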
Even DIMMs have page and row selection mechanisms that introduce variable latency depending on what you want to access. Add 3 levels of caching between that and the CPU and it’s rather likely that, with some slightly larger caches somewhere and much higher bandwidth (as is the promise of independent lanes of highly tuned serial connections), you more than compensate for any latency cost incurred by serialization.
Also, memory accesses are per cache-line (usually 64 byte) these days, so there’s already some kind of serialization going on when you try to push 512 bits (+ control + ECC) over 288 pins.
I could see this as wanting to keep the implementation as simple as possible, so the question becomes: would we actually want this safety built in or is it enough to put the whole thing into a “secure box”?
A core design principle of WebAssembly was that it be able to provide a target for more or less any language. That meant the language object model would be opaque to the VM; it also means the lifetime of an allocation is opaque to the VM. The result of that is that the WASM VM’s basic memory model has to be a single blob of addressable memory. Some of this is also because of the Mozilla “JS subset” wasm precursor (asm.js) that operated on typed arrays.
This brings with it other constraints - no builtin standard library, no standard types, no object introspection - and because it’s intended to be used in the browser, validation and launch must be fast, as caching of generated code is much less feasible than in a native app - hence the control flow restrictions.
The result of all of this is that you can compile Haskell to wasm without a problem, or .NET, or C++ and they all run with the same VM, and none of them incur unreasonable language specific perf penalties (for example you cannot compile Haskell to the CLR or JVM without a significant performance penalty), but compiling to WASM works fine. C/C++ can treat pointers and integers as interchangeably and unsafely as they like, without compromising the browser. And .NET and JVM code can apparently (based on other comments so could be totally wrong here) run in that WASM VM as well.
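As a tiny illustration of that last point: in a file like the sketch below the casts are just integer arithmetic on offsets into the module’s one linear memory, so a wasm toolchain (clang with --target=wasm32, Emscripten, and so on) accepts it, and the worst a bogus address can do is corrupt the module’s own heap or trap at the memory boundary. The function names are mine, purely for the example:

    /* Pointer/integer games that wasm's linear-memory model tolerates:
       a wasm32 pointer is just a 32-bit offset into the module's single memory. */
    #include <stdint.h>

    uint32_t ptr_to_int(const char *p)
    {
        return (uint32_t)(uintptr_t)p;          /* pointer -> plain integer */
    }

    char int_to_deref(uint32_t addr)
    {
        char *p = (char *)(uintptr_t)addr;      /* integer -> pointer again */
        /* An out-of-range addr can only hit this module's linear memory or
           trap at its boundary; it cannot reach the embedder's memory. */
        return *p;
    }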
We of course also want inside-box safety. The question is cost and tradeoff.
If Java and .NET do it just fine (and they do), there’s no perf cost excuse there.
There’s a significant penalty to languages with significantly different type systems when running under .NET and the JVM. That’s why you tend to get similar, but slightly different, versions of languages - Scala, F#, etc instead of Haskell/*ML - basically the slight differences are changes to avoid the expensive impedance mismatch from incompatible type systems. The real Haskell type system cannot be translated to either .NET or the JVM - even with .NET’s VM level awareness of generic types - and as such takes a bunch of performance hits. Trust me on this.
Similarly compiling C and C++ to .NET requires sacrificing some pointer shenanigans that wasm allows (for better or for worse).
No, they don’t. C++ code compiled with /clr:safe does slow down. (It doesn’t slow down without the option, but it doesn’t provide inside-box safety either.)
Compared to /clr:pure yes, due to some optimisations missed on earlier .NET CLR versions. (the move to Core alleviated most of that overhead… but initially came with a removal of C++/CLI outright before it was added back), and of course, all C#/F# code runs with those checks enabled all the time.
Having the option is always better than not having it for things like this though.
I’d like to buy an M1 Mac for the better battery life and thermals, but I have to run a lot of Linux VMs for my job, so it’s a nonstarter.
If VirtualBox or VMWare or whatever adds support for M1 Macs to run ARM VMs and I could run CentOS in a virtual machine with reasonable performance, it would definitely affect my decision.
(Note that I’d still have to think about it since the software we ship only ships for x86_64, so it would be…yeah, it would probably still be a nonstarter, sadly.)
Parallels runs well, at least for Windows. I’ve heard the UI for adding Linux VMs is picky, but they’ll work fine too.
Much of the work around HVF/Virtualization.framework is to make Linux stuff drop-dead easy.
And QEMU is a good option for those too, if you pick up the HVF patchset from the mailing list.
VMWare Fusion support is coming, VirtualBox will not be supported according to Oracle.
I do have Parallels running Debian (10, arm64) on an M1. It was a bit weird getting it setup, but it works pretty well now, and certainly well enough for my needs.
There’s a Parallels preview for M1 that works: https://my.parallels.com/desktop/beta
It has VM Tools for ARM64 Windows, but not Linux (yet).
In my opinion Linux is a better experience under QEMU w/ the patches for Apple Silicon support (see https://gist.github.com/niw/e4313b9c14e968764a52375da41b4278#file-readme-md). I personally have it set up a bit differently (no video output, just serial) and I use X11 forwarding for graphical stuff. See here: https://twitter.com/larbawb/status/1345600849523957762
Apple’s XQuartz.app isn’t a universal binary yet so you’re probably going to want to grab xorg-server from MacPorts if you go the route I did.
Genuine question, why not just have a separate system to run the vm’s? That keeps the battery life nice at the expense of requiring network connectivity but outside of “on an airplane” use cases its not a huge issue i find.
That’s wrong, it’s switchable at runtime via the SETEND instruction on 32-bit Arm. (including from user-space!)
The target triplet for 32-bit Arm (big endian) with hardware floating point is armeb-linux-gnueabihf.
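A quick user-space demonstration of the SETEND point above, for 32-bit Arm only (this assumes the kernel/SCTLR configuration still permits SETEND, which ARMv8 deprecates; the file name and build command are just an example):

    /* setend.c - flip the data endianness of the current thread from user space.
       Build for 32-bit Arm, e.g.: arm-linux-gnueabihf-gcc -marm -O1 setend.c
       SETEND is A32/T32 only and deprecated in ARMv8; some configurations
       disallow it. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        volatile uint32_t word = 0x11223344u;
        uint32_t little, big;

        little = word;                                 /* normal little-endian load   */
        __asm__ volatile ("setend be" ::: "memory");   /* CPSR.E = 1: BE data accesses */
        big = word;                                    /* same address, byte-reversed */
        __asm__ volatile ("setend le" ::: "memory");   /* back to little-endian       */

        printf("LE view: %08x  BE view: %08x\n", little, big);
        return 0;
    }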
Thanks, I did not know that! Added it to the article, including the things you mentioned in your other comment.
ThunderX2 and Qualcomm Centriq also dropped 32-bit compatibility.
Cortex-A65(AE), Cortex-A34 too.
Why UEFI, and not (mainline) u-boot?
One objection could be that the standard shouldn’t be for U-Boot, but for something more generic. Of course, there are already standards for booting which aren’t specific to U-Boot. Cynically, I think it’s due to secure boot, for which EFI has a better story than U-Boot.
That’s much higher-level, and Linux-specific.
Linux isn’t the only OS that matters. (and this choice is not Secure Boot related)
UEFI is an actual standard instead of an implementation. Device Tree is also not a real standard, but it’s instead “whatever Linux does”.
UEFI + ACPI on Arm64 allows you to boot Windows, the BSDs…
Technically, there is a devicetree specification, but for the rest (e.g. 95% of real device trees) it’s just “whatever Linux does.” FWIW Linux does have good support for doing some traditional things a specification might require (like providing a way to validate one’s device trees), but the semantics of bindings are much more fast-and-loose.
And it turns out that Apple uses device trees for their own devices too, but an incompatible implementation that pre-dates Device Tree on Linux on Arm in the first place.
(same format set in stone from the iPhone in 2007 to the Apple Silicon Macs today)
See what it looks like: https://gist.github.com/freedomtan/825993147700874119f590ecfdae97ed
Quite close isn’t it? But yet totally incompatible.
That’s not surprising; OS X had quite a strong dependency on OpenFirmware, which is where Device Trees originated.
Device Tree and FDT are also somewhat different. The only official specification for FDT is the ePAPR specification. I’m the author of the BSD-licensed dtc tool, and almost all of the DTS files in the Linux tree now include extensions from the GPL’d dtc tool that are not part of the ePAPR standard. They are not always well documented, so some reverse engineering is typically needed. Modern Linux / FreeBSD FDT includes a basic form of dynamic linking (‘overlays’), so you can refer to things in one blob from another. In traditional OpenFirmware FDTs, any of this was handled by Forth code in the expansion module.
SolidRun MACCHIATObin ($349 with a useless 4GB DIMM included, $449 with a 16GB one; there’s a more expensive “Double Shot” version with more Ethernet ports and a higher CPU clock… out of the box, setting a jumper seems to achieve 2GHz on “Single Shot” just fine)
Supported in upstream TianoCore EDK2 + Arm TF-A. Firmware provided by your own build or mine; out of the box it comes with U-Boot instead. It can boot firmware from microSD or SPI flash, or streamed over UART (for recovery).
PCIe controller does support ECAM, but has a bug. It doesn’t filter something about which device the packets are from (I forget the term), so some devices would appear replicated into all slots (or just a couple of slots). The upstream EDK2 workaround is the “ECAM offset” (making the OS only see the last device), but that’s bad because some devices (ones that use ARI (IIRC), or that do their own filtering for some other reason, like modern Radeons) don’t get duplicated, so a Radeon RX 480 would just be completely unseen with the offset; it’s rolled back in my FW builds. But a Radeon HD 7950 did get duplicated into two slots. Had to patch the FreeBSD kernel to ignore the devices after the first one to test that card :) There were plans to add setup toggles to upstream EDK2 for the offset, or to add a _HID that would let OSes use a quirk, but idk where that went.
Not the most attractive option now that the LX2160 is out there with good firmware; you can get a 16-core instead of a 4-core.
and now some random fun stuff that we can’t get our hands on
Gigabyte MP30-AR1 (seems unobtainable by now, maybe keep monitoring ebay for years)
Ancient Applied Micro X-Gene 1. Proprietary AMI UEFI. See http://chezphil.org/rwanda/
Huawei Kunpeng Desktop Board (not retail, only sold to businesses?)
Official website links to a “Get Pricing/Info” form with a “Budget” field where the smallest value is “less than $50,000”.
Huh, there is a way to run a hypervisor on Qualcomm’s weird firmware at all?! If someone reverse engineers this, would it eventually make KVM on Android phones possible? :D
It’s not the newest product around anymore, and it’s in the same performance class today as a much cheaper Raspberry Pi 4, although that has less I/O expansion.
Yup, but remember the “Secure” part; good luck with the signature. (The thing is that Qualcomm already has their hypervisor there for DRM stuff, so if you just give anyone access to EL2, their DRM scheme is toast.)
This is very impressive computer science done by nvidia engineers. It just sucks that it’s hamstrung by a terrible distribution model. For a vast amount of software, adding a dependency on a proprietary compiler just isn’t going to fly. Especially when OpenCL, OpenGL compute shaders and Vulkan compute shaders all exist. Those obviously provide a worse development experience, but a better user experience (since more GPUs are supported), and – crucially for open-source projects – don’t require proprietary tools to build.
The license agreement contains fun tidbits, such as:
That just seems evil to me. It means that trying to get the SDK to run on unsupported versions of Linux, or on Windows, or on MacOS, or on any BSD, is a violation of copyright law (at least if that part of the license is actually enforceable).
It also contains:
So it seems you’re not allowed to compile software using the HPC SDK and then release the binaries under a license which allows a user to redistribute the binary.
You also “agree to notify NVIDIA in writing of any known or suspected distribution or use of the SOFTWARE not in compliance with the requirements of this SLA, and to enforce the terms of your agreements with respect to distributed SOFTWARE”. I don’t know how that would be interpreted exactly, but it sounds a lot like I would be forced to send a written notice to nvidia if I ever see the HPC SDK on pirate sites, or if I see a software project which uses the HPC SDK but doesn’t have the required attribution, etc etc etc.
I’ll pass on this one.
I don’t think that’s the target market, anyway. Considering that they used LULESH from Lawrence Livermore as an example, it seems they’re targeting large HPC installations, and those won’t have any qualms about switching to a proprietary compiler if it delivers a big enough speedup; they don’t distribute binaries, so it’s a non-issue for them. According to this technical report, they were fine using Intel’s compiler, and whatever proprietary system they needed to compile for the BlueGene supercomputer.
Yeah, obviously execs at nvidia aren’t stupid; they know what they’re doing. If they thought they could make more money by releasing the compiler under an open source (or just less hostile) license, they would have. It’s just sad that this remarkable feat of computer science gets relegated to a few niche use cases when the technology behind it has the potential to significantly improve the world of computing.
For context: the NVIDIA HPC SDK is the continuation of the (quite expensive) PGI compiler. It only went from paid, at huge prices, to free and publicly accessible this August. The EULA hasn’t quite been adapted to that fact yet.
As for changing that license, which is much less appropriate now that many more people can use it: I’ll ask NVIDIA about it, though I can’t provide any guarantees.
With the introduction of macOS on ARM, I was curious how well Windows applications have been updated to support Windows on ARM, which was first released a little over two years ago. Since then, well, it doesn’t look like much progress has been made. Of the popular applications I was able to come up with, only one, VLC, has support for ARM64. I think a large part of this difficulty is the fact that virtually no development platforms support ARM64. Many Windows apps are built with WPF, which doesn’t support ARM64. Even on the Microsoft side, only a handful of applications are compiled natively for ARM64.
I hope that by calling out the lack of support for ARM64 I can help push the platform forward and encourage more applications to release ARM64 versions!
Microsoft doesn’t have fat binaries. That makes a huge difference.
On macOS I press “Build” in Xcode and ship it. Assuming the code was portable, that’s all I need to do. Users don’t even need to know what CPU they have, and apps will continue to work—natively—even when the user copies them to a machine with a different CPU.
For Windows, I need to offer a separate download, and ask users to choose the version for the CPU they have, and then deal with support tickets for “what is CPU and why your exe is broken?” Or maybe build my own multi-arch installer that an ARM machine can run under emulation, but it can still detect and install for the non-emulated CPU. I don’t have time for either of these, so I don’t ship executables for ARM Windows, even though I could build them.
I don’t mean this to be too snarky, but do you test it on both systems?
I already have a multi-arch installer, and my code compiles for ARM64 fine, but I wouldn’t want to update the installer to point to a binary that I’ve never executed. The lack of supported virtualization options is noteworthy here. Right now my only real option is to spend $500-$1000 on an ARM Windows machine for this specific purpose.
Without a Mac devkit it’s hard to be sure, but I’d swear I saw a demo where Xcode can just launch the x64 version under Rosetta, so it becomes possible to test both on one machine. Unfortunately developers need new hardware because there’s no reverse-Rosetta for running ARM code on x64, so porting will still take time.
I’m not so sure that we really need reverse-Rosetta. The iOS simulator runs x86_64 binaries and is really accurate (except performance-wise). The Apple ecosystem already has extensive experience supporting both ARM and x86_64 binaries, and most Macs should be ARM in a few years anyway. And there is already the ARM Mac mini thingy for developers.
I haven’t, actually. In the case of the PPC and x86->x64 switches I just bought the new machine and tested only there. I already knew my code worked on the old architecture, so testing on both didn’t seem critical. In Apple’s case these are transitions rather than additions of another platform.
I don’t know if anyone is actually shipping things like this, but it is possible to do this on Windows by building the application as a DLL and then using a tiny .NET assembly that queries the current architecture and then loads and P/Invokes the correct version of the DLL. I saw a proof-of-concept for this a very long time ago, but I don’t think there’s tooling for it.
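For anyone curious what that looks like without .NET, here is a rough Win32 sketch of the same idea in C. The DLL names and the run_app entry point are invented for illustration; the real pieces are IsWow64Process2, which reports the machine’s native architecture even when the launcher itself runs under x64 emulation, and plain LoadLibrary/GetProcAddress dispatch:

/* Hypothetical thin launcher: detect the native architecture, then load
 * and call into the matching build of the application DLL.
 * "app_x64.dll", "app_arm64.dll" and "run_app" are made-up names. */
#include <windows.h>
#include <stdio.h>

typedef int (*run_app_fn)(void);

int main(void) {
    USHORT process_machine = 0, native_machine = 0;
    const char *dll = "app_x64.dll";

    /* IsWow64Process2 gives the *native* machine type, so an x64 launcher
     * running emulated on an Arm device still picks the ARM64 DLL. */
    if (IsWow64Process2(GetCurrentProcess(), &process_machine, &native_machine)
        && native_machine == IMAGE_FILE_MACHINE_ARM64)
        dll = "app_arm64.dll";

    HMODULE mod = LoadLibraryA(dll);
    if (!mod) { fprintf(stderr, "failed to load %s\n", dll); return 1; }

    run_app_fn run_app = (run_app_fn)GetProcAddress(mod, "run_app");
    return run_app ? run_app() : 1;
}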
I’m not really convinced by how Apple does fat binaries. It might be a space saving if the linker could deduplicate data segments, but (last I checked) ld64 didn’t, and so you really end up with two binaries within a single file. The NeXT approach was a lot more elegant. Files specific to each OS / architecture (NeXT supported application bundles that ran on OpenStep for Windows or OpenStep for Solaris as well as OPENSTEP) were in separate directories within the bundle, along with directories for common files. You could put these on a file server and have apps and frameworks that worked on every client that mounted the share, or you could install them locally and trivially strip out the versions that you didn’t need by just deleting their directories.
The ditto tool on macOS was inherited from NeXT and supported thinning fat bundles, and was extended to support thinning fat binaries when Apple started shipping them. That’s a bit awkward for intrusion detection things because it requires modifying the binary, and so tooling needs to understand how to check signatures within the binary, whereas the NeXT approach just deleted files.
Now that no one runs applications from a file share, the main benefit from fat binaries is during an upgrade. When you buy a new Mac, there’s a migration tool that will copy everything from the old machine to the new one, including applications. With a decent app store or repo infrastructure, such a tool would be able to just pull down the new versions. Honestly, I’d much rather they just extended the metadata in application and library bundles to include a download location and hash of the versions for other architectures. Then you could go and grab them when you migrated to a different system but not waste bandwidth and disk space on versions that you don’t need.
Obviously this was way back before code signatures became very load bearing on OS X… during the Intel transition I used to have a script that would spin over an app bundle and use lipo to create “thin” binaries so I could have enough room on my little SSD for all the things I used. I also pruned unnecessary localization files.
I forget what size that SSD was, but the difference was significant enough that learning lipo and scripting it out was worth my time.
I can definitely understand how confusing it would be to offer multiple architecture downloads. That being said I would strongly encourage you to at least provide a way to get to the ARM64 version if it’s trivial for you to build for it. That way seekers can run your app with the best performance on their machine.
Honestly I’m surprised that tools like MSIX don’t support multiple architectures.
msixbundle supports this - see https://docs.microsoft.com/en-us/windows/msix/package/bundling-overview
Glad to see that! I figured they’d have some sort of solution there.
Ended up submitting three PRs, because it seems you didn’t notice that Rust and Firefox literally just work as native Arm, and Visual Studio Code has it in the beta channel, with release happening within the month too. :-)
(Just as a note for anyone not following the GitHub discussion)
VS Code will be marked as available once the ARM64 version is in the stable channel.
Rust requires an additional step not required by the x86 version, so until the experience is transparent to the user I’m not going to mark it as available. That being said, I’ll be filing an issue with rustup to hopefully get it to install the ARM64 toolchain by default.
Firefox might get marked as available depending on if Firefox Installer.exe installs the ARM64 version.
I think that Apple replied in the original post (in the section “A Note On Web Applications Added to the Home Screen”):
As far as I understand, PWAs are not affected so…
PWAs are not affected if they are installed onto the home screen; if you keep using them inside Safari they are still affected, and so are all the other web sites. It is also a bit confusing because of the wording under:
Emphasis added by me to highlight that they are counting days for the installed PWA usage, which makes me wonder if they are deleting it as well or why they are counting the days of usage in such cases. I don’t know.
I find the text incredibly confusing.
So, what counts as use? Do I have to open the app? And after what number of days is my data deleted?
What counts as use is opening the app from the home screen. Data will be deleted after 7 days without opening the app.
Safari: each day you use it counts as a day. If 7 days of usage have passed without you visiting a specific website, its data is erased.
Homescreened webpage: each day you use it counts as a day. If 7 usage days have passed without you visiting the website, its data is erased. But since you visit the website every time you click the icon on your home screen, the counter should never go above 1. (If you’re using some third party domain to store the data, if it gets erased depends on how the webpage works and what your user does.)
I find it confusing as well.
I would say that both Erlang and OpenCL are made for multicore. They are very different languages, because the problems they solve are very different, but they still target multicore systems. There are different aspects of parallelism that they target.
As for taking a serial program and making use of parallelism, this is actually possible with C and Fortran. With gcc you can get, if you’re careful, SIMD parallelism through auto-vectorization. I believe that Intel’s compiler can even give you automatic multithreading.
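As a rough illustration (my example, not from the article): this is the shape of loop the GCC vectorizer handles well, and -fopt-info-vec will report what it vectorized (flag availability and output vary by GCC version):

/* Minimal auto-vectorization candidate. Compile with something like:
 *   gcc -O3 -fopt-info-vec -c saxpy.c
 * The restrict qualifiers tell the compiler the arrays don't alias,
 * which is what lets it turn the loop into SIMD loads/stores. */
#include <stddef.h>

void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}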
This historical presentation by Guy Steele is an amazing introduction to the problem: What Is the Sound of One Network Clapping? A Philosophical Overview of the Connection Machine CM-5. The people at Thinking Machines Corporation had it down to an art.
Intel ISPC too. (and that supports both SIMD and multicore scaling)
Thank you for that link, it was incredibly educational.
Context: Intel decided to kill the HLE part of TSX on current CPUs via a microcode update… and on Linux it was chosen to kill the other part too instead of relying on mitigations.
Is this separate from this submission?
https://lobste.rs/s/fstlth/intel_disables_hardware_lock_elision_on
Yes, HLE is only a part of TSX and was disabled outright by Intel in newer microcode.
The other part of TSX, explicit TSX (RTM), was left enabled by Intel with mitigation recommendations, but Linux chose to disable it outright.
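For anyone who hasn’t seen the two halves: HLE is the backwards-compatible XACQUIRE/XRELEASE prefix form, and explicit TSX (RTM) is the XBEGIN/XEND instruction interface. A hedged sketch of the usual RTM lock-elision pattern, using the _xbegin/_xend intrinsics from immintrin.h (needs -mrtm and a CPU that still exposes RTM):

/* Sketch of RTM-based lock elision with a spinlock fallback.
 * Reading the fallback lock inside the transaction makes a concurrent
 * lock holder abort us, which keeps the two paths mutually exclusive. */
#include <immintrin.h>
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held */
static long counter;

void increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
            _xabort(0xff);          /* someone holds the real lock */
        counter++;                  /* runs inside the transaction */
        _xend();
    } else {
        /* Aborted (conflict, capacity, lock held, ...): take the lock. */
        while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
            ;
        counter++;
        atomic_store_explicit(&fallback_lock, 0, memory_order_release);
    }
}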
For Windows, it’s a bit more complex because it doesn’t support interrupt controllers outside of a standard GIC for the ARMv8 port, so some patching will be required.
I’ve been in the process of getting a basic Linux port running for a while now, though.
For GrayKey and such, this allows them to image the keys and then restore them after the SEP trashes them from NAND when the input attempts are exceeded, making it possible to continue brute-forcing.
Does apple have a custom interrupt controller?? o_0
Yes, they use a custom AIC interrupt controller instead of the ARM GIC that pretty much everyone else uses now.
Also, their CPUs since the A10 only implement EL1 and EL0, with no EL2 or EL3 anywhere in them (plus a metric ton of custom registers, from KTRR through APRR, even WKdm compression extensions and more, plus AMX from the A13 onwards).
Also, about non-standard interrupt controllers and Windows: I forgot to talk about the Raspberry Pi exception, which was a very special case that didn’t happen twice.
Did I mention that it can bypass iCloud locked devices? (To turn them on with a custom/stock OS, not to break into another person’s OS, see SEP comment in other comment branch “below”)
Another post in that thread argues that these instructions won’t be user facing. We will find out in a few days.
I can personally confirm that they are running on the main AP cores.
It’s possible, however, that Xcode’s public tools won’t have support for it, and apps would only use it through Accelerate.framework in that case. (Support for AMX in Accelerate.framework has already been in place for a long time.)
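To make the Accelerate route concrete, here is a minimal sketch (my own example): the app calls the public BLAS interface, and whether Accelerate routes that matrix multiply through the AMX units on a given chip is an internal detail the API never exposes:

/* Build on macOS with: clang example.c -framework Accelerate
 * (the file name is arbitrary). Computes C = A * B for 4x4 matrices. */
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    enum { N = 4 };
    float a[N * N], b[N * N], c[N * N];

    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    /* C = 1.0 * A * B + 0.0 * C, row-major. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, a, N, b, N, 0.0f, c, N);

    printf("c[0] = %f\n", c[0]);   /* 8.0 for these inputs */
    return 0;
}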
If every OSS maintainer had a nickel for every time a random person on the internet implicitly accused them of hating freedom because a program behavior does not align with their political beliefs, the problem of funding OSS maintenance would be solved.
Maybe make it so that you have to pay a few cents to make an issue without a patch in an issue tracker? It’s certainly not a perfect solution, but at least people would have to start thinking about the importance of their comments.
Making people pay money would just leave lots of bugs unreported until a golden master, which won’t be good for testing coverage, especially as businesses do not tend to touch prerelease core OS libraries…
It would also make the process of reporting bugs more complex, with anonymity harder to guarantee for the users who would rather keep it.
Agreed. More bugs would go unreported and unfixed that way. It makes for an interesting thought experiment, however.