Apple’s CoreFoundation library makes extensive use of low-bit pointer tagging. This allows it to store small integers (and maybe floats?), booleans and short strings directly in a pointer without allocating memory.
The encoding gets pretty complex, especially for strings; IIRC there are multiple string encodings, one of which can crunch characters down to 5 bits so it can store a dozen(?) characters in a 64-bit pointer. This sounds expensive, but I assume there are SIMD tricks to speed it up, and it’s still going to be way faster than allocating and dereferencing a pointer.
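For anyone who hasn’t seen the trick: heap allocations are aligned, so the low bits of a real pointer are always zero, and setting one of them can mark the word as an immediate value instead. A minimal sketch in C (illustrative only, not CoreFoundation’s actual bit layout):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of low-bit tagging (illustrative only, not CF's layout):
 * heap pointers are at least word aligned, so their low bits are zero;
 * a set bit 0 marks the word as an immediate small integer instead. */

typedef uintptr_t value_t;

#define TAG_INT 0x1u

static value_t  make_int(intptr_t i) { return ((uintptr_t)i << 1) | TAG_INT; }
static int      is_int(value_t v)    { return (int)(v & TAG_INT); }
static intptr_t int_value(value_t v) { return (intptr_t)v >> 1; }  /* arithmetic shift assumed */

static value_t make_ptr(void *p) {
    assert(((uintptr_t)p & TAG_INT) == 0);  /* alignment keeps bit 0 free */
    return (uintptr_t)p;
}

int main(void) {
    int heap_obj = 7;
    value_t a = make_int(-42);
    value_t b = make_ptr(&heap_obj);
    printf("a: %s %ld\n", is_int(a) ? "int" : "ptr", (long)int_value(a));
    printf("b: %s\n", is_int(b) ? "int" : "ptr");
    return 0;
}
```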
Any reference on the 5-bit encoding? Is this in APIs that get called from Objective-C or Swift?
The low-bit tagging seems like an obvious and portable win these days, although I know Lua avoided it because it’s not strictly ANSI C. Rust also got rid of its small string optimization early in its life, apparently due to code size:
https://old.reddit.com/r/rust/comments/2slcs8/small_string_optimization_remove_as_mut_vec/
Though honestly I would have expected fewer heap allocations and cache misses to be a win for most string workloads.
You got it — the reference I remember is from Mike Ash’s blog, which has been dormant for a few years, but the archives are a treasure trove of low-level info about Apple’s runtime.
The CoreFoundation types are exposed in Obj-C as NSString, NSNumber, NSDate, NSValue. They also show up in Swift for bridging purposes, but the native Swift string and number classes are implemented differently (in Swift.)
I was actually looking at Mike Ash’s blog this week for info on tagged pointers:
https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html
How about 5 bits? This isn’t totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q. 5 bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.
I was actually looking at the blog because I was wondering if ref counting in the unused 16 bits of a pointer might be feasible. It would give you up to 65k references, which is more than enough for many (most?) use cases. That would slim down the size of ref counted values and might make them as cache friendly as a GCed value. Might not be thread safe though.
Wow, skimming the rest of the post, this is a lot more subtle than I would have expected, and also relatively recent – OS X 10.10 as of 2014.
If the length is between 0 and 7, store the string as raw eight-bit characters.
If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX”.
If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013”
The five-bit alphabet is extremely limited, and doesn’t include the letter b! That letter must not be common enough to warrant a place in the 32 hallowed characters of the five-bit alphabet.
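Putting those three rules into code makes the scheme easier to follow. The payload/length extraction below is hypothetical (the real tagged-pointer layout is an implementation detail that has changed over releases), but the decode logic is just the quoted rules:

```c
#include <stdint.h>

/* Sketch of the three regimes quoted above. How the payload and length are
 * pulled out of the tagged pointer is hypothetical; this also assumes the
 * first character sits in the lowest bits of the payload. */

static const char kSixBit[] =
    "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
static const char kFiveBit[] = "eilotrm.apdnsIc ufkMShjTRxgC4013";

/* payload: up to 60 bits of character data; out must hold len+1 bytes */
static void decode_tagged_string(uint64_t payload, unsigned len, char *out) {
    if (len <= 7) {                                 /* raw 8-bit characters */
        for (unsigned i = 0; i < len; i++)
            out[i] = (char)((payload >> (8 * i)) & 0xFF);
    } else if (len <= 9) {                          /* 6-bit alphabet */
        for (unsigned i = 0; i < len; i++)
            out[i] = kSixBit[(payload >> (6 * i)) & 0x3F];
    } else if (len <= 11) {                         /* 5-bit alphabet */
        for (unsigned i = 0; i < len; i++)
            out[i] = kFiveBit[(payload >> (5 * i)) & 0x1F];
    }
    out[len] = '\0';
}
```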
Pretty crazy!
I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.
For what I’m doing, we have to call into libc more, and pass it C strings, so we don’t control that part. The allocation to decode and make it look like a C string is problematic. Not just slow, but creates an ownership problem.
It makes me wonder if that is a headwind for Swift adoption on other platforms. Is the language so tuned to performance on a single platform that it makes the code difficult to port?
It seems like they are also doing some pretty intricate things with C++ interop. As an outside observer (and a relatively ignorant one at that), it seems like it would be very difficult to generalize some of this work.
It shouldn’t be hard to port. In GNUstep, we adopted a compressed strings-in-pointers encoding some years before Apple (not the 5-bit one; doing that well requires analysis of run-time profiling data from a large set of real-world applications, which I didn’t have access to). The interface for iterating over strings allows the caller to provide a buffer. These strings are, by definition, of small bounded length, so converting to a C string for interoperability is trivial, with the caller providing the buffer on its stack.
It does have some very nice performance properties. A lot of dictionaries use small strings as keys. If you check the length on the way in and convert mutable strings used for lookup into small strings when they fit, you know up front whether the key is a small string, and that lets you skip a bunch of comparisons after you’ve found the hash bucket.
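Roughly the shape of that fast path, as a sketch: the packing format and helper names here are made up for illustration, and it assumes keys were stored in small form on insertion whenever they fit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the fast path described above: keys that fit the
 * small-string-in-a-word form compare with a single integer compare, and a
 * small key can never equal a heap-allocated key (which is only used when
 * the string was too long to pack). */

#define SMALL_TAG 0x1u
#define SMALL_MAX 7                /* 7 raw bytes + tag + length fit in 64 bits */

/* Try to pack a C string into a tagged word; returns false if too long. */
static bool small_string_try_pack(const char *s, size_t len, uint64_t *out) {
    if (len > SMALL_MAX) return false;
    uint64_t w = SMALL_TAG | ((uint64_t)len << 1);
    for (size_t i = 0; i < len; i++)
        w |= (uint64_t)(unsigned char)s[i] << (8 + 8 * i);
    *out = w;
    return true;
}

/* Compare a probe key against a stored key word (tagged small string,
 * or a pointer to a NUL-terminated heap string). */
static bool key_equal(uint64_t stored, const char *probe, size_t len) {
    uint64_t packed;
    if (small_string_try_pack(probe, len, &packed))
        return stored == packed;           /* small vs small: one compare */
    if (stored & SMALL_TAG)
        return false;                      /* stored is small, probe is long */
    const char *heap = (const char *)(uintptr_t)stored;
    return strlen(heap) == len && memcmp(heap, probe, len) == 0;
}
```

The nice property is the early-out: if one side packs and the other doesn’t, they can’t be equal, so you never touch the heap string at all.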
This stuff is part of CoreFoundation, not Swift. Swift on Apple platforms has some pretty sophisticated interop / FFI with it, for compatibility with Objective-C code, but that isn’t present in Swift on other platforms.
Not really. In fact, Swift on Linux doesn’t have certain complexities that are present on Apple platforms due to the necessity of ObjC interop.
I haven’t followed Swift closely, but I do have the feeling that the focus is entirely on Apple’s platform, and there are some fairly hard tradeoffs with respect to portability / open source.
I think of Google’s entire stack as a vertically integrated embedded system (hardware up to cloud and apps), and Apple seems to be architected in a similar way. They get some pretty big benefits from that vertical integration and control.
Andreas Kling mentioned that Apple is a huge inspiration for SerenityOS – basically doing everything yourself and not taking dependencies, which is kind of the opposite of most open source, where the emphasis is on reuse and portability.
It seems like if Swift were going to be popular on Linux or Windows, that would have already happened by now. It looks like it’s about 9 years since the first release.
You can’t put the ref-count in a pointer, because a pointer is a reference. If you increment a count in a pointer, that doesn’t affect any other pointers to the object, so no one else knows you added a reference.
CoreFoundation does (IIRC) store refcounts outside objects. My memory is hazy, but it might reserve a few bits for an internal refcount, and when that pins at its max value, the real refcount moves to a global hash table. The answer probably lies in Mike Ash’s blog. (Swift-native class objects don’t do this, though.)
[Update: just saw your other post below. What CF does uses spare bits in a pointer field inside the object; I thought you meant putting the refcount in pointers to the object.]
No, you were right in the first place. I was totally thinking about things wrong. I got wrapped around the tree a bit. Thank you for the correction.
It was this one - I added a link to it this morning after this thread!
The refcount for a normal objc object is kept in a few bits of the isa pointer, and then moved to a side table once the refcount exceeds the available bits. The result is that there’s no major memory overhead to the refcount in normal objects.
I did see that in this article from Mike Ash’s site:
https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html
It made me happy to see that it wasn’t a totally stupid idea on my part! :]
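In miniature, that scheme looks something like the sketch below. This is a single-threaded toy, not the actual runtime layout: the real thing uses atomics, and the side table is a proper locked hash table rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: keep a small retain count in spare high bits of
 * the object's class-pointer word, and spill to an external table once that
 * saturates. Assumes the top byte is unused by the address. */

#define RC_SHIFT 56
#define RC_MASK  (0xFFULL << RC_SHIFT)

typedef struct object { uintptr_t isa_bits; } object_t;

/* Toy side table: parallel arrays, linear scan, no eviction. */
static object_t *side_obj[64];
static uint64_t  side_cnt[64];

static void side_table_add(object_t *o) {
    for (size_t i = 0; i < 64; i++) {
        if (side_obj[i] == o || side_obj[i] == NULL) {
            side_obj[i] = o;
            side_cnt[i]++;
            return;
        }
    }
    /* toy code: a full table is silently ignored */
}

static void retain(object_t *o) {
    if ((o->isa_bits & RC_MASK) != RC_MASK)
        o->isa_bits += 1ULL << RC_SHIFT;   /* fast path: bump the inline count */
    else
        side_table_add(o);                 /* inline count saturated: spill out */
}
```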
The various tagging schemes that objc (and by proxy Swift’s interop) uses are internal implementation details that can change (and have changed), so they’re not API. Instead objc_msgSend and family handle it directly - similar to the myriad refcount stores and what not.
Yup. Plus the WebKit “Gigacage” and V8 (the JavaScript runtime) use this mechanism for sandboxing. I hear another browser engine is considering something similar.
On Intel, the 5 bit encoding can be optimized by using pdep to expand each 5 bits into a full byte, and pshufb to do a SIMD table lookup. I don’t think Arm has something like pdep though.
Pdep is a bit overpowered. You can also do it with a multishift, though arm doesn’t have those either…
I don’t know much about ARM, but I guess you could:
Broadcast the lower and upper 32 bits of the NSString each into a SIMD register, v1 and v2.
Shift and mask each lane by a separate amount so that the relevant 5 bits are aligned with 16-bit lanes (and the relevant lanes don’t overlap between v1 and v2).
OR v1 and v2.
Use a tbl instruction to reconstruct the original order.
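Here’s what the pdep half of the Intel approach looks like as a sketch. The character order and payload layout are assumptions, and the final table lookup is left as a plain loop, since pshufb only indexes 16 entries per register; a 32-entry alphabet needs two shuffles plus a blend (or AVX-512 vpermb) to stay in SIMD.

```c
#include <immintrin.h>   /* _pdep_u64 (BMI2); compile with -mbmi2 */
#include <stdint.h>
#include <string.h>

/* Sketch of the pdep idea: deposit each 5-bit code into its own byte, then
 * map codes to characters. Assumes a little-endian host and that the first
 * character lives in the lowest 5 bits of the payload. */

static const char kFiveBit[33] = "eilotrm.apdnsIc ufkMShjTRxgC4013";

/* payload: up to eleven 5-bit codes; out must hold len+1 bytes (len <= 11) */
static void decode_five_bit(uint64_t payload, unsigned len, char *out) {
    uint8_t codes[16] = {0};
    uint64_t lo = _pdep_u64(payload,       0x1F1F1F1F1F1F1F1FULL); /* chars 0..7  */
    uint64_t hi = _pdep_u64(payload >> 40, 0x00000000001F1F1FULL); /* chars 8..10 */
    memcpy(codes,     &lo, 8);
    memcpy(codes + 8, &hi, 8);   /* only the low 3 bytes of hi are meaningful */
    for (unsigned i = 0; i < len && i < 11; i++)
        out[i] = kFiveBit[codes[i]];
    out[len] = '\0';
}
```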
Storing tags in a few low bits isn’t really what this is about I think, as changes to total address space (adding high bits) don’t really impact it.
The article did talk about it, though.
It’s not what led me to start writing the article but it’s definitely relevant. It’s also probably used way more in practice than putting data in the upper bits. Thanks @snej for the additional note - I actually came across some of the Mike Ash blog posts last night too. I’ll add a reference to them in.
For CHERI C, we found a lot of cases of storing things in the low bits (not least in LLVM’s compressed integer pointer pair template). This is a big part of why we support (on big CHERI) out of bounds representation beyond the end of a pointer’s range: so you can store data in the low bits of a pointer to the end of an array (you can’t dereference it without bringing it back).
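For readers who haven’t run into it, the low-bit pattern in question (LLVM’s PointerIntPair and similar) boils down to something like this in plain C; it’s exactly this kind of word that gets interesting when the pointer half points one past the end of an array:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal C version of the "compressed integer + pointer pair" idea:
 * a pointer to a type with alignment >= 8 always has its low 3 bits clear,
 * so up to 3 bits of payload can ride along in the same word. */

typedef struct node { struct node *next; int value; } node_t;  /* 8-byte aligned on 64-bit */

#define LOW_MASK ((uintptr_t)0x7)

static uintptr_t pack(node_t *p, unsigned bits) {
    assert(((uintptr_t)p & LOW_MASK) == 0 && bits <= LOW_MASK);
    return (uintptr_t)p | bits;
}

static node_t  *unpack_ptr(uintptr_t w)  { return (node_t *)(w & ~LOW_MASK); }
static unsigned unpack_bits(uintptr_t w) { return (unsigned)(w & LOW_MASK); }
```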
Most (not quite all) of the places where things stored data in the top bits, they were creating tagged unions of a pointer and some other data, and rarely storing anything in the top bits other than a ‘this is not a pointer’ pattern, which can be replaced by the tag bit on CHERI (giving a full 128 bits of space for other data, something I started to explore but didn’t pursue very far and which I expect to be very useful for dynamic languages). A small number of things stored a pointee type in the high bits.
Morello separates the value field of the capability into address and flags. With TBI enabled, you can store data in the top 8 bits. I believe this was mostly done to allow MTE composition later. It has some fun implications on arithmetic.
I’ll just leave this here:
http://sigusr2.net/be-moore-like-chuck.html
The relative frequency of characters determines the order of the characters in the sprite sheet. You see, to save space in code, and in the runtime, characters are not stored as ASCII. In place of ASCII, Huffman codes are used. The letter ‘a’ is 5, and the letter ‘b’ is 19.
As in, each character in the code you are currently writing in the editor is Huffman coded.
I remember when this was done on the Mac, back when they ran on the 68000. The 68000 had a physical 24-bit address bus, so plenty of software made use of the upper 8 bits, and then it took years for all that software to be cleaned up once 68020s (with a 32-bit physical address bus) became popular. Looks like we still haven’t learned.
I think it was the 030 that first supported 32-bit addresses.
The problem wasn’t apps using the high bits themselves, but that the OS initially used them as an implementation detail of the memory manager, and some developers took shortcuts by testing or setting those bits directly in handles instead of using the official API calls like HLock / HUnlock. Then when the first 030 Macs rolled out, Apple changed the memory manager to store those tags elsewhere, and the apps that were groping the high bits would misbehave or crash.
Apps using undocumented behavior and accessing memory directly instead of using accessor functions (a lot of the original Toolbox data structures were public structs/records) was one of the main reasons it took Apple so long to move the Mac to a modern OS architecture.
The 68020 supported a 32-bit address bus.
Vmem makes this a non-issue: if the OS supports a per-application opt-in to larger address spaces, it just means applications self-limit by using tagged pointers.