I don’t understand why the paper jumps to NaN boxing, only to turn around and cast everything to uint64_t; am I missing something? The whole section on NaN (and BCD) boxing makes very little sense to me.
NaN boxing is a strategy for creating an information rich pointer that encodes the type of the thing pointed to, and often things like garbage colleciton info. It’s not an actual pointer. But a single NaN double might provide an index into a large array of Cells, or other data structure.
The reason to convert it to a uint64 is because you need to pull the bits out of the double to extract that information.
I guess my question is: why not just use a uint64_t in the first place, and not have to lose the extra bits to NaN tagging? An opaque typedef uint64_t cell seems like it would do what’s needed, but without losing the exponent bits to NaN?
I hope I’m not coming off as anything but confused; I’m sure there’s a good reason, I just don’t get it.
This is the part that my ultra-tired brain failed to type. But, yes, this. It’s still a double, just that if it’s a NaN there’s a lot of space to put other information, and you might as well take advantage of that.
NaN boxing is not a way of storing integers, it is a way of storing a tagged union of a double and a pointer. If you just used a uint64_t then you couldn’t differentiate between a 64-bit integer and a pointer.
That said, NaN boxing is a bit of an odd choice for Lisps. Lisp was the language that invented using the low bit for tagging. Pointers to Lisp cells are always at least 2-byte aligned and so you can use the low bit as a tag to differentiate between a 31- or 63-bit integer and a pointer. On modern architectures, loads and stores typically have an immediate field and so subtracting 1 from the address as part of a load or store is free.
JavaScript implementations favour NaN boxing because their only numeric type is an IEEE 64-bit float.
NaN boxing is a way of storing a tagged union of a double and something else, where the something else fits into the 48 bit data field of an IEEE NaN value. In this case, the author is using a NaN box to represent a tagged union of a double, the NIL object, an atom, a cons cell, a closure (CLOS), and a primitive (PRIM). All the tag information to represent these 6 alternatives is stored in the box (no indirection is needed to fetch it). This is a typical use case for NaN boxing.
On 64-bit platforms, it’s common to require 8-byte alignment for pointers, which gives you 8 possible tag values. Objective-C uses this, for example, to encode small strings in pointer values, in addition to integers and floating poin values that aren’t using the full encoding space. Cincom Smalltalk used one of the tags to indicate a floating point value where the low 2 bits of the mantissa were repeated, since that covered a large number of things that they saw in profiling (in addition to supporting 61-bit integers).
This finally clicked; since NaN isn’t a usable number, you need to avoid it in normal use anyway, and representing everything else inside the NaN means it can represent both at the same time.
Using uint64_t would still work (it’s 8 bytes either way), but it would require that we explicitly handle math on those values to avoid messing with pointers where numbers were required.
I believe chrome no longer uses nan boxing, but has a compact representation for integers that fit within 30 or so bits. Since double-precision ieee floats contain all 30-bit integers, no special affordances are needed to preserve semantics (beyond the obvious overflow detection). This seems like a very reasonable route to me.
I believe that this is not a new development. The V8 codebase is a linear descendant of Anamorphic Smalltalk, which picked up this representation from Smalltalk-80.
This is true. Also, Chromium V8 has had more than one numeric type internally for a number of years now.
A few years ago, V8 was using NaN boxing, and it had a direct representation for small integers that fit into a NaN box. So two numeric types were used internally.
Today, the Javascript language has 2 user-visible numeric types (floats and big integers). In V8, the NaN boxing is gone, and there is a tagged pointer representation that represents 32 bit small integers directly (on 64 bit platforms) or 31 bit integers on 32 bit platforms.
I don’t understand why the paper jumps to NaN boxing, only to turn around and cast everything to
uint64_t
; am I missing something? The whole section on NaN (and BCD) boxing makes very little sense to me.NaN boxing is a strategy for creating an information rich pointer that encodes the type of the thing pointed to, and often things like garbage colleciton info. It’s not an actual pointer. But a single NaN double might provide an index into a large array of Cells, or other data structure.
The reason to convert it to a uint64 is because you need to pull the bits out of the double to extract that information.
I guess my question is: why not just use a
uint64_t
in the first place, and not have to lose the extra bits to NaN tagging? An opaquetypedef uint64_t cell
seems like it would do what’s needed, but without losing the exponent bits to NaN?I hope I’m not coming off as anything but confused; I’m sure there’s a good reason, I just don’t get it.
So you can also store numbers in the same place without an extra memory indirection?
This is the part that my ultra-tired brain failed to type. But, yes, this. It’s still a double, just that if it’s a NaN there’s a lot of space to put other information, and you might as well take advantage of that.
NaN boxing is not a way of storing integers, it is a way of storing a tagged union of a double and a pointer. If you just used a
uint64_t
then you couldn’t differentiate between a 64-bit integer and a pointer.That said, NaN boxing is a bit of an odd choice for Lisps. Lisp was the language that invented using the low bit for tagging. Pointers to Lisp cells are always at least 2-byte aligned and so you can use the low bit as a tag to differentiate between a 31- or 63-bit integer and a pointer. On modern architectures, loads and stores typically have an immediate field and so subtracting 1 from the address as part of a load or store is free.
JavaScript implementations favour NaN boxing because their only numeric type is an IEEE 64-bit float.
NaN boxing is a way of storing a tagged union of a double and something else, where the something else fits into the 48 bit data field of an IEEE NaN value. In this case, the author is using a NaN box to represent a tagged union of a double, the NIL object, an atom, a cons cell, a closure (CLOS), and a primitive (PRIM). All the tag information to represent these 6 alternatives is stored in the box (no indirection is needed to fetch it). This is a typical use case for NaN boxing.
On 64-bit platforms, it’s common to require 8-byte alignment for pointers, which gives you 8 possible tag values. Objective-C uses this, for example, to encode small strings in pointer values, in addition to integers and floating poin values that aren’t using the full encoding space. Cincom Smalltalk used one of the tags to indicate a floating point value where the low 2 bits of the mantissa were repeated, since that covered a large number of things that they saw in profiling (in addition to supporting 61-bit integers).
This finally clicked; since NaN isn’t a usable number, you need to avoid it in normal use anyway, and representing everything else inside the NaN means it can represent both at the same time.
Using
uint64_t
would still work (it’s 8 bytes either way), but it would require that we explicitly handle math on those values to avoid messing with pointers where numbers were required.Thanks!
I believe chrome no longer uses nan boxing, but has a compact representation for integers that fit within 30 or so bits. Since double-precision ieee floats contain all 30-bit integers, no special affordances are needed to preserve semantics (beyond the obvious overflow detection). This seems like a very reasonable route to me.
I believe that this is not a new development. The V8 codebase is a linear descendant of Anamorphic Smalltalk, which picked up this representation from Smalltalk-80.
This is true. Also, Chromium V8 has had more than one numeric type internally for a number of years now.
A few years ago, V8 was using NaN boxing, and it had a direct representation for small integers that fit into a NaN box. So two numeric types were used internally.
Today, the Javascript language has 2 user-visible numeric types (floats and big integers). In V8, the NaN boxing is gone, and there is a tagged pointer representation that represents 32 bit small integers directly (on 64 bit platforms) or 31 bit integers on 32 bit platforms.