My initial reaction to the problem statement was: Why not add an instruction prefix or flag that causes integer overflow to trap? A quick scan around the net shows I’m not even in the first thousand people to suggest that. It’s just waiting for someone at Intel or ARM (or RISC-V!) to notice.
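For concreteness, the check already exists in software as a compiler intrinsic; a trapping prefix would just let hardware do this without the branch. A rough C sketch, with `abort()` standing in for the proposed trap:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: __builtin_add_overflow is the GCC/Clang intrinsic that
 * exposes the CPU's overflow flag; abort() stands in for the hardware
 * trap being imagined here. */
static int64_t trapping_add(int64_t a, int64_t b) {
    int64_t r;
    if (__builtin_add_overflow(a, b, &r))
        abort();                /* would trap instead of silently wrapping */
    return r;
}
```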
The article argues for hardware support for arithmetic operations on tagged integer formats wherein a value is either a fixed-size int or a pointer to a dynamically-allocated infinite-precision “bignum”.
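Concretely, such a representation looks something like this (a minimal sketch; the names and the choice of the low bit as the tag are mine, not necessarily the article's):

```c
#include <stdint.h>
#include <stdbool.h>

/* Tagged value: low bit 1 => small integer stored inline,
 * low bit 0 => pointer to a heap-allocated bignum. */
typedef uintptr_t value_t;

static inline bool     is_small(value_t v)    { return (v & 1) != 0; }
static inline value_t  box_small(intptr_t n)  { return ((uintptr_t)n << 1) | 1; }
static inline intptr_t unbox_small(value_t v) { return (intptr_t)v >> 1; }  /* arithmetic shift assumed */
/* when the tag bit is clear, v is simply a pointer to the bignum object */
```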
To me this does not seem generally useful enough to bake into CPU ISAs. Yes, integer overflow is unfortunate and leads to bugs. But in the most common cases where performance matters, integers refer to fixed-capacity constructs like memory addresses, or quantities of things in memory (bytes, characters, array items…), or external values whose API limits their size, like database primary keys. In those cases, even if you can internally arrive at an arbitrary-size result, you've got to truncate it to 64 bits to use it for its intended purpose.
Sure, there are domains like cryptography, finance and some math/science where you need bignums. But in those you’d just hardcode everything to be a bignum, as today.
Numbers that might or might not be heap-allocated would be quite expensive to use. We’re very used to the idea that a number is a simple scalar that can be discarded without any work. If every int might have a pointer in it, then either you’ve got to add a ton of code to dealloc every consumed input value, or else adapt your GC to have to put write barriers on every integer field and scan every integer on the stack. Ouch.
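A rough sketch of what that consumption cost looks like (bignum_release here is a hypothetical refcounting helper, not any particular runtime's API):

```c
#include <stdint.h>

struct bignum;                              /* hypothetical heap-allocated bignum */
void bignum_release(struct bignum *);       /* hypothetical refcount decrement */

typedef uintptr_t value_t;                  /* low bit 1 => inline small int */

/* Discarding a number is no longer free: every consumed value has to be
 * checked in case it carries a heap pointer. */
static void value_release(value_t v) {
    if ((v & 1) == 0)                       /* tag clear => heap bignum */
        bignum_release((struct bignum *)v);
    /* small ints are plain bits; nothing to do */
}
```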
> The article argues for hardware support for arithmetic operations on tagged integer formats wherein a value is either a fixed-size int or a pointer to a dynamically-allocated infinite-precision “bignum”.
It doesn’t argue for full hardware support for bignums. It argues for extra instructions so the handling of tagged integers can be made more efficient.
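To make that concrete, this is roughly the fast path those instructions would be shortening (a sketch, assuming a low tag bit of 1 and a hypothetical slow_add that falls back to a real bignum):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uintptr_t value_t;                       /* small int n encoded as 2n+1 */
static inline bool is_small(value_t v) { return (v & 1) != 0; }

value_t slow_add(value_t a, value_t b);          /* hypothetical bignum fallback */

/* Two tag checks, one add with the tags folded together, one overflow
 * check, and a rarely taken branch to the allocation path. */
static value_t tagged_add(value_t a, value_t b) {
    intptr_t r;
    if (is_small(a) && is_small(b) &&
        !__builtin_add_overflow((intptr_t)a - 1, (intptr_t)b, &r))
        return (value_t)r;                       /* (2x+1) - 1 + (2y+1) = 2(x+y)+1 */
    return slow_add(a, b);
}
```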
It's not immediately obvious to me, but I'd be curious about the overlap with support instructions for tagged pointers, which are also somewhat of a thing.
> Sure, there are domains like cryptography, finance and some math/science where you need bignums. But in those you’d just hardcode everything to be a bignum, as today. Numbers that might or might not be heap-allocated would be quite expensive to use.
Plenty of bignum implementations use tagged integers to optimize the case of small numbers (including the default integer type in some languages, as listed in the article). Using an always-heap-allocated type would be even more expensive.
Of course you can argue that all these languages made a bad choice, but they exist.
The argument I read is that fixed-size ints are bad (not true integers!) and integer overflow is dangerous, and the answer is bignums everywhere, so CPUs should add special instructions to make them faster.
Obviously bignum libraries use tagging to optimize small ints. My point is that the domains that make heavy use of bignums aren't things computers spend enough time doing to be worth adding special instructions to speed them up.
I don’t know about Python, nodejs, and other dynamic langs, but I believe Ruby doesn’t have “native” integers and floats. That is, it cannot just pass the values to the CPU as-is. My guess is that many other langs are similar.
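For what it's worth, my understanding of CRuby's encoding looks roughly like this (a sketch in C of how I believe the tagging works, not the actual macro names):

```c
#include <stdint.h>
#include <stdbool.h>

/* A VALUE is pointer-sized; small integers are stored inline as (n << 1) | 1,
 * so every arithmetic operation starts with a tag check before it can hand
 * the raw machine integer to the CPU. */
typedef uintptr_t VALUE;

static inline bool  is_fixnum(VALUE v)   { return (v & 1) != 0; }
static inline VALUE to_fixnum(long n)    { return ((VALUE)n << 1) | 1; }
static inline long  from_fixnum(VALUE v) { return (long)v >> 1; }  /* arithmetic shift assumed */
```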
So my guess is that a lot of languages where you don’t have to explicitly declare your data types would benefit from some hardware support for this common operation. The question is: how much does an extra instruction “cost” and is the benefit worth it? That I cannot answer. But the “bignums are rarely used” perspective is missing a huge piece of the computing ecosystem.
Smalltalk and Lisp both had this model and it’s quite nice, but it doesn’t eliminate the need to care about overflow. Integer overflow can quickly become a denial of service attack, because now each arithmetic operation needs to do a heap allocation. When I wrote Pragmatic Smalltalk, I had a little demo of a Fibonacci sequence generator in Objective-C and Smalltalk. The code looked almost the same, but after a short while the Objective-C one hit integer overflow and started giving the wrong answer. The Smalltalk one didn’t, but it did get around a thousand times slower.
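A rough C analogue of the fixed-width half of that demo (not the original Objective-C, just the shape of it): with 64-bit unsigned arithmetic the sequence silently wraps from fib(94) onward, whereas the bignum-backed version keeps producing correct digits, just slowly.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t a = 0, b = 1;                        /* fib(0), fib(1) */
    for (int n = 2; n <= 100; n++) {
        uint64_t next = a + b;                    /* wraps mod 2^64 once fib(n) > UINT64_MAX */
        a = b;
        b = next;
        printf("fib(%d) = %llu\n", n, (unsigned long long)b);
    }
    return 0;
}
```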
I’m quite surprised that the performance that Daan is seeing matters. I wrote down the techniques that I used because, although lots of other people used them, no one had bothered to document them and so I had to work them out from scratch.
The tag-bit switching that Daan discusses is largely done for the other case: most architectures have addressing modes that let you fold a -1 (combined with a + field offset) into load and store instructions, so clearing the tag bit is free in loads and stores, and then you also don't need to do anything with it for heap things.
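Something like this, in C terms (the type and field names are made up; the point is that the -1 folds into the load's constant displacement):

```c
#include <stdint.h>

/* Heap references carry a low tag bit of 1; stripping it is folded into the
 * field offset, so the load below compiles to a single instruction with an
 * immediate displacement of (field offset - 1) on most architectures. */
struct cons { uintptr_t car, cdr; };

static inline uintptr_t cdr_of(uintptr_t tagged_ref) {
    return ((const struct cons *)(tagged_ref - 1))->cdr;
}
```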
Daan’s paper somewhat argues why this kind of thing isn’t in hardware: his tagged integer representation is not quite the one that we used, and hardware would need to be generic enough to cover both to be useful. Actually, it’s much worse, because most mainstream languages that care about this kind of thing (JavaScript, Lua, and so on) use NaN boxing, not tagged integers, which would need very different hardware. SPARC included instructions for tagged integer arithmetic and they turned out to be not quite the right shape, so no one used them. For 64-bit platforms, we used three tag bits (and the speedup from embedding small strings in pointers offset basically everything else; Apple subsequently adopted a more clever version for NSString).
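For readers who haven't seen it, NaN boxing looks roughly like this (an illustrative layout, not any particular engine's; real VMs also canonicalize genuine NaNs, which I skip here):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef uint64_t nanbox_t;

#define QNAN_BITS UINT64_C(0x7ff8000000000000)    /* quiet-NaN exponent + bit */
#define TAG_INT   UINT64_C(0x7ffc000000000000)    /* made-up tag for 32-bit ints */

/* Doubles are stored as their own bit pattern; everything else hides in the
 * otherwise-unused NaN payload space, distinguished by high tag bits. */
static inline nanbox_t box_double(double d)  { nanbox_t v; memcpy(&v, &d, sizeof v); return v; }
static inline bool     is_double(nanbox_t v) { return (v & QNAN_BITS) != QNAN_BITS; }
static inline nanbox_t box_int(int32_t i)    { return TAG_INT | (uint32_t)i; }
static inline int32_t  unbox_int(nanbox_t v) { return (int32_t)(uint32_t)v; }
```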
It’s also hard to get the performance right. For example, a few years ago Arm added an EL0 exception mechanism so that you could get a trap on overflow and handle the trap in userspace. This was done because folks working on JavaScript VMs wanted to be able to do integer arithmetic and promote to float on overflow. It turned out that handling the traps cost much more than the cost of doing an extra predicted-not-taken branch and having a pipeline flush on the cases where it was wrong. That’s really your performance limit: it has to be faster than a pipeline flush to go to the slow path or it isn’t worth saving the branch.
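In other words, the thing to beat is just this pattern, which the branch predictor almost always gets right (a sketch; add_slow is a hypothetical promote-to-float or bignum fallback):

```c
#include <stdint.h>

int64_t add_slow(int64_t a, int64_t b);          /* hypothetical promotion path */

static inline int64_t add_fast(int64_t a, int64_t b) {
    int64_t r;
    /* predicted-not-taken branch on the overflow flag */
    if (__builtin_expect(__builtin_add_overflow(a, b, &r), 0))
        return add_slow(a, b);                   /* rare: overflow path */
    return r;
}
```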
The proposed instructions are just setting the carry bit in more conditions. That’s… fine, I guess. The problem is that this speedup will really not show up outside microbenchmarks. If you’re doing a load of arithmetic in loops, it will probably be slightly faster. Otherwise, the extra ops to calculate whether to hit the slow path are just absorbed by instruction-level parallelism, and the branch will be predicted not-taken. This kind of thing is precisely what modern CPUs are aggressively optimised for. I doubt you’d actually see more performance from these magic instructions in anything other than a synthetic benchmark.
> Integer overflow can quickly become a denial of service attack, because now each arithmetic operation needs to do a heap allocation
The Smalltalk-80 text view hit this problem whenever the height of the text reached 32768 pixels. Performance suddenly went off a cliff. One of my summer-intern tasks was to find a workaround, so I hacked a subclass that used fixed line heights and measured everything in lines, not pixels.
Fun. I found a related bug in NSTableView that was present in both GNUstep and Cocoa. Both (at least, I assume both - they had the same behaviour, but I could only look at the code for GNUstep) calculated the place to start drawing each row by multiplying the height of a row by the row index. Both used a float to calculate the address. That gave a 24-bit mantissa and so you started to hit rounding errors after a few tens of thousands of rows. I had a program that had a few million rows and so you’d see increasingly large gaps between adjacent ones. Each row was rendered correctly, just in the wrong place.
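It's easy to reproduce the shape of that bug (a standalone sketch with a made-up row height, not the NSTableView code):

```c
#include <stdio.h>

int main(void) {
    const double row_height = 17.0;              /* made-up row height in points */
    for (int row = 1000000; row < 1000003; row++) {
        float  y_float  = (float)row_height * (float)row;   /* 24-bit mantissa */
        double y_double = row_height * (double)row;
        printf("row %d: float y = %.1f  double y = %.1f\n",
               row, (double)y_float, y_double);
    }
    return 0;
}
```

With single precision, these adjacent rows land 16 and then 18 points apart instead of a steady 17, which is exactly the drift described above.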
I found this during Apple’s 64-bit transition and the s/float/CGFloat/ change combined with defining CGFloat as double in Cocoa fixed it. On GNUstep, we also fixed it on 32-bit platforms by adding some explicit (double) casts and truncating the result.
In theory, floating-point values have the same graceful degradation property as BigInt—they just become less precise as the values get larger—but in practice the rounding errors matter when you care about the difference between two adjacent things.