My first thought was the double-zero problem you usually get from ones’ complement, but this actually reserves a sentinel “NaN” value.
The problem there is that part of what makes two’s complement so great is that the representation also makes basic addition, subtraction, shifts, etc. very efficient to implement.
The other problem here is the gratuitous and unnecessary use of undefined behavior. UB is bad, and is meant to exist for things that cannot be specified, e.g. using an object after it has been freed, accessing memory through the wrong type, etc. Obviously the C and C++ committees decided to abandon that and created numerous security vulnerabilities over the years. If you don’t want to constrain the implementation, the correct course of action is unspecified behavior. That is, the implementation can choose what to do, but it must always do that; the compiler can optimize according to that choice, but it can’t go “that’s UB so I’ll just remove it” - which I also think is a bogus interpretation of what UB means, but that’s what we have.
The desire to say “NaN is UB” means that you can’t test for NaN, because by definition it isn’t defined, so the compiler can remove any NaN checks you insert. You might say “we’ll add a built-in with defined behaviour”, but that doesn’t help, because you gave the compiler the option to remove intermediate NaNs. E.g. if (__builtin_isnan(0/0)) doesn’t trip, because 0/0 can’t produce a NaN (that would be UB), so the value passed to isnan can’t be NaN, so isnan is false. This is obviously a dumb example, but it’s fundamentally the behavior you see for anything UB. Imagine isnan(x), which looks reasonable - but if x got its value from any arithmetic in the scope of the optimizer (which can include inlining), then isnan is always false.
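You can already see this dynamic today with signed overflow, which is UB for exactly this “the optimizer may assume it never happens” reason. A minimal sketch (not tied to the article’s INT_NAN):

/* The programmer intends an overflow check, but because signed overflow is
   UB the compiler may assume x + 1 never wraps; GCC and Clang at -O2 will
   typically fold the whole function to "return 0". */
int will_overflow(int x) {
    return x + 1 < x;
}

An integer NaN whose arithmetic is UB would put isnan checks in the same position.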
It’s not giving up the ease of two’s complement’s core addition operation (i.e. full adders), as I understood it (after re-reading - it’s late, and on first skim I thought they were advocating sign-magnitude), but rather reserving ~0 as a not-numerically-valid NaN. However, some efficiency would of course still be lost, in that the implementation would need to check inputs for NaN-ness, and overflow/underflow detection would be more complex as well.
But this only matters if we decide to build processors this way to support the representation/interpretation; otherwise it’s going to be handled in software layers above.
The problem isn’t the NAN value being tested; it is the preceding arithmetic, because of what UB means. Take
x = y / z;
if (__builtin_isnan(x)) doThingA();
else doThingB();
Because arithmetic producing INT_NAN is UB, the compiler is free to assume that no arithmetic produces it, which gives it the ability to convert the above into:
x = y / z;
doThingB();
Because the result “cannot be NAN”, the isnan checks can be removed.
I think the idea is that you’d only ever assign INT_NAN when you want to set the sentinel value.
However, unlike floating point arithmetic where e.g. 0/0 can result in NaN, under my model, no arithmetic operation on integers can result in INT_NAN as overflow is undefined behavior.
You wouldn’t be checking for INT_NAN after arithmetic: you’d only check before, to see if you have an empty optional. It’s not true that the compiler can remove every NaN check, only the ones immediately following arithmetic. And those don’t make sense in this model since there’s nothing meaningful you could do at that point. If you’re representing optional<int>, there’s no reason to check if your optional is suddenly empty after an arithmetic operation. If you want to prevent overflow from resulting in a NaN, you can insert a precondition before doing any arithmetic.
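Concretely, the discipline looks something like this (a sketch; INT_NAN here just stands for the proposed sentinel bit pattern, the old INT_MIN - it isn’t a real C constant, and get_or is a name I made up):

#include <limits.h>
#define INT_NAN INT_MIN   /* assumption: the sentinel occupies the old INT_MIN pattern */

int get_or(int maybe, int fallback) {
    /* test the sentinel before using the value, like checking an empty optional */
    return maybe == INT_NAN ? fallback : maybe;
}

You never ask “did that addition just produce INT_NAN?”; you establish beforehand that the operands are valid and that the operation can’t overflow.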
You wouldn’t be checking for INT_NAN after arithmetic: you’d only check before, to see if you have an empty optional.
No - that handles a parameter passed to you; it does not handle general arithmetic, unless you pre-check everything, e.g.:
x = a() + b() / c()
would become
_a = a()
if (isnan(_a)) ???
_b = b()
if (isnan(_b)) ???
_c = c()
if (isnan(_c)) ???
if (_c == 0) ???
x = _a + _b / _c
This is assuming the optional case, of course, but it runs into the standard footgun that comes from the (by now frankly stupid) belief that arithmetic edge cases should be undefined rather than unspecified, e.g.:
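Something along these lines (a sketch of the kind of snippet being reasoned about below; the printf body is just illustrative):
_a = a()
if (isnan(_a)) printf("got a NaN\n")
return 1 * _a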
In a world where arithmetic on nan is undefined behaviour, as the author specified, the isnan() and printf can be removed. Because we’ve said “arithmetic on nan is undefined behaviour” rather than “unspecified”, the compiler can do the following logic:
1. "1 * _a" is UB if _a is nan. Therefore _a cannot be nan.
2. "1 * _a" is "_a" by standard arithmetic
3. due to 1, isnan(_a) is false
4. if (isnan(_a)) will never be taken due to 3
5. no one uses _a so we can flatten
So we get the above code lowered to
return a()
This is how compilers reason about UB. It’s not “I have proved this path leads to UB so that’s an error”, it’s instead the much worse “I have proved that this path leads to UB, therefore this path cannot happen”.
Yes, this logic has led to security bugs, in numerous projects, and yet language committees continue to assign “UB” to specifiable behaviour.
It is quite reasonable to have a first-order language which only has total functions, and where NaNs are legal hardware values but we prove that they are impossible via preconditions. This is not the C++ context of the article, but it is still worth considering.
I’d like to see offset integers where they sort naturally in byte representation. So 00000000 is -128 and it goes up sequentially from there (01111111 is -1, 10000000 is 0, 10000001 is 1). It involves just flipping the first bit.
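This is just the two’s-complement encoding with the sign bit flipped, i.e. a bias of 128. A minimal sketch for 8 bits (function names are mine):

#include <stdint.h>

/* Map int8_t to a byte whose unsigned (memcmp) order matches numeric order. */
uint8_t to_sortable(int8_t x)    { return (uint8_t)x ^ 0x80u; }
int8_t  from_sortable(uint8_t b) { return (int8_t)(b ^ 0x80u); }

/* to_sortable(-128) == 0x00, to_sortable(-1) == 0x7F,
   to_sortable(0)    == 0x80, to_sortable(127) == 0xFF */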
You can attempt similar transformations on (non-subnormal) floats. I have played with an Avro-esque data format where you can compare (e.g. sort into order) data directly via memcmp without decoding or understanding the semantics of the data format. I still haven’t decided whether this is an awesome or terrible idea…
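The usual trick for floats looks something like this (a sketch; it assumes IEEE-754 binary32, that the result is then written out big-endian so memcmp sees the most significant byte first, and it ignores NaNs):

#include <stdint.h>
#include <string.h>

/* Map a float to a uint32_t whose unsigned order matches numeric order:
   negative values get all their bits flipped, non-negative values get the
   sign bit set. */
uint32_t float_to_sortable(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* bit-cast */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}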
Data structures for ordered lookups can be a lot more efficient when they are specialized to dumb lexicographic order instead of some arbitrary ordering function. Radix trees and my qp-trie depend on lexicographic ordering to give meaningfully ordered lookups.
There are gotchas in any operation once you start pushing against the boundaries of the representation. Sure, things may be more or less intuitively understandable without underlying knowledge of the implementation, but for a more obvious example: you cannot get the mathematically correct result of UINT16_MAX + 1 into a 16-bit unsigned integer. The solution is the same: select a datatype that fits the data and the operations you want to do on it.
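For instance (a minimal sketch):

#include <stdint.h>

uint16_t next(uint16_t x) {
    /* x + 1 is computed in int, but the return converts it back to
       uint16_t modulo 2^16, so next(UINT16_MAX) is 0, not 65536. */
    return x + 1;
}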
This is what I understood and how I still interpret it on re-read. The clearest indication I found is this phrase:
10000000 would be -127 using ones-complement but -0 using sign-and-magnitude.
I made a mistake in my previous post, it’s not ~0 that is the reserved value, ~0 is still the representation of -1. It is the interpretation of {msb=1, others=0} that is changed to INT_NAN rather than a valid number, but my interpretation is that beyond that, the representation remains two’s complement-like (4 bit example):
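Roughly (a sketch of the mapping as I read it, skipping the middle values):
0111 = 7, 0110 = 6, ..., 0001 = 1, 0000 = 0
1111 = -1, 1110 = -2, ..., 1001 = -7
1000 = INT_NAN (the pattern that would otherwise be -8)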
I see it now. Is this a novel idea? Does it have a name?
I think only arithmetic is meant to be UB on INT_NAN. You can still check for it, otherwise it can’t be used as a sentinel value.
K has this “INT_NAN” value, spelled 0N. You can tell it uses the 0b1000 representation because of how it wraps around: https://kparc.com/k
As the person who added time.Duration.Abs() to Go after being bitten by this issue, I have to admit, I love this idea.
the symmetry would make there be two zeros, or no zeros, though.
I thought they meant it would be the usual two’s complement but INT_MIN is a lie.
Edit: In fact, they explicitly say this:
Third, you’re getting an unused bit pattern 0b1’0000000, the old INT_MIN, which you can interpret however you like. While you in principle could turn it into some sort of negative zero for extra symmetry, please don’t
well now you have a NAN value - with absorption, or is it just undefined what happens to it?
There’s a whole section in the article about this.
Edit: I should really read my comments before clicking on “Post”. I didn’t mean to be dismissive - it’s just that this is exactly what the article is about.
apologies. I did not bother to read the article.
I’ve never seen this notation before. Is that operator just a stand-in for “some arithmetic operation”?
Yes, ¤ is sometimes used in place of “any binary operator”.
Thanks!
It’s nice to see a refreshing new take on old fundamentals!