Not entirely true. There is a mathematical structure known as a wheel (https://en.wikipedia.org/wiki/Wheel_theory) where division by zero is well-defined: it maps to a special “bottom” element. I have never seen anyone use this or seriously develop “wheel theory”, but it is fun to know that it exists!
I actually wrote my Bachelor thesis about this. See chapter 3.1, where I actually proved that division by zero is well-defined in terms of infinite limits in this projectively-extended form of real numbers (a “wheel”). See Definition 3.1, (3.1e) and Theorem 3.6. It all works, even dividing by infinity! :D
What I noticed is that you actually do not lose much when you define only one “infinity”, because you can express infinite limits to +infinity or -infinity as limits approaching the single infinity from below or above (see chapter 3.1.1).
Actually this number wheel is quite popular with next-generation computer arithmetic concepts like unums and posits. In the current Posit standard, the single infinity was replaced with a single symbol representing “not a real” (NaR). It’s all very exciting and I won’t go into it here, because I couldn’t do justice to just how much better posits are compared to floats!
One thing I’m pretty proud of is the overview in Table 2.1, which shows the problem: You really don’t need more than one bit-representation for NaN/NaR, but the IEEE floating-point numbers are very wasteful in this regard (and others).
While NaN representations make up only 0.05% of the bit patterns of 64 bit (double) floats, they make up 0.39% for 32 bit (single) floats and 3.12% for 16 bit (half) floats! The formula to calculate the ratio is simple: if n_e is the number of bits in the exponent and n_m is the number of bits in the mantissa, then we have 2^(1+n_e+n_m) floating point numbers and 2^(n_m+1)-2 NaN representations. The NaN percentage is therefore “p(n_e, n_m) = 100 / 2^(n_e) * (1 - 2^(-n_m))”.
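If you want to double-check those numbers, the formula is a one-liner; here is a throwaway C sketch (nothing more than the expression above):

#include <math.h>
#include <stdio.h>

/* p(n_e, n_m) = 100 / 2^n_e * (1 - 2^-n_m): fraction of all bit patterns that encode NaN */
static double nan_percentage(int ne, int nm) {
    return 100.0 / pow(2.0, ne) * (1.0 - pow(2.0, -nm));
}

int main(void) {
    printf("double (ne=11, nm=52): %.2f%%\n", nan_percentage(11, 52)); /* ~0.05% */
    printf("single (ne= 8, nm=23): %.2f%%\n", nan_percentage(8, 23));  /* ~0.39% */
    printf("half   (ne= 5, nm=10): %.2f%%\n", nan_percentage(5, 10));  /* ~3.12% */
    return 0;
}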
Mixed precision is currently a hot topic in HPC, as people move away from using doubles for everything, given they are often overkill, especially in AI. However, IEEE floats are bad enough at >=32 bits, let alone in the small-bit regimes. In some cases you want to use 8 bits, which is where IEEE floats just die. An 8 bit minifloat (4 exponent bits, 3 mantissa bits) wastes 5.5% of its bit representations on NaN. This is all wasted precision.
The idea behind posits is to use tapered precision, i.e. use a mixed-bit-length exponent. This works really well, as you gain a lot of precision with small exponents (i.e. values near 1 and -1, which are important) but have a crazy dynamic range as you can actually use all bits for the exponent (and have implicit 0-bits for the mantissa). In the face of tapered precision, the custom float formats by the major GPU manufacturers just look comically primitive.
You might think that posits would be more difficult to implement in hardware, but actually the opposite is true. IEEE floats have crazy edge cases (subnormals, signed 0, etc.) which all take up precious die space. There is a lot of low-hanging fruit here for proposing a better number system.
Sorry for the huge rant, but I just wanted to let you know that “wheel theory” is far from obscure or unused, and actually at the forefront of the next generation of computer arithmetic concepts.
Though if I understand the history correctly, the original intent of leaving tons of unused bits in NaN representations was to stash away error codes or other information that might be generated at different steps in a numerical pipeline, right? They never ended up being actually used for that, but what did happen eventually is that people started stashing other info into them instead, like pointers, and we got the NaN-boxing now ubiquitous in dynamic language runtimes like JS and LuaJIT. So it’s less a mistake and more a misdesign that turned out ok anyway, at least for the people doing things other than hardcore numeric code. That said, you can’t really NaN-box much interesting info inside an 8-bit float, so the IEEE repr is indeed wasteful there, especially when the entire goal of an 8-bit float is to squeeze as much data into as little space as possible.
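For readers who haven’t seen NaN-boxing: a quiet NaN leaves 51 payload bits free, which is plenty for a 48-bit pointer. A minimal C sketch of the trick (the tag layout here is invented for illustration; real engines like LuaJIT and JavaScriptCore use more elaborate schemes and canonicalize “real” NaNs so they can’t collide with boxed values):

#include <stdint.h>
#include <string.h>

typedef uint64_t Value;

/* Sign bit + all-ones exponent + quiet bit: anything with these bits set is a NaN,
   and this sketch reserves that whole range for boxed pointers. */
#define BOX_TAG 0xFFF8000000000000ULL

static Value box_double(double d) {
    Value v;
    memcpy(&v, &d, sizeof v);                  /* ordinary doubles are stored verbatim */
    return v;
}

static Value box_ptr(void *p) {
    return BOX_TAG | (uint64_t)(uintptr_t)p;   /* assumes the pointer fits in 48 bits */
}

static int is_boxed_ptr(Value v) {
    return (v & BOX_TAG) == BOX_TAG;
}

static void *unbox_ptr(Value v) {
    return (void *)(uintptr_t)(v & 0x0000FFFFFFFFFFFFULL);
}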
Out of curiosity, does the posit spec at all suffer for having a single NaR value instead of separate NaN and infinity values? I vaguely understand how you can coalesce +inf and -inf into a single infinity and it works out fine, but when I think about it in terms of what error cases produce what results, to me infinity and NaN express different things, with NaN being the more severe one. Is it just a matter of re-learning how to think about a different model, or are there useful distinctions between the two that posits lose?
As far as I know, there was no original intent to allow metadata in NaN representations; JS/LuaJIT just made smart use of them. The question is always what you want your number format to be: should it be able to contain metadata, or only information about the represented number? If you outright design the format to be able to contain metadata, you force everybody’s hand, because you sacrifice precision and dynamic range in the process. If you want to store metadata on the computation, I find it much more sensible to have a record type of a float plus a bitstring for flags or something. I see no context outside of fully controlled arithmetic environments (where you could go with the record type anyway) in which one would be able to make use of the additional information.
Regarding your other point: Posits are not yet standardized and there’s some back and forth regarding infinity and NaR and what to use in posits, because you can’t really divide by zero, even though it’s well defined. I personally don’t see too much of an issue with having no infinity-representation, because from my experience as a numerical mathematician, an unexpected infinity is usually the same as a NaN condition and requires the same actions at the end of the day, especially because Infs very quickly decay into NaNs anyway. This is why I prefer NaR and this is what ended up in the standard.
The only thing I personally really need in a number system is a 100% contagious NaR which indicates to me that something is afoot. An investigation of the numerical code would then reveal the origin of the problem. I never had the case where an infinity instead of a NaN would have told me anything more.
To be completely clear: Posit rounding is defined such that any number larger than the largest posit is rounded to the largest posit (and the smallest accordingly). So you never have the case, in contrast to IEEE floats, where an input is rounded to NaR/+-infinity. Given +-infinity is by construction “not a real”, I find it to be somewhat of a violation to allow this transition with arithmetic operations that are well-defined and defined to yield only reals.
Dropping infinity also infinitely reduces the necessary complexity for hardware implementations. IEEE floats are surreal with all their edge cases! :D
I vaguely understand how you can coalesce +inf and -inf into a single infinity and it works out fine
But you lose some features. My vague understanding is that +/-0 and +/-inf exist to support better handling of branch cuts. Kahan says:
Except at logarithmic branch points, those functions can all be continuous up to and onto their boundary slits when zero has a sign that behaves as specified by IEEE standards for floating-point arithmetic; but those functions must be discontinuous on one side of each slit when zero is unsigned. Thus does the sign of zero lay down a trail from computer hardware through programming language compilers, run-time support libraries and applications programmers to, finally, mathematical analysts.
Yes, this was the post-hoc justification for signed zero, but it creates many more problems than it solves, introducing many, many special rules and gotchas. If you do proper numerical analysis, you don’t need such things to hold your hand. Instead, since it’s totally unexpected for the mathematician, it leads to more errors.
It’s a little-known fact that Kahan actually disliked what the industry/IEEE did to his original floating-point concepts (I don’t know how he sees it today), and this is not the only case where he apparently did some mental gymnastics after the fact to justify bad design and save face.
I had never heard of that concept. I want to share how I understand dividing by zero from calc2 and then relate that back to what you just shared.
In calc 2 you explore “limits” of an equation. This is going to take some context, though:
Understanding limits
To figure out the limit of 1/x as x approaches 1, you would imagine starting at some number slightly greater than 1, say 1.1 and gradually getting smaller and checking the result:
1/1.1
1/1.01
1/1.001
etc.
But that’s not all. You also do it from the other direction so 0.9 would be:
1/0.9
1/0.99
1/0.999
etc.
The answer for the “limit of 1/x as x approaches 1” is 1. This is true because the values approaching from both directions converge to the same number (even if they never actually quite reach it). Wolfram Alpha agrees: https://www.wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+1
But, limits don’t have to converge to a number, they can also converge to negative or positive infinity.
Limit of dividing by zero
Now instead of converging on 1, let’s converge on zero. What is the “limit of 1/x as x approaches zero”?
We would check 0.1 and go down:
0.1
0.01
0.001
etc.
And as it goes up:
-0.1
-0.01
-0.001
etc.
The problem here is that coming from the top and going down, the result approaches (positive) zero, and starting at the bottom and going up, the result approaches negative zero, which is a thing I promise https://en.wikipedia.org/wiki/Signed_zero. Since these values don’t converge to the same value, the answer is unknown; it cannot be both answers. Therefore division by zero (under this model) is unknowable, and Wolfram Alpha agrees https://wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+0.
As a “well actually” technical correctness note: I’m explaining this as I intuit it, whereas in reality 1/x as x approaches 0 goes to positive infinity and negative infinity. I know it has something to do with Taylor expansion, but I’ve been out of school too long to explain or remember why. Even so, my explanation is “correct enough” to convey the underlying concept.
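If you want to see that “in reality” behaviour numerically, a few lines of C make it obvious that the two one-sided results blow up towards positive and negative infinity rather than settling anywhere:

#include <stdio.h>

int main(void) {
    double x = 0.1;
    for (int i = 0; i < 5; i++, x /= 10.0)   /* x -> 0 from above, -x -> 0 from below */
        printf("1/%g = %g    1/%g = %g\n", x, 1.0 / x, -x, 1.0 / -x);
    return 0;
}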
Wheel
As you get into higher mathematics I find that it feels more philosophical than concrete. There are different ways to look at the world, and if you define the base rules differently then you can have different mathematical frameworks (somewhat like, but not exactly the same as, how there is “regular” and “quantum” physics).
It looks like the wheel says “negative and positive infinity aren’t different”, which is quite convenient for a lot of calculations, and then suddenly 1/x does converge to something as it goes to zero.
limits don’t have to converge to a number, they can also converge to negative or positive infinity.
“As a “well actually” technical correctness note”: if one is working in the real numbers, which don’t include infinities, a limit returning infinity is more strictly called “diverging” to positive or negative infinity. A limit can also “diverge by oscillation”, if the value of the expression of which the limit is taken keeps changing for ever, like sin(x) as x tends to infinity.
What this says is that the limit of 1/x as x approaches 0 is not defined, not that 1/0 itself is undefined. Consider sin(x)/x. If you do the math, the limit as x → 0 is 1, but sin(0)/0 = 0/0.
Another interesting example is the Heaviside function, x < 0 ? 0 : 1. Limit → 0 from the left is 0, limit from the right is 1, so the limit doesn’t exist. But the function is well defined at 0!
You can express limits as approaching from a direction though, can’t you? So you can say that lim -> +0 is 1 and lim -> -0 is 0. It’s not that the limit doesn’t exist, but a single limit doesn’t exist, right?
Why is this all way more fun to think about now than when I was taking calc 1 and actually needed to know it?
To me, a more noteworthy one is that the limit of pow(x, x) as x approaches zero is 1. x/x = 1 for almost all values of x, but pow(x, x) = 1 is much rarer.
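A quick numeric check of that limit in C, for anyone who wants to watch it creep up on 1:

#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 0.1;
    for (int i = 0; i < 6; i++, x /= 10.0)   /* x -> 0 from above */
        printf("pow(%g, %g) = %.6f\n", x, x, pow(x, x));
    return 0;
}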
There’s one additional problem with Optane: because it relies on cache flushes for persistence, it isn’t cache coherent. This means that you’re effectively limited to doing writes from a single core to any given cache-line-sized block of Optane memory. Building anything that defines a total ordering using the interaction of the CPU’s memory model and the flush-based model for NVRAM is an open research problem and a lot of papers are published that do it wrong. Expecting normal programmers to get it right is a fantasy.
I read that section. It talks about the need to flush the cache and that being a serialising operation on the current pipeline, but it doesn’t mention anything about shared memory between cores.
I would guess the main reason this doesn’t see more use (I have in fact heard of it before, though I don’t remember where) is that it’s not general and it inhibits debugging. The classic approach to constant propagation, using abstract interpretation, has neither of these issues.
FreeBSD for hosting OSes (via bhyve) and database servers seems like a good contender to Linux.
I’ve been using FreeBSD as a DB server for a really long time. Postgres has a couple of knobs, so it doesn’t do the things ZFS takes care of. Then you can fully rely on it, even for high traffic applications.
Or if you want to squeeze out every last bit you go with UFS. But I haven’t done that in a really long time.
Apple’s CoreFoundation library makes extensive use of low-bit pointer tagging. This allows it to store small integers (and maybe floats?), booleans and short strings directly in a pointer without allocating memory.
The encoding gets pretty complex, especially for strings; IIRC there are multiple string encodings, one of which can crunch characters down to 5 bits so it can store a dozen(?) characters in a 64-bit pointer. This sounds expensive, but I assume there are SIMD tricks to speed it up, and it’s still going to be way faster than allocating and dereferencing a pointer.
Any reference on the 5 bit encoding? This is in APIs that get called from Objective C or Swift?
The low-bit tagging seems like an obvious and portable win these days, although I know Lua avoided it because it’s not strictly ANSI C. Rust also got rid of small string optimization early in its life, apparently due to code size.
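For anyone who hasn’t seen it, the low-bit trick is only a handful of lines. A bare-bones C sketch (it leans on the allocator giving 8-byte alignment plus a couple of implementation-defined conversions, which is exactly the not-strictly-ANSI part Lua was avoiding):

#include <assert.h>
#include <stdint.h>

typedef uintptr_t Value;

/* Bit 0 set: the remaining 63 bits are a small integer. Bit 0 clear: a real pointer,
   which is safe because 8-byte-aligned pointers always have their low three bits zero. */
static Value    box_int(intptr_t i)   { return ((uintptr_t)i << 1) | 1; }
static intptr_t unbox_int(Value v)    { return (intptr_t)v >> 1; }
static int      is_small_int(Value v) { return v & 1; }

static Value box_obj(void *p) {
    assert(((uintptr_t)p & 7) == 0);  /* relies on allocator alignment */
    return (uintptr_t)p;
}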
You got it — the reference I remember is from Mike Ash’s blog, which has been dormant for a few years, but the archives are a treasure trove of low-level info about Apple’s runtime.
The CoreFoundation types are exposed in Obj-C as NSString, NSNumber, NSDate, NSValue. They also show up in Swift for bridging purposes, but the native Swift string and number classes are implemented differently (in Swift.)
The various tagging schemes that objc (and by proxy Swift’s interop) uses are internal implementation details that can (and have) changed, so they’re not API. Instead objc_msgSend and family handle it directly - similar to the myriad refcount stores and whatnot.
I was actually looking at Mike Ash’s blog this week for info on tagged pointers:
How about 5 bits? This isn’t totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q. 5 bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.
I was actually looking at the blog because I was wondering if ref counting in the unused 16 bits of a pointer might be feasible. It would give you up to 65k references, which is more than enough for many (most?) use cases. That would slim down the size of ref counted values and might make them as cache friendly as a GCed value. Might not be thread safe though.
Wow, skimming the rest of the post, this is a lot more subtle than I would have expected, and also relatively recent – OS X 10.10 as of 2014.
If the length is between 0 and 7, store the string as raw eight-bit characters.
If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX”.
If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013”
The five-bit alphabet is extremely limited, and doesn’t include the letter b! That letter must not be common enough to warrant a place in the 32 hallowed characters of the five-bit alphabet.
Pretty crazy!
I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.
For what I’m doing, we have to call into libc more, and pass it C strings, so we don’t control that part. The allocation to decode and make it look like a C string is problematic. Not just slow, but creates an ownership problem.
I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.
It makes me wonder if that is a headwind for adoption to Swift on other platforms. Is the language so tuned to performance on a single platform that it makes the code difficult to port?
It seems like they are also doing some pretty intricate things with C++ interop. As an outside observer (and a relatively ignorant one at that), it seems like it would be very difficult to generalize some of this work.
This stuff is part of CoreFoundation, not Swift. Swift on Apple platforms has some pretty sophisticated interop / FFI with it, for compatibility with Objective-C code, but that isn’t present in Swift on other platforms.
It shouldn’t be hard to port. In GNUstep, we adopted a compressed-strings-in-pointers encoding some years before Apple (though not the 5-bit one; doing that well requires analysis from run-time profiling of a large set of real-world applications, data I didn’t have access to). The interface for iterating over strings allows the caller to provide a buffer. These strings are, by definition, of small bounded length, so converting to a C string for interoperability is trivial with the caller providing the buffer on its stack.
It does have some very nice performance properties. A lot of dictionaries use small strings as keys. If you check the length on the way in and try to convert mutable strings used for lookup to small strings then you know that the result either is or isn’t a small string. This lets you skip a bunch of comparisons after you’ve found the hash bucket.
I haven’t followed Swift closely, but I do have the feeling that the focus is entirely on Apple’s platform, and there are some fairly hard tradeoffs with respect to portability / open source.
Just like I think of Google’s entire stack as a vertically integrated embedded system (hardware up to cloud and apps), Apple seems to be architected in a similar way. They get some pretty big benefits from the vertical integration and control
Andreas Kling mentioned that Apple is a huge inspiration for SerenityOS – basically doing everything yourself and not taking dependencies, which is kind of the opposite of most open source, which is about reuse and portability
It seems like if Swift were going to be popular on Linux or Windows, that would have already happened by now. Looks like it’s about 9 years since the first release now
You can’t put the ref-count in a pointer, because a pointer is a reference. If you increment a count in a pointer, that doesn’t affect any other pointers to the object, so no one else knows you added a reference.
CoreFoundation does (IIRC) store refcounts outside objects. My memory is hazy, but it might reserve a few bits for an internal refcount, and when that pins at its max value, the real refcount moves to a global hash table. The answer probably lies in Mike Ash’s blog. (Swift-native class objects don’t do this, though.)
[Update: just saw your other post below. What CF does uses spare bits in a pointer field inside the object; I thought you meant putting the refcount in pointers to the object.]
The refcount for a normal objc object is kept in a few bits of the isa pointer, and then moved to a side table once the refcount exceeds the available bits. The result is that there’s no major memory overhead for the refcount in normal objects.
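To sketch the idea in C (field widths invented for illustration; the real non-pointer isa layout differs per architecture and has changed across OS releases):

#include <stdint.h>

/* Low bits hold the class pointer; spare high bits hold an inline refcount. */
#define RC_SHIFT   48
#define RC_MASK    (0x7FFFULL << RC_SHIFT)     /* 15 bits of inline refcount */
#define SIDETABLE  (1ULL << 63)                /* "the real count lives in a side table" */

static uint64_t retain(uint64_t isa) {
    if (isa & SIDETABLE)
        return isa;                            /* count is maintained elsewhere */
    uint64_t rc = (isa & RC_MASK) >> RC_SHIFT;
    if (rc == (RC_MASK >> RC_SHIFT))           /* inline count about to overflow */
        return isa | SIDETABLE;                /* a real runtime would also seed the side table */
    return (isa & ~RC_MASK) | ((rc + 1) << RC_SHIFT);
}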
Yup. Plus the WebKit “Gigacage” and v8 (the JavaScript runtime) use this mechanism for sandboxing. I hear another browser engine is considering something similar.
On Intel, the 5 bit encoding can be optimized by using pdep to expand each 5 bits into a full byte, and pshufb to do a SIMD table lookup. I don’t think Arm has something like pdep though.
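Something like this (x86-64 with BMI2, compile with -mbmi2; the packing layout is an assumption, and a real implementation would replace the lookup loop with pshufb/vpermb as described):

#include <immintrin.h>
#include <stdint.h>

/* Spread eight 5-bit characters (40 bits) into eight byte lanes in one instruction,
   then map each lane through the 32-entry alphabet. */
static void decode8(uint64_t packed, const char table[32], char out[8]) {
    uint64_t lanes = _pdep_u64(packed, 0x1f1f1f1f1f1f1f1fULL);  /* 5 bits -> low bits of each byte */
    for (int i = 0; i < 8; i++)
        out[i] = table[(lanes >> (8 * i)) & 0x1f];
}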
I don’t know much about ARM, but I guess you could:
Broadcast the lower and upper 32 bits of the NSString each into a simd register v1, v2.
Shift and mask each lane by a separate amount so that the relevant 5 bits are aligned with 16 bit lanes (and the relevant lanes don’t overlap between v1, v2).
Or v1, v2
Use a tbl instruction to reconstruct the original order.
It’s not what led me to start writing the article but it’s definitely relevant. It’s also probably used way more in practice than putting data in the upper bits. Thanks @snej for the additional note - I actually came across some of the Mike Ash blog posts last night too. I’ll add a reference to them in.
For CHERI C, we found a lot of cases of storing things in the low bits (not least in LLVM’s compressed integer pointer pair template). This is a big part of why we support (on big CHERI) out of bounds representation beyond the end of a pointer’s range: so you can store data in the low bits of a pointer to the end of an array (you can’t dereference it without bringing it back).
Most (not quite all) of the places that stored data in the top bits were creating tagged unions of a pointer and some other data, and rarely stored anything in the top bits other than a ‘this is not a pointer’ pattern, which can be replaced by the tag bit on CHERI (giving a full 128 bits of space for other data, something I started to explore but didn’t pursue very far and which I expect to be very useful for dynamic languages). A small number of things stored a pointee type in the high bits.
Morello separates the value field of the capability into address and flags. With TBI enabled, you can store data in the top 8 bits. I believe this was mostly done to allow MTE composition later. It has some fun implications on arithmetic.
That it is not common practice to ship debug symbols with production binaries is, in my opinion, quite sad and insensible. One argument is obfuscation. That never stopped any reverse engineer. Another is binary sizes. My bin directories are a few piddling gigabytes; they can bear to grow a bit in exchange for a better user experience.
That it is not common practice to ship debug symbols with production binaries is, in my opinion, quite sad
Agreed.
Another is binary sizes […] they can bear to grow a bit in exchange for a better user experience.
Unfortunately this doesn’t apply to everyone. At $WORK, stripping all of the binaries contained in the package we ship to customers results in saving almost a gigabyte. Our customers sometimes try to download our product from behind awful corporate firewalls/antivirii that will randomly interrupt downloads, so shipping the tiniest package we can is very important. A couple of my remote colleagues are also on terribly slow networks where that extra gigabyte would be a real pain when attempting to bisect the whole package itself.
Another example was Debian’s firefox-dbgsym package, which IIRC was also a multi-gigabyte monstrosity. Clearly not something that should be shipped to everyone.
Was coming here to say basically this. Storage is cheap, but transfer can be expensive, and time can be very expensive. We made our life much better at $WORK last year when we split our software into data and code and turned a 2.5 GB download into a 2 GB download that seldom changes and a 0.5 GB download that changes all the time.
You may want to experiment with stripping debug sections and debug symbols but keeping other symbol names. In my experience, that results in a binary that’s almost as small as a stripped binary, but you get function names rather than byte offsets in stack traces. It’s not as nice to work with in a debugger as a binary with full debug info, but it’s much nicer than a completely stripped binary.
With the strip tool, use --strip-debug to strip debug symbols and debug sections but keep other symbols.
Another is binary sizes. My bin directories are a few piddling gigabytes; they can bear to grow a bit in exchange for a better user experience.
In what way is the user experience better? At best, users get better stack traces when things crash, but if you’re recording those stack traces for debugging and know the package version then you can symbolise them later. Unless you’re a developer, you are unlikely to have any use for debug symbols. If you are a developer then you can install the debug symbols separately when you need them.
Debug symbols are usually 2-10 times larger than the whole of the rest of the binaries. This doesn’t just impact disk size (though it has a huge impact on container start times); it also impacts downloads. You may have FTTP, but for a lot of users, downloading an extra 10-100 MiBs of debug symbols for each package when they update is a big impact.
I just had a look at a FreeBSD system (FreeBSD ships with separate debug info). The /usr/lib directory on this system contains 9.9 MiBs of shared libraries, 27 MiBs of debug info. That’s three times as much disk space and network bandwidth consumed to distribute the debug symbols as the libraries. I’m very happy that that’s an optional install so I can install it on dev systems but not production ones.
I don’t really care about debug symbols, and I wouldn’t want to need +700% of additional disk space used for something I will never use, and without which everything works perfectly fine:
-rwxrwxr-x 1 x x 25M lis 24 13:41 app.without-symbols*
-rwxrwxr-x 1 x x 184M lis 24 13:41 app.with-symbols*
In rare cases where something crashes, I can download debug symbols later (if they’re available), in case I need them.
One argument is obfuscation. That never stopped any reverse engineer.
This is a protection based on a cost-to-reward ratio. It weeds out less determined reverse engineers, but doesn’t affect more determined ones. So it does stop some reverse engineers, just not everyone. It increases the cost of reversing to a point where some reversers bail out, because the cost is higher than the reward.
How often do users need the debug symbols? My educated guess is “almost never”. We complain about bloat all the time, why bloat things further with unnecessary things, then?
Making debug symbols easily and readily available separately benefits both the end-user (who neither needs nor wants them) and the developer (who does, and can install them trivially). In my experience it is a much nicer practice, with almost the same benefits, at a fraction of the cost.
I don’t. And I don’t find that it is ‘trivial’ to install debug symbols (case in point, else-thread: ‘it seems that packed is currently unusable on Linux’). Something l have noticed other people say they value is open-source contributions; wouldn’t it be a great help if it were as easy as possible to debug, modify, and contribute to open source software?
If it were a simple matter of changing a setting in my system package manager’s configuration file, to get debug symbols, source code, and dirty build artifacts—with the default being as it currently is—that would be one thing. But this is not a feature anyone implements, as far as I know.
And I don’t find that it is ‘trivial’ to install debug symbols
Then that should be fixed, rather than shipping debug symbols by default, which would be useless 99% of the time.
If it were a simple matter of changing a setting in my system package manager’s configuration file, to get debug symbols, source code, and dirty build artifacts—with the default being as it currently is—that would be one thing. But this is not a feature anyone implements, as far as I know.
Debian has a debuginfod server, which gives you trivial access to debug symbols: all you have to do is either have elfutils installed, or export a single environment variable (more information). While making source code similarly easy to access isn’t a thing yet, that’s also simple to automate, and it’s usually an apt-get source away.
It’s a bit more work, yes, but you won’t get gigabytes of near useless stuff by default, and that’s a big win in my book.
Don’t forget about apt-get install somepackage-dbg! (which might require enabling the debug symbols repo? can’t recall.)
Debugging software from Debian repos is a pretty painless experience. I’ve been much more willing to debug minor bugs in random software since switching to it.
I agree with the sentiment. It would probably make more sense to have the common practice be that package maintainers release software exactly as is described in the article: the stripped binary, and an addon containing all of the debug symbols. Best of both worlds!
I wouldn’t normally post video content featuring myself but this video was particularly well received.
Since it was published, people pointed out two mistakes I made:
Go has been able to export dynamic libraries for a while now. My knowledge was from before that time, and I also got confused, thinking that you could not export C ABI functions at all, while in fact you can. That said, having a runtime still makes Go not a viable C replacement in the most direct sense of the expression.
Zig used to only support pointer arithmetic by converting the pointer to an int, applying the operation, and then converting it back to a pointer. Since a few months ago, [*]T (and related) started supporting arithmetic. That’s a pointer type that you don’t touch directly often, as you normally would use a slice (ptr + len).
having a runtime still makes Go not a viable C replacement
What you mean is that Go has a garbage collector.
C has a runtime. It is called “the C runtime” and is traditionally abbreviated as “crt”. On Linux systems with GCC installed, there are files named “crt*.o” somewhere under /usr/lib that are part of the C runtime. This is distinct from and in addition to the standard C library (libc). If I compile the C program “int main() { return 0; }” using GCC, then I get about 2K of code and data, even though I’m not calling any library functions. This 2K of stuff comes from the C runtime. [However, note that I’m producing a dynamically linked executable. If I try using ‘gcc -static’ then I get an executable with 780K of code (it looks like glibc), and I don’t know how to make that smaller.]
Rust also has a runtime, even though the Rust-lang.org home page claims that it does not! If I compile the rust program “fn main() {}” (which references no library functions) then I get a static executable that is over 300K, and that’s due to the Rust runtime. Supposedly most of this is due to the standard panic handler. Here is some documentation about the Rust runtime: https://doc.rust-lang.org/reference/runtime.html, which says that the panic handler is part of the Rust runtime.
Zig seems like the best choice if you want to build static executables with a minimal runtime. I compiled “pub fn main() !void {}”, and got a static executable with 660K of code and data. Twice the size of the corresponding Rust executable. A lot of this runtime code seems to involve runtime safety checks and a panic handler. If I rebuild using ReleaseFast then I get 190K of code, which again includes a panic handler. If I rebuild with “zig build -Doptimize=ReleaseSmall” then I get a much smaller static executable with only 6K of code. I don’t know how to make C static executables this small (on Linux).
I believe he was more interested in whatever tricks the Catholic Church was pulling to keep the Word of God from everyone. So quite apropos to the comment from our calvin.
yeah, I really don’t understand people that think it makes sense to downplay the fact that a language like Go can pause execution, realloc an entire green stack somewhere else and fixup all pointers, while being really fixated on crt.
This answer is like “dry cleaning actually uses liquids”. You’re correct in the strict sense, but also ignoring everything people mean by “having a runtime” in the common-yet-imprecise sense.
Runtimes of C and Rust (and probably Zig’s too, although I’m unsure about their async) are relatively small, non-invasive, and play nicely with other runtimes in the same process. These languages can produce static libraries that are easily usable in programs written in other languages. That’s generally not the case in languages that are said to “have a runtime”, in the sense that the runtime is substantially larger, more involved in execution of the program, and may cause problems if it’s not the only runtime in the process (e.g. if it needs to control all I/O, or track every pointer).
Rust also has a runtime, even though the Rust-lang.org home page claims that it does not! If I compile the rust program “fn main() {}” (which references no library functions) then I get a static executable that is over 300K, and that’s due to the Rust runtime.
That’s due to the std library, which is linked by default if you’re compiling for a hosted target. It’s not part of the Rust language, which is why people say Rust doesn’t have a runtime.
A Rust program that just prints hello world is about 9K:
The Rust Runtime Environment is entirely optional. You can in-fact compile a Rust program that does not reference any of std, alloc or core. You will be stuck with a very restricted environment (similar to what happens if you do this in C).
It should also be noted that when you simply compile a Rust program, the stdlib isn’t LTO optimized or otherwise shrunk down (loadbearing * here). You can disable that and only bring what you need. You can also disable the startup wrapper which handles some early init stuff, you can remove the default panic handler entirely and even disable the OOM handler.
Additionally, running in no-core mode will require you to implement a few core constructs the compiler is looking for yourself, since you’ll be missing quite literally everything that holds rust together (such as operators).
C has a runtime. It is called “the C runtime” and is traditionally abbreviated as “crt”. On Linux systems with GCC installed, there are files named “crt*.o” somewhere under /usr/lib that are part of the C runtime. This is distinct from and in addition to the standard C library (libc). If I compile the C program “int main() { return 0; }” using GCC, then I get about 2K of code and data, even though I’m not calling any library functions. This 2K of stuff comes from the C runtime. [However, note that I’m producing a dynamically linked executable. If I try using ‘gcc -static’ then I get an executable with 780K of code (it looks like glibc), and I don’t know how to make that smaller.]
It sounds like you are describing gcc, not C in general.
Windows also has a C runtime – and worse, it was not distributed with the operating system!
It was called msvcrt.dll as far as I remember – Microsoft Visual Studio C runtime. I remember you had to copy it around to get some programs to work.
This was over 15 years ago – not sure what the situation is like today.
edit: To clarify, C does have a runtime, but you don’t have to use it. Kernels and Windows user space don’t, but Linux user space does.
The Windows kernel doesn’t use the C runtime, as far as I know.
Many/most Windows user space apps don’t use the C runtime. They use C APIs provided by Windows.
But portable ANSI C applications often use the C Runtime DLL I mentioned.
The Linux kernel doesn’t use the C runtime.
For example, printf() is part of the C runtime, but the kernel doesn’t use it. It has its own string formatting routines.
Linux user space uses the C runtime – that’s what GNU is – most apps and CLI tools on Linux link with GNU libc, etc. Or musl libc.
Another consideration is that modern malloc()s have a lot in common with garbage collectors. They have some unpredictable performance characteristics, similar to a GC runtime.
Recognizing that this is a tangent, how many C compilers are in your dataset to judge what’s “typical”? Do clang, tcc, kenc, cproc, lacc, and scc do this too?
I don’t know about hobbyist C compilers, but what do you think calls main? A crt0.o that contains things like _start is pretty standard for mainstream Unix C compilers.
I’m not sure what you’re trying to say here. Are you implying that e.g. tcc, kenc, cproc don’t have any startup or exit code?
C programs expect to get argv, have stdout, and atexit working. These things are part of C and its standard library, and compilers need to insert code that makes them work.
I’m asking if @calvin is referring to the two mainstream Unix C compilers, gcc and clang, in which case his statement could be strengthened from “pretty standard” to “universal.”
I’m describing the situation when you use C to write programs that run under an operating system like DOS, Linux, Windows, MacOS, etc. If your C program has a function called int main(int argc, char **argv), then there is a C runtime that provides the operating system entry point and calls main. The ISO C standard calls this a “hosted execution environment”. The situation where C might not have a runtime is called a “freestanding environment”.
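As a tiny illustration of that hosted/freestanding split: on Linux/x86-64 with GCC you can opt out of the C runtime and supply the entry point yourself (the build flags and raw syscall below are assumptions for that platform; without crt there is nobody to call main or exit for you):

/* build: gcc -nostdlib -static -o tiny tiny.c */
void _start(void) {
    /* No crt0, so we must terminate the process ourselves via the exit syscall. */
    __asm__ volatile (
        "mov $60, %%rax\n\t"   /* SYS_exit on x86-64 Linux */
        "xor %%rdi, %%rdi\n\t" /* exit status 0 */
        "syscall"
        ::: "rax", "rdi");
}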
Zig used to only support pointer arithmetic by converting the pointer to an int, applying the operation, and then converting it back to a pointer.
Does Zig not model pointer provenance then? There has been a lot of discussion in the Rust community about how pointer/int casts break the compiler’s ability to reason about provenance. As I understand it, you can have either pointer/int casts or pointer provenance, but not both.
As I understand it, you can have either pointer/int casts or pointer provenance, but not both.
That is not quite the case. There are several proposed provenance models for Rust, C and C++, all of which have to have some way to deal with integer/pointer casts.
tl;dr there are two categories of approaches: “PNVI” (provenance not via integers) and “PVI” (provenance via integers).
In PVI, integers carry provenance. So if you cast a pointer to an int and then back, you retain the entire chain of custody and the resulting pointer works exactly as the original did. However you now run into some tricky questions like “What is the provenance of a + b? Or a ^ b?”. This is (from what I can tell, but I’m no expert on this) what CHERI C does: Their uintptr_t retains provenance, and they make some choices about what that means for integer math.
In PNVI, integers do not carry provenance, so you need another way to make pointer->int->pointer casts work. The current favorite seems to be “PNVI-ae”, for “address exposed”, where casting a pointer to an int leaks its provenance to “the environment”. Casting back to a pointer will look for an exposed provenance for that address and if there is one, you get that (and an invalid pointer otherwise).
This avoids the tricky questions of PVI and allows a bunch of patterns that PVI doesn’t, such as the infamous XOR-linked-list. However, it is also extremely painful and slow to implement on platforms that physically manifest provenance such as CHERI.
For regular optimizing compilers however, it’s not a big problem: their aliasing/provenance analysis already has to be able to cope with pointers that “escape” their analysis horizon, be it due to FFI, inline asm or (and yes, this is literally called out as allowed by the C standard) round-tripping pointers through the filesystem via fprintf("%p") and fscanf("%p"). PNVI-ae at worst inhibits their ability to optimize code that does a bunch of int/pointer casts.
Now for Zig, if they adopt these rules as-is, this might mean a more substantial pessimization if int/pointer casts are more idiomatic and used in many more places. If more or less every pointer’s address is escaped, you do lose most of the optimizations allowed by provenance.
Their uintptr_t retains provenance, and they make some choices about what that means for integer math.
Note for those unfamiliar with CHERI: this means uintptr_t is 128 bits large when holding a 64 bit address.
Also, for languages that don’t need to be backward-compatible with code that casts freely between ints and pointers, there is the option to disallow int->ptr casts completely, and instead require supplying provenance explicitly when you want to construct a pointer from an integer. E.g. new_ptr_with_derived_provenance = old_ptr.new_from_address(some_int).
“…is an object in some category,” yes. Also, when all you have is a deductive system, you have a category. When all you have is a directed graph, you have a category. When all you have is an FSM, you have a category.
The original insight that led to this note was from Zelda 3. To traverse a superroom in that game, we traverse some of its rooms. To traverse a room, we start at a cutscene called the door animation, and end at another door. Rooms naturally form a graph and speedrunners rely heavily on the topology of that graph, particularly when playing randomized room layouts. So, one day when I was playing a randomized copy of Zelda 3, I thought about how a TAS script might automatically discover routes through the randomized graph, and realized that there is a complex interplay between the room graph and the item layout.
Methods such as bitwise XOR-ing a register with itself are not detected as breaking the CP
An odd comment. On aarch64, a xor-selfie is architecturally not allowed to be dependency-breaking. (Not technically true, but close enough, and unlike x86 there’s no reason for the microarchitecture to treat it specially.)
RISC-V performs 460,027,962 branches to complete STREAM. This is almost 15% of all instructions executed. If all of these are conditional branches, this is 460 million compare instructions that don’t have to be executed compared to the equivalent program running on AArch64.
Another odd comment. Because:
Compare-and-branch is fused on high-end aarch64 parts
Since aarch64 has flags, it doesn’t always need an explicit additional comparison before a branch
Aarch64 has some compare-and-branch instructions: cb(n)z and tb(n)z
Aarch64 has csel and ccmp, which can reduce instruction count and eliminate branches entirely, improving usable ilp.
More generally, I found the paper somewhat shallow. I would have liked to see an exploration of where the differences come from. The applications they considered are also quite domain-specific.
I completely agree with the four kinds of optimisation, though the first and second are often intertwined. Great discussion on the difference between best case and common case on your data - that’s often critical and is largely overlooked in university classes.
Option 4 is the most interesting one. Apple’s FSEvents mechanism scales a lot better than Linux’s inotify because it tells you only the directories that have changed, not the individual files. You can then look at the modification time of the files to see which have changed since your last epoch for that directory. This also has the advantage that it does some coalescing for free. I’d be inclined to subdivide these into two categories:
Accept less precision in a first round that can be recaptured later.
Accept less precision because you don’t actually care.
FSEvents is an example of the first case. You don’t care which files are modified because traversing a directory and looking at the modification times is cheap (or, at least, less likely to cause performance problems than blocking writes until an event notification consumer has caught up). In contrast, using lower precision floating point for coordinates in a game because users won’t notice if an object is less than one pixel away from its ‘correct’ position is an example of the latter. The first has some overlap with choosing a better algorithm.
I think there’s a higher-level category, which is ‘reevaluate your requirements’. Some of the biggest performance gains that I’ve made have come from applying the techniques in the article to a system that was already carefully optimised. Between the time that it was written and the time that I’d looked at it, the requirements had changed significantly. Sometimes this applies to the input changing as it interfaces to a different system (you built a thing for processing SMS, now it’s used with WhatsApp, suddenly short 7-bit ASCII strings are replaced with longer UTF-8 sequences). Sometimes it’s that the hardware changed (globals got slower, passing arguments got cheaper, parallelism became the big win, cache misses got really expensive, branch predictor performance became critical).
Indeed. This is also, as I discuss here, an argument in favour of abstraction: maybe low-level requirements can be changed without affecting higher-level ones at all.
If you’re increasing the bit depth of the single greyscale value 1234 5678, wouldn’t it be more accurate to turn it into 1234 5678 7878 7878 (repeating the last byte) rather than 1234 5678 1234 5678? That’s what my intuition says, but I don’t have a formal argument for it.
The three-value and four-value CSS hex color syntaxes are defined as repeating the first hex digit for each channel (RGBA). Color #fa0 is equivalent to #ffaa00, and #fa08 is equivalent to #ffaa0088. There is no CSS syntax that turns two digits for a channel into more digits, though, so we can’t compare CSS’s design to my idea above.
In more detail: first of all, this is in decimal instead of binary/hex, for clarity. 1234 is a 4-digit decimal color, and we want to convert it to 8-digit decimal. Dividing by 9999 (the largest 4-digit decimal value) converts 1234 to a fraction between 0 and 1 inclusive. Multiplying by 99999999 (the largest 8-digit decimal value) converts that fraction to 8 digits. Though you need to do the multiplication before the division, because of integer math.
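To tie this back to the 8-bit-to-16-bit greyscale case: replicating the whole byte is exactly that multiply-then-divide rescaling, which a small C check confirms (it prints no mismatches):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    for (uint32_t v = 0; v <= 255; v++) {
        uint32_t replicated = (v << 8) | v;            /* 0xAB -> 0xABAB */
        uint32_t rescaled   = (v * 65535 + 127) / 255; /* round(v / 255 * 65535) */
        if (replicated != rescaled)
            printf("mismatch at %u\n", v);
    }
    return 0;
}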
In CSS syntax, for RGBA, each digit in “#ffaa0088” is a hexadecimal digit (4 bits). The last byte is 2 of those digits.
In the article, for greyscale, each digit in “12345678” is a binary digit (1 bit). The last byte is all 8 digits. Repeating the last (and only) byte would be “12345678 12345678”.
Moonchild’s sibling comment is a good answer to the accuracy question. I wouldn’t have known myself! CSS colors are actually a good example of the proposed technique in action since each 4-bit hex digit gets expanded to an 8-bit channel intensity by copying. That could’ve been a nice lead to my article.
It supports dynamically-sized values like closures, vectors, strings…
The drawback I see is that the caller keeps the callee’s entire stack frame (not just the returned value) until it returns. And if it too returns a variable-size value, it doesn’t pop the stack either.
This seems like it could cause major stack bloat in some circumstances (especially with recursion…)
No, but they can handle forwarding. If I have a deep call stack that returns an object from the leaf to near the root, this can be done without copies in C++ with guaranteed copy elision. The root caller allocates space, passes the pointer down the stack, and each return uses the same space. This cannot support variable-sized structures, but supporting variable-sized structures via any non-heap mechanism would be impossible for anything other than a one-deep call. This is an unfortunate programming abstraction because outlining a piece of code changes the semantics. In particular, if your language exposes object identity then this will make a bunch of optimisations visible to the programmer.
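A C-flavoured sketch of that forwarding pattern (names invented): the root caller owns the storage and every level just passes the destination down, so returning the object costs no copies, but the size has to be fixed at the root.

#include <stddef.h>

typedef struct { size_t len; char data[256]; } Blob;   /* fixed-size: the root must know the bound */

static void leaf(Blob *out) {
    out->len = 3;
    out->data[0] = 'a'; out->data[1] = 'b'; out->data[2] = 'c';
}

static void middle(Blob *out) {
    leaf(out);            /* just forwards the caller's storage */
}

int main(void) {
    Blob b;               /* allocated once, at the root of the call chain */
    middle(&b);           /* every level writes into the same space: no copies */
    return (int)b.len;
}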
Putting variable sized objects on the stack increases the amount of metadata that needs to be tracked. If you’re passing them up the stack with a bunch of copies and a load of metadata then you may well find that heap allocating is faster (if the object’s address never escapes, you can delete it without adding it to the GC).
The stack-based allocation shown in the Azul slides doesn’t apply here — we’re talking about objects being returned from a function. Azul’s escape analysis (just like Go’s) would detect that the object’s lifespan exceeds its allocating function’s, so it would be forced to heap-allocate it.
Escape analysis is very simple and everybody does it. The slide deck describes a more sophisticated mechanism which can in fact cope with escaping objects.
Ah, you’re right. It just requires custom hardware to implement the necessary write barrier. That was in 2006 — I assume it never took off, at least not in any mainstream sense?
It didn’t take off, indeed. That’s not so much a technical indictment of it as an indication that external factors did not work out favourably. For instance, cliff click was also responsible for the c4 garbage collector, which azul uses successfully to this day, but which did not start to enter the mainstream until oracle decided to fund an oss clone of it a few years ago (zgc). There are various sociotechnical factors—pretty much all of them artificial—which disincentivise production of hardware gc support.
Well, the point of my comment is that SQL isn’t as special a language as the author seemed to think. In that sense, a couple exceptions don’t invalidate my point. I bet there are a couple places in SQL where subexpressions are guaranteed to be executed in some order too.
There’s a way of viewing Haskell where even do and guards don’t have “a notion of execution order”, though. The guard above is equivalent to:
if True == True then f 42 else (if False == False then f 43 else undefined)
And
do
x
y
z
is defined to be equivalent to (not sure if I’m getting this exactly right, but you get the idea):
x >>= \_ -> (y >>= \_ -> z)
In both cases, the required “execution order” can actually be viewed as purely functional code with nesting.
Oh ok, I understand. I suppose that the assertion is that if expressions don’t have side effects, and most things are expressions, then the language mostly doesn’t have side effects. I guess I misunderstood it because I think of expression evaluation as embedding an order of operations. Which is apparently where I went wrong.
Another way to think of it is in terms of equivalences. The blog post gave some equivalences between SQL expressions. In Haskell, these expressions are always equivalent, for any subexpressions X and Y:
X + Y === Y + X
f(X, Y) === f(Y, X)
This is unusual! Very few languages have these equivalences.
On the other hand, haskell loses a lot of equivalences because of laziness and in particular the use of syntactic dependencies. (Fast and loose reasoning isn’t actually correct.)
As someone interested in fast compilers, what are some good register allocation strategies for one pass compilers? I’ve been tinkering on an RV64 backend, but my register allocation is pretty primitive, as I don’t know the lifetime of variables ahead of time.
First of all, SQL, from the get-go, doesn’t have a notion of the execution order of things. Impure functions mess with that notion a bit, but the problem here is the impure functions—they’re the culprit here.
Ok, we can blame random() instead of the optimizer or the SQL language. But then what behavior should we expect from random()?
Evaluate once for the whole query.
Evaluate once per batch of rows.
Evaluate once per partition.
Evaluate once if there are fewer than 1000 rows; otherwise evaluate once per row.
Always return 0.4.
It wouldn’t be useful to permit all of these, so how do we define it in a way that’s both useful and permits optimizations?
I think the argument in this post is dubious at best. Obviously the optimizations that a database query engine performs should be semantics-preserving in the same way that compiler optimizers are. In the case of an ON clause in a join, the semantics are that the expression is evaluated on the cartesian product of the rows from the tables that are being joined. In the previous article, SQLite and CockroachDB optimize incorrectly because they assume that a function that doesn’t depend on the contents of one row is effectively constant for any row from that table. This isn’t true for random(), but I don’t think there are many functions that are weird in a way similar to random(), so it’s not a good example from which to generalize like this article does. The date_part() example is not an example of a side effect: it’s a straightforward dependency on the contents of a row, which is the thing that random() lacked, so it won’t break the push-down optimization in the same way.
If you push down date_part you might evaluate it on rows that it would not get evaluated on if you do not push it down because the other side of the join might be empty.
Oh right, now I get the point about empty joins - I thought it was performance, I missed the effect on correctness. There are lots of partial functions … I am dismayed that the Postgres docs for mathematical functions don’t appear to say what happens if you divide by zero or take the logarithm of a negative number (maybe I am looking in the wrong place?). Anyway, I was wondering if they raise an error or return NULL. In the context of a join, it would make the planner’s job easier if errors became NULL; maybe there’s a wrapper function to NULLify errors.
They raise errors. You could easily write your own safe_div and safe_log functions, and more importantly declare them immutable the same as would be for PG-native versions, but in a where context you could even more easily check in another predicate before proceeding to divide or take the log since ((x <> 0) and (n / x > 1)) only needs to fail the first branch.
Obviously the optimizations that a database query engine performs should be semantics-preserving in the same way that compiler optimizers are.
If the semantics aren’t defined (different databases don’t agree on them), what’s there to preserve? This is essentially undefined behaviour. It’s not as harmful as UB in C, but very similar in that the optimizer can produce different results because the results you got the first time around weren’t guaranteed anyway.
Similarly, if you use LIMIT 1 without an explicit ordering, you might get different results at different times, even with the same query and the same data because the optimizer might select a different plan (after a VACUUM or an ANALYZE, for example).
One could say these are cracks in the declarational facade that databases try to put up. The abstraction leaks and the underlying implementation/execution strategy shines through.
You can have a nondeterministic semantics. That is a perfectly fine thing to do. The specification describes a range of legal behaviours and intermediate states, instead of a single legal behaviour; and a well-formed program should ensure that, under all legal behaviours, it ends up in the same final state (or, rather, that all allowable final states maintain application invariants; for example, two newly added rows could get added in different orders and so have different ids, but that probably doesn’t matter).
I guess my point there, that I should have made more explicitly, was that I think answering this question in a coherent way is incompatible with a declarative language. Or at least the ones we are stuck with today. I could imagine a language designed to have a sane answer to this question and still have a lot of the good properties that a language like SQL has, but I don’t know what it would look like and I’m not sure SQL is it.
Yeah, I agree random() doesn’t fit well into the language. :(
But I bet you could rescue it! You could find a definition that explains the behaviors people expect, but is still declarative.
For example here are some things I expect about random():
multiple random() calls in the same expression are independent
random() calls on separate rows are independent
When a subquery uses random(), the subquery results for different rows of the outer query are independent.
multiple requests are independent (even though the query text is the same).
So you could define random() as a function that takes a whole bag of rows, and marks each row with a value such that the overall distribution comes out right. Or you could think of it as a function that maps each unique row ID to an independent random value–even though these dependencies aren’t visible in the user syntax.
there was no way to get Cinebench to use more than 64 CPU cores. Windows used all the cores, it was just Cinebench that seemed to have an issue
My vague recollection is that Windows divides cores into groups of 64 so that it can represent an affinity set with a 64-bit integer, so if you want to use more than 64 cores at a time, you have to do something special. If that’s true, then it is sort of the OS’s fault.
Does anyone have a strong feeling about this actually improving security? I’m seriously skeptical, but I haven’t given it a ton of thought. Any exploit devs care to comment?
Well, the obvious answer is that it would lower the surface area a little bit, right? Instead of worrying about libc and syscall(2) you just worry about libc. Whether that improves security I don’t know but considering the resources OpenBSD has, it is one less thing to worry about or deal with.
I feel the security benefits are theatre, but it enables better backwards compatibility in the long run - Windows and Solaris for example, have had good dynamic library based compatibility for years. Of course, OpenBSD doesn’t care about that part…
Right but I’m trying to understand if this mitigation would actually make exploitation of an existing vulnerability difficult. It feels like a mitigation without a threat model.
Go read up on ROP and stack pivots, especially on amd64 and variable length instruction architectures that make it impossible to completely remove ROP gadgets. There are very clear threat models already defined based on arbitrary code execution, especially remotely. Reducing the syscall surface area as much as possible minimizes the success probability.
I’m surprised no one has yet decided to use a separate stack for data, especially on x86-64 with more registers, a register-based parameter-passing convention and a larger memory space: leave RSP for CALL/RET and use another region of memory for the “stack frame”. That way, overwrites of the data stack won’t affect the return-address stack at all. Given how quickly 32-bit systems have largely disappeared from the Internet, I think such an approach would be easier (and faster) than all the address randomization, relinking, stack canaries, et al.
Or (as painful as this is for me to say), just stop using C! Has anyone managed to successfully exploit a program in Rust? Go?
A similar feature is called “shadow stacks” - return addresses get pushed both to the standard stack and to a separate return stack, and the addresses are checked to match in the function epilogue. It’s supported in all the big C compilers. I can’t speak to how often it’s actually used.
Further afield, Forth also exposes fully separate data and return stacks. So it’s been done.
As far as performance goes, you’re losing an extra register for the other stack, which can be significant in some cases, and also memory locality. Cost varies but has been measured around 10%.
In addition to the safe stack / shadow stack work, it’s worth pointing out SPARC. SPARC had a model of register windows arranged in a circle. You had 8 private registers, 8 shared with the caller and 8 shared with the callee (you could reuse any caller-shared one you weren’t using for return, and all callee-shared ones between calls). The first S in SPARC stood for ‘scalable’ because the number of windows was not architecturally specified. When you ran out, you’d trap and spill the oldest one (you should do this asynchronously, but I don’t believe any implementation that shipped ever did). This meant that the register spill region had to be separate from the stack. This gave complete protection from stack buffer overflows turning into ROP gadgets.
Pure software variants have been tricky to adopt because they’re incredibly ABI disruptive. Anything that creates stacks needs to allocate space. Anything that generates code needs to preserve an extra register (not just across calls but also when calling other functions). Anything that does stack unwinding needs to know about them.
If you’re compiling everything together and static linking, it’s feasible.
I know about the register windows on the SPARC, but I never really dove into how it interacted with the operating system with regards to context switches (process or threads)—it seems like it could be expensive.
Switching threads was very expensive. A large SPARC core could have 8+ register windows, so needed to save at least 64 registers. That’s fairly small in comparison with a modern vector extension, but still large.
On later superscalar designs, it actually wasn’t that bad. Modern processors allocate L1 lines on store, so spilling a full cache line is quite cheap. If you’re doing this often, you can even skip the store buffer and just write directly from registers to the cache line. I think switching from a thread required spilling all used register windows, but resuming a thread just required reading back the top and then the others could be faulted in later. SPARC had very lightweight traps for this kind of thing (and TLB fills - 32-bit SPARC had a software-managed TLB, though later SPARCs were spending 50% of total CPU time in that trap handler so they added some hardware assist with 64-bit versions).
I think the biggest mistake that SPARC made was making the register window spill synchronous. When you ran out of windows, you took a synchronous fault and spilled the oldest one(s). They should have made this fully asynchronous. Even on the microarchitectures of the early SPARCs, spilling could have reused unused cycles in the load-store pipeline. On newer ones with register renaming, you can shunt values directly from the rename unit to a spill FIFO and reduce rename register pressure. I think that’s what Rock did, but it was cancelled.
But nothing about this mitigation is unknown or randomized further, as far as I can tell. I don’t see how brop is important here or how it would be impacted by this. Maybe the attacker needs to be a bit pickier with their gadgets?
Any ROP technique needs to find and assemble gadgets. This would remove one possible type of gadget, making it harder to achieve arbitrary syscall execution especially in light of other mitigations like pledge(2) or pinsyscall(2).
Assuming they do this to all interesting syscalls, it would make shellcode writing a bit more painful, as now you actually have to deal with ASLR to find the libc versions. That said, ASLR isn’t a significant barrier in 99% of cases, so it’s not going to stop targeted attacks or skilled attackers. However, it seems it would also disallow statically linking libc, which is a huge con for such a minor gain IMO.
Disclaimer: It’s been almost a decade since I’ve attacked an OpenBSD machine on the job, so there may be additional protections I’m not aware of that make this change a more valuable protection.
Moreover, their parallel nature makes them memoryless and stateless. This translates to: “You can’t store or share data between pixels or shader executions.”
This might be true in the browser (I have no idea), but in general it is not; you can run full-on C and C++ on the GPU, and fine-grained concurrency and coordination are an important part of some GPU algorithms (such as scans).
This didn’t even list my favourite Go footgun. Updates to fields of slice types are not guaranteed to be atomic and it is undefined behaviour to write to them concurrently from two threads (goroutines that are scheduled in parallel). If you do, a reader in another thread may observe the base of one slice and the bounds of another. If you are in an environment where you’re allowed to run arbitrary Go code but not the unsafe package, you can do this intentionally to escape from the Go memory safety model. If you’re not writing malicious code, the fact that Go’s type system has no notion of linearity means that you have to be very careful to not accidentally alias objects between threads and trigger this kind of thing intermittently, in a fashion that’s almost impossible to reproduce, let alone debug.
It’s a deeply special kind of language that manages to have a global garbage collector and still not be memory safe.
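Concretely, the racy pattern being described looks something like this (a hypothetical sketch of my own; running it with go run -race reports the data race):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	small := make([]byte, 1)
	big := make([]byte, 1<<20)

	// A slice header is a (pointer, length, capacity) triple, and writes to it
	// are not atomic. Two goroutines race on the same variable below with no
	// synchronization; a concurrent reader may observe a torn header, i.e. the
	// base pointer of one slice paired with the length of the other.
	shared := small

	go func() {
		for {
			shared = small
			shared = big
		}
	}()

	go func() {
		for {
			s := shared // unsynchronized read: the header may be torn
			_ = len(s)  // indexing with a mismatched length is how memory safety is lost
		}
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("done (run with -race to see the report)")
}
```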
Because it would be expensive and they can’t be arsed. The Go people are fundamentally C people: it’s the fault of the developer if they fall into the language’s traps.
Slices are not the only affected type either, interfaces and maps also suffer from data races for sure. Generally speaking data races undermine memory safety and there’s a fair number of possible ways to get data races in Go: https://www.uber.com/blog/data-race-patterns-in-go/
The Go memory model is that everything is thread unsafe unless you put it behind a mutex or a channel. Why would slices be different from everything else?
That does not contradict my comment in any way. And the only thing of interest in that essay is that rsc is self-servingly contradictory:
Go’s approach sits between [C’s data races = invalid program and java’s data races = defined and safe memory behaviour]. Programs with data races are invalid in the sense that an implementation may report the race and terminate the program. But otherwise, programs with data races have defined semantics with a limited number of outcomes, making errant programs more reliable and easier to debug.
is immediately followed by
Note that this means that races on multiword data structures […] can in turn lead to arbitrary memory corruption.
I mean, it’s in the middle. It’s not classic UB nasal demons, but it could be a cause of data corruption. Maybe it’s a bad choice, but that’s what he chose and you can read his blog series to try to work out why he chose it.
I guess I’m just saying, it was a deliberate engineering tradeoff not “dysfunctional” because they didn’t “wrap it in a critical section or something”.
It’s funny how he first pays extensive lip service to Tony Hoare’s philosophy, especially this one:
As well as being very simple to use, a software program must be very difficult to misuse; it must be kind to programming errors, giving clear indication of their occurrence, and never becoming unpredictable in its effects
only to then cheerfully update the docs to explain that you can get arbitrary memory corruption if you “misuse” the program (well, language).
```
n := 0
for e := list; e != nil; e = e.next {
    n++
}
i := *p
*q = 1
```
All loops must terminate. Therefore we can assume this loop must terminate. Therefore we can rewrite the loop to access *p or *q before the loop happens as an optimization. (But what if it’s an infinite loop? Well, that’s UB, so we can assume it won’t.)
Go is not any less safe than C/C++ and it specifically rules out some of the UB “optimizations” in C/C++ that give UB a bad name. So, slightly safer than C/C++, less safe than other languages.
I also think “wrap it in a critical section or something” is really breezing past how difficult this would be. Every slice/interface/map would need some kind of mutex or the type system would have to be radically different to prevent aliasing. You’re either talking about a huge GIL style performance hit or a totally different language with a much stronger type system.
Every slice/interface/map would need some kind of mutex or the type system would have to be radically different to prevent aliasing. You’re either talking about a huge GIL style performance hit or a totally different language with a much stronger type system.
I doubt it would be a “huge GIL style” performance impact - it’d be a mutex per slice, not a global mutex over all slices. There shouldn’t be much contention on these mutexes if you’re using it like “you’re supposed to”, anyway!
It seems even these days “it’s not fast enough” is still sufficient argument to avoid important safety features. Which is strange, because runtime bounds checking is part of Go. That’s also quite a big performance impact.
I guess it’s just a matter of time before someone opens a can of CVEs on some large Go codebases, and then we can have this whole discussion again.
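For scale, the “mutex per slice” idea amounts to something like this hand-written sketch (my own illustration, not a language proposal); the debate here is essentially about whether emitting this for every shared slice is an acceptable cost:

```go
package main

import (
	"fmt"
	"sync"
)

// lockedSlice wraps a slice header in a mutex so that assignments and reads
// can never produce a torn (pointer, length) pair. This is roughly what the
// compiler would have to generate for every slice it cannot prove is unshared.
type lockedSlice struct {
	mu sync.Mutex
	s  []byte
}

func (l *lockedSlice) store(s []byte) {
	l.mu.Lock()
	l.s = s
	l.mu.Unlock()
}

func (l *lockedSlice) load() []byte {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.s
}

func main() {
	var shared lockedSlice
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			shared.store(make([]byte, n+1)) // no torn headers possible
		}(i)
	}
	wg.Wait()
	fmt.Println(len(shared.load()))
}
```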
Performance. Assigning a slice-typed variable is a common operation. If you had to acquire some kind of lock every time that you couldn’t prove non-aliasing then it would slow Go code down a lot. As @masklinn says interface-typed fields in Go are a pair of a pointer to the object and a pointer to the type, so it’s possible to get type confusion in these by racing and reading one type and the other value.
For maps it’s somewhat more excusable. A map is a complex data structure and updating a complex data structure concurrently is a bad idea unless it’s explicitly designed for concurrency. I think the map implementation used to be in C (special, Plan 9-flavoured C), but it might be pure Go now. If it is, then races there should just leave it in a broken state (just as updating any other non-concurrent complex data structure with data races can), rather than break the fundamental guarantees of the environment.
It’s far less excusable for interface pointers and slices because these are value types that are carefully designed to look like they are primitive values. You pass them around just as you would an integer. If two threads write to the same integer variable at the same time, one will win the race and you’ll see a value that makes sense. This is not the case with other Go types.
The depressing thing is that a type system that understands isolation can address this. When I write parallel code, there’s one rule that I want to follow: no object is both mutable and aliased between threads. Go provides absolutely nothing in the type system to help you spot when you’ve broken this rule. For a language that was designed from the ground up for concurrency, this is inexcusable. This is probably why most of the successful Go programs that I’ve seen use it as statically compiled Python and don’t use goroutines at all (or in a handful of very special cases).
You pass them around just as you would an integer. If two threads write to the same integer variable at the same time, one will win the race and you’ll see a value that makes sense.
I learned Go from the Go Tour back in ~2011 or so; IIRC, slices and interfaces were explained as being fat pointers or tuples, so I’ve always thought of them as such rather than thinking of them as integers. As a result, I’ve never really run into these problems. I’m very curious how often people are running into this? One of the things I like about Go is it’s pretty straightforward how things work, so you can intuit about stuff like this. I suppose if someone was writing Go like many people write Python or JavaScript–with no idea about the underlying machinery–this might get people into trouble, but on the other hand I don’t know how you can write Go without understanding some basics about memory layout, pointer traversal, etc. Maybe I’ve just been doing this for too long to empathize well with beginners…
Performance. Assigning a slice-typed variable is a common operation. If you had to acquire some kind of lock every time that you couldn’t prove non-aliasing then it would slow Go code down a lot.
How often is that? Go should be in a pretty good position to reason about aliasing.
I agree with you about the way Go should have… the thing Go should have done. But it would probably be more on-brand for them to fix this by designing atomic slices that avoid atomic operations until they are actually contended. Do we know if they’ve tried that?
How often is that? Go should be in a pretty good position to reason about aliasing.
Why? The type system does not give the compiler any information that it can use to make that kind of decision. If the slice is a local and has not been address taken, it can be assumed to be safe. In pretty much any other situation, the compiler has to assume that it can have escaped to a concurrent context.
I agree with you about the way Go should have… the thing Go should have done. But it would probably be more on-brand for them to fix this by designing atomic slices that avoid atomic operations until they are actually contended. Do we know if they’ve tried that?
I think they were very reluctant to introduce atomics at all, they certainly don’t want more. They want you to design code where objects are held by a single goroutine and you never do racy updates.
Why? The type system does not give the compiler any information that it can use to make that kind of decision. If the slice is a local and has not been address taken, it can be assumed to be safe. In pretty much any other situation, the compiler has to assume that it can have escaped to a concurrent context.
TBF in most cases slices are passed by value, in which case there is aliasing on the backing buffer (and there can be data races on that depending what it stores), but there’s no aliasing on the slice itself. Most issues would occur with slices getting captured by a closure or go statement in which case they essentially “had their address taken”.
A bigger issue, I would think, is that you’d need to add a tripleword pseudo-atomic which pretty much means you need a lock (interfaces are only a doubleword so it’s a bit better). And while in theory you could use the low bits of the pointer as your lock flag I’m not sure there’s such a thing as a masked compare exchange not to mention a sub-byte futex / mutex?
Sorry, I still don’t understand. I have used tagged pointers with CAS many times and I don’t see the problem. Find an unused high or low bit in the pointer, save the initial value of the pointer, mask and test the bit you care about, and if the test passed then set/clear that bit and CAS the old value to this new value. Depending on the context, if the CAS fails then either abort or keep retrying (maybe with backoff) until it succeeds.
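That save/mask/test/CAS loop looks roughly like this sketch (using a uint64 to stand in for the pointer-sized word, so the garbage collector isn’t involved and the low bit is assumed free thanks to alignment):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const tagBit = 1 // low bit assumed unused because of pointer alignment

// trySetTag saves the current word, tests the tag bit, and CASes in the
// tagged value; it retries only if the word changed underneath it.
func trySetTag(word *uint64) bool {
	for {
		old := atomic.LoadUint64(word)
		if old&tagBit != 0 {
			return false // already tagged
		}
		if atomic.CompareAndSwapUint64(word, old, old|tagBit) {
			return true
		}
		// CAS failed: someone else changed the word, reload and retry.
	}
}

func clearTag(word *uint64) {
	for {
		old := atomic.LoadUint64(word)
		if atomic.CompareAndSwapUint64(word, old, old&^tagBit) {
			return
		}
	}
}

func main() {
	var w uint64 = 0x1000 // stands in for an aligned pointer value
	fmt.Println(trySetTag(&w), trySetTag(&w)) // true false
	clearTag(&w)
	fmt.Println(trySetTag(&w)) // true
}
```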
Recent revisions of x86 and arm have fast two-word atomic reads and writes (avx and armv8.1 respectively). But more obscure architectures do not, so there are tradeoffs w.r.t. performance portability.
Because accessing anything across threads is already undefined behavior, and your idea would murder the performance of correct code for no real reason. Writing correct code is in no way difficult, and if you do happen to slip up, that’s why your app has a comprehensive test suite, and why you run go test -race in CI, which puts everything into “I want to be painfully slow” mode, but bombs as soon as you have a single cross-thread access without a synchronization edge.
If I want “arbitrary memory corruption”, I already know where to go for that. Do you really want memory-unsafety in a language that is marketed for externally-facing web servers? Java demonstrates that you can allow data races without compromising the integrity of the runtime itself.
I’ve been working with my current company and doing Go for about 9 years now. We’ve written several nontrivial services, a number of which handle more than 10k RPS. We’ve had zero production issues caused by data races, and… maybe two times in those nine years that someone wrote a race bug, which was caught by automated testing before it made it out the door. It’s not high on my list of concerns. The kinds of access patterns that could even theoretically run into this problem just don’t exist in our code, because people who understand the language have no inclination to write them.
Holy crap that’s dysfunctional! Why don’t they wrap it in a critical section or something to avoid this kind of bug?
I don’t know the specifics of this issue, but I do know that you’re not supposed to share a variable between go routines like that. If two go routines must work with the same data, you’re supposed to let them communicate it through a channel.
Whether that means it is OK to leave in a footgun like that is a different matter. But this is one of the many “problems with Go” that I somehow magically never encounter in real life.
I don’t see “so many footguns”. I see two things. The bit about append is something you learn in the first 30 minutes of using the language. And the other thing… I don’t even know. I can’t figure out what he thinks he’s doing there, or trying to prove, or what it might have to do with anything that might happen to anyone in the real world, because the presentation of the concept is basically nonexistent. And the “advice” about constructing slices is pure nonsense; there’s no justification given for it because none is possible.
Oh this is absolutely fascinating, thank you!
Out of curiosity, does the posit spec at all suffer for having a single NaR value instead of separate NaN and infinity values? I vaguely understand how you can coalesce +inf and -inf into a single infinity and it works out fine, but when I think about it in terms of what error cases produce what results, to me infinity and NaN express different things, with NaN being the more severe one. Is it just a matter of re-learning how to think about a different model, or are there useful distinctions between the two that posits lose?
Thanks for your remarks!
As far as I know, there was no original intent to allow metadata in NaN representations; JS and LuaJIT just made smart use of them. The question is always what you want your number format to be: should it be able to contain metadata, or only information about the represented number? If you outright design the format to be able to contain metadata, you force everybody’s hand, because you sacrifice precision and dynamic range in the process. If you want to store metadata about the computation, I find it much more sensible to have a record type of a float plus a bitstring for flags or something. Outside of fully controlled arithmetic environments, where you could go with the record type anyway, I see no context in which one would be able to make use of the additional information.
Regarding your other point: Posits are not yet standardized and there’s some back and forth regarding infinity and NaR and what to use in posits, because you can’t really divide by zero, even though it’s well defined. I personally don’t see too much of an issue with having no infinity-representation, because from my experience as a numerical mathematician, an unexpected infinity is usually the same as a NaN condition and requires the same actions at the end of the day, especially because Infs very quickly decay into NaNs anyway. This is why I prefer NaR and this is what ended up in the standard.
The only thing I personally really need in a number system is a 100% contagious NaR which indicates to me that something is afoot. An investigation of the numerical code would then reveal the origin of the problem. I never had the case where an infinity instead of a NaN would have told me anything more.
To be completely clear: Posit rounding is defined such that any number larger than the largest posit is rounded to the largest posit (and the smallest accordingly). So you never have the case, in contrast to IEEE floats, where an input is rounded to NaR/+-infinity. Given +-infinity is by construction “not a real”, I find it to be somewhat of a violation to allow this transition with arithmetic operations that are well-defined and defined to yield only reals.
Dropping infinity also infinitely reduces the necessary complexity for hardware implementations. IEEE floats are surreal with all their edge cases! :D
The original intended use case for NaNs was that they should store a pointer to the location that created them, to aid in debugging.
But you lose some features. My vague understanding is that +/-0 and +/-inf exist to support better handling of branch cuts. Kahan says:
Yes, this was the post-justification for signed zero, but it creates much more problems than it solves, creating many many special rules and gotchas. If you do proper numerical analysis, you don’t need such things to hold your hand. Instead, given it’s totally unexpected for the mathematician, it leads to more errors.
It’s a little known fact that Kahan actually disliked what the industry/IEEE did to his original floating point concepts (I don’t know how he sees it today), and this is not the only case where he apparently did some mental gymnastics to justify bad design afterwards to save face in a way.
I had never heard of that concept. I want to share how I understand dividing by zero from calc2 and then relate that back to what you just shared.
In calc 2 you explore “limits” of an equation. This is going to take some context, though:
Understanding limits
To figure out the limit of 1/x as x approaches 1, you would imagine starting at some number slightly greater than 1, say 1.1, and gradually getting smaller while checking the result. But that’s not all: you also do it from the other direction, so you would start at 0.9 and work your way up.
The answer for the “limit of 1/x as x approaches 1” is 1. This is true because approaching 1 from both directions converges to the same number (even if they never actually quite reach it). Wolfram Alpha agrees: https://www.wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+1
But limits don’t have to converge to a number, they can also converge to negative or positive infinity.
Limit of dividing by zero
Now instead of converging on 1, let’s converge on zero. What is the “limit of 1/x as x approaches zero”?
We would check 0.1 and go down, and then do the same from the other side, coming up towards zero from below.
The problem here is that coming from the top and going down the result approaches (positive) zero, and starting at the bottom and going up, the result approaches negative zero, which is a thing, I promise: https://en.wikipedia.org/wiki/Signed_zero. Since these values don’t converge to the same value, the answer is unknown; it cannot be both answers, therefore division by zero (under this model) is unknowable, and Wolfram Alpha agrees: https://wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+0.
As a “well actually” technical correctness note: I’m explaining this as I intuit it; in reality, 1/x as x approaches 0 goes to positive infinity and negative infinity. I know it has something to do with taylor expansion, but I’ve been out of school too long to explain or remember why. Even so, my explanation is “correct enough” to convey the underlying concept.
As you get into higher mathematics I find that it feels more philosophical than concrete. There are different ways to look at the world, and if you define the base rules differently then you could have different mathematical frameworks (somewhat like, but not exactly the same as, how there is “regular” and “quantum” physics).
Wheel
It looks like the wheel said “negative and positive infinity aren’t different”, which is quite convenient for a lot of calculations, and then suddenly 1/x does converge to something when it goes to zero.
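For reference, the one-sided limits that the “well actually” note refers to, written out in standard notation (added for completeness, not from the thread):

```latex
\lim_{x \to 0^{+}} \frac{1}{x} = +\infty,
\qquad
\lim_{x \to 0^{-}} \frac{1}{x} = -\infty
```

On the projectively extended reals, where there is a single unsigned infinity (as on the wheel), the two one-sided limits coincide and lim_{x -> 0} 1/x = infinity, which is the “does converge to something” above.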
“As a “well actually” technical correctness note”: if one is working in the real numbers, which don’t include infinities, a limit returning infinity is more strictly called “diverging” to positive or negative infinity. A limit can also “diverge by oscillation”, if the value of the expression of which the limit is taken keeps changing for ever, like sin(x) as x tends to infinity.
What this says is that the limit of 1/0 is not defined, not that 1/0 itself is undefined. Consider sin(x)/x. If you do the math, the limit as x → 0 is 1, but sin(0)/0 = 0/0. Another interesting example is the Heaviside function, x < 0 ? 0 : 1. Limit → 0 from the left is 0, limit from the right is 1, so the limit doesn’t exist. But the function is well defined at 0!
You can express limits as approaching from a direction though, can’t you? So you can say that lim -> +0 is 1 and lim -> -0 is 0. It’s not that the limit doesn’t exist, but a single limit doesn’t exist, right? Why is this all way more fun to think about now than when I was taking calc 1 and actually needed to know it?
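In symbols, for the Heaviside example above (writing H for the function x < 0 ? 0 : 1):

```latex
H(x) =
\begin{cases}
0 & x < 0 \\
1 & x \geq 0
\end{cases}
\qquad
\lim_{x \to 0^{-}} H(x) = 0,
\quad
\lim_{x \to 0^{+}} H(x) = 1 = H(0)
```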
Yeah, that’s right!
Another weird limit argument is that the limit of x/x as x goes to zero is 1.
To me, a more noteworthy one is that the limit of pow(x, x) as x approaches zero is 1. x/x = 1 for almost all values of x, but pow(x, x) = 1 is much rarer.
At least from a cursory reading of the wikipedia page, you have basically traded x/0 being defined for 0*1 != 0, and x/x != 1.
Yeah, you can also make a mathematical structure where division by zero is five. It’s not very useful though.
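For the record, the pow(x, x) limit mentioned a couple of comments up follows from writing x^x as e^(x ln x) (standard calculus, added for completeness):

```latex
\lim_{x \to 0^{+}} x^{x}
  = \lim_{x \to 0^{+}} e^{x \ln x}
  = e^{\lim_{x \to 0^{+}} x \ln x}
  = e^{0}
  = 1
```

since x ln x goes to 0 as x approaches 0 from the right.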
There’s one additional problem with Optane: because it relies on cache flushes for persistence, it isn’t cache coherent. This means that you’re effectively limited to doing writes from a single core to any given cache-line-sized block of Optane memory. Building anything that defines a total ordering using the interaction of the CPU’s memory model and the flush-based model for NVRAM is an open research problem and a lot of papers are published that do it wrong. Expecting normal programmers to get it right is a fantasy.
I think that the article deals with that, doesn’t it? In the section titled “Persistent Memory and Caches”.
It concludes:
I read that section. It talks about the need to flush the cache and that being a serialising operation on the current pipeline, but it doesn’t mention anything about shared memory between cores.
Ah, OK, ISWYM then.
Hence universal constructions, so we only have to get it right once.
I would guess the main reason this doesn’t see more use (I have in fact heard of it before, though I don’t remember where) is that it’s not general and it inhibits debugging. The classic approach to constant propagation, using abstract interpretation, has neither of these issues.
I think libFirm does some optimization passes implicitly during conversion from the AST to SSA.
That said, I agree that it makes generating error messages more difficult for the compiler.
This is good to hear. FreeBSD hosting OSes (via bhyve) and database servers seems like a good contender to Linux.
For me a recent effort to use FreeBSD in my (somewhat specialized) workflow had failed.
But overall I’m glad to hear FreeBSD is making progress not just as an OS, but as a community (both open source and commercial).
That (SMBFS 2.0 and 3.0) is one of the topics the Enterprise Working Group [1] will try to address in its work.
In the meantime - you can run an OpenSSH server on that Windows 11 box and mount shares on FreeBSD using sshfs(1).
You can use Android Studio under Linux Binary Compatibility, but I have not checked how well these instructions work:
… but yes, these are only workarounds for things that ‘just work’ on Linux.
[1] https://wiki.freebsd.org/EnterpriseWorkingGroup
Regards, vermaden
I’ve been using FreeBSD as a DB server for a really long time. Postgres has a couple of knobs to turn so it doesn’t duplicate the things ZFS already takes care of. Then you can fully rely on it, even for high-traffic applications.
Or if you want to squeeze out every last bit you go with UFS. But I haven’t done that in a really long time.
Apple’s CoreFoundation library makes extensive use of low-bit pointer tagging. This allows it to store small integers (and maybe floats?), booleans and short strings directly in a pointer without allocating memory.
The encoding gets pretty complex, especially for strings; IIRC there are multiple string encodings, one of which can crunch characters down to 5 bits so it can store a dozen(?) characters in a 64-bit pointer. This sounds expensive, but I assume there are SIMD tricks to speed it up, and it’s still going to be way faster than allocating and dereferencing a pointer.
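As a toy illustration of the idea (my own sketch, not Apple’s actual encoding): with a 27-character alphabet each character fits in 5 bits, so 11 characters plus a length and a “not a real pointer” tag bit fit into a single 64-bit word:

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed toy alphabet: space plus a-z, so every index fits in 5 bits.
const alphabet = " abcdefghijklmnopqrstuvwxyz"

// pack squeezes up to 11 characters into one 64-bit word.
// Layout: bit 0 = tag, bits 1-4 = length, bits 5+ = 5-bit characters.
func pack(s string) (uint64, bool) {
	if len(s) > 11 {
		return 0, false // too long for this layout
	}
	var chars uint64
	for i := len(s) - 1; i >= 0; i-- {
		idx := strings.IndexByte(alphabet, s[i])
		if idx < 0 {
			return 0, false // character not representable in 5 bits
		}
		chars = chars<<5 | uint64(idx)
	}
	return chars<<5 | uint64(len(s))<<1 | 1, true
}

// unpack reverses pack.
func unpack(w uint64) string {
	n := int((w >> 1) & 0xF)
	chars := w >> 5
	b := make([]byte, n)
	for i := 0; i < n; i++ {
		b[i] = alphabet[chars&0x1F]
		chars >>= 5
	}
	return string(b)
}

func main() {
	if w, ok := pack("hello world"); ok {
		fmt.Printf("%#x -> %q\n", w, unpack(w))
	}
}
```

The real encoding described in the Mike Ash post linked further down is denser than this, but the bit-twiddling flavour is the same.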
Any reference on the 5 bit encoding? This is in APIs that get called from Objective C or Swift?
The low-bit tagging seems like an obvious and portable win these days, although I know Lua avoided it because it’s not strictly ANSI C. Rust also got rid of its small string optimization early in its life, apparently due to code size
https://old.reddit.com/r/rust/comments/2slcs8/small_string_optimization_remove_as_mut_vec/
Though honestly I would have expected fewer heap allocations and cache misses to be a win for most string workloads
You got it — the reference I remember is from Mike Ash’s blog, which has been dormant for a few years, but the archives are a treasure trove of low-level info about Apple’s runtime.
The CoreFoundation types are exposed in Obj-C as NSString, NSNumber, NSDate, NSValue. They also show up in Swift for bridging purposes, but the native Swift string and number classes are implemented differently (in Swift.)
The various tagging schemes that objc (and by proxy Swift’s interop) uses are internal implementation details that can (and have) changed, so it’s not API. Instead objc_msgSend and family handle it directly - similar to the myriad refcount stores and what not.
I was actually looking at Mike Ash’s blog this week for info on tagged pointers:
https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html
I was actually looking at the blog because I was wondering if ref counting in the unused 16 bits of a pointer might be feasible. It would give you up to 65k references, which is more than enough for many (most?) use cases. That would slim down the size of ref counted values and might make them as cache friendly as a GCed value. Might not be thread safe though.
Wow, skimming the rest of the post, this is a lot more subtle than I would have expected, and also relatively recent – OS X 10.10 as of 2014.
Pretty crazy!
I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.
For what I’m doing, we have to call into libc more, and pass it C strings, so we don’t control that part. The allocation to decode and make it look like a C string is problematic. Not just slow, but creates an ownership problem.
It makes me wonder if that is a headwind for adoption to Swift on other platforms. Is the language so tuned to performance on a single platform that it makes the code difficult to port?
It seems like they are also doing some pretty intricate things with C++ interop. As an outside observer (and a relatively ignorant one at that), it seems like it would be very difficult to generalize some of this work.
This stuff is part of CoreFoundation, not Swift. Swift on Apple platforms has some pretty sophisticated interop / FFI with it, for compatibility with Objective-C code, but that isn’t present in Swift on other platforms.
It shouldn’t be hard to port. In GNUstep, we adopted a compressed-strings-in-pointers encoding some years before Apple (not the 5-bit one: doing that well requires analysis of data from run-time profiling of a large set of real-world applications, which I didn’t have access to). The interface for iterating over strings allows the caller to provide a buffer. These strings are, by definition, of small bounded length, and so converting to a C string for interoperability is trivial with the caller providing the buffer on its stack.
It does have some very nice performance properties. A lot of dictionaries use small strings as keys. If you check the length on the way in and try to convert mutable strings used for lookup to small strings then you know that the result either is or isn’t a small string. This lets you skip a bunch of comparisons after you’ve found the hash bucket.
Not really. In fact, Swift on Linux doesn’t have certain complexities that are present on Apple platforms due to the necessity of ObjC interop.
I haven’t followed Swift closely, but I do have the feeling that the focus is entirely on Apple’s platform, and there are some fairly hard tradeoffs with respect to portability / open source.
Just like I think of Google’s entire stack as a vertically integrated embedded system (hardware up to cloud and apps), Apple seems to be architected in a similar way. They get some pretty big benefits from the vertical integration and control
Andreas Kling mentioned that Apple is a huge inspiration for SerenityOS – basically doing everything yourself and not taking dependencies, which is kind of the opposite of most open source, which is about reuse and portability
It seems like if Swift were going to be popular on Linux or Windows, that would have already happened by now. Looks like it’s about 9 years since the first release now
You can’t put the ref-count in a pointer, because a pointer is a reference. If you increment a count in a pointer, that doesn’t affect any other pointers to the object, so no one else knows you added a reference.
CoreFoundation does (IIRC) store refcounts outside objects. My memory is hazy, but it might reserve a few bits for an internal refcount, and when that pins at its max value, the real refcount moves to a global hash table. The answer probably lies in Mike Ash’s blog. (Swift-native class objects don’t do this, though.)
[Update: just saw your other post below. What CF does uses spare bits in a pointer field inside the object; I thought you meant putting the refcount in pointers to the object.]
No, you were right in the first place. I was totally thinking about things wrong. I got wrapped around the tree a bit. Thank you for the correction.
It was this one - I added a link to it this morning after this thread!
The refcount for a normal objc object is kept in a few bits of the isa pointer, and then moved to a side table once the refcount exceeds the available bits. The result is that there’s no major memory overhead for the refcount in normal objects.
I did see that in this article from Mike Ash’s site:
https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html
It made me happy to see that it wasn’t a totally stupid idea on my part! :]
Yup. Plus the WebKit “Gigacage” and v8 (the JavaScript runtime) use this mechanism for sandboxing. I hear another browser engine is considering something similar.
On Intel, the 5 bit encoding can be optimized by using pdep to expand each 5 bits into a full byte, and pshufb to do a SIMD table lookup. I don’t think Arm has something like pdep though.
Pdep is a bit overpowered. You can also do it with a multishift, though Arm doesn’t have those either…
I don’t know much about ARM, but I guess you could:
Storing tags in a few low bits isn’t really what this is about I think, as changes to total address space (adding high bits) don’t really impact it.
The article did talk about it, though.
It’s not what led me to start writing the article but it’s definitely relevant. It’s also probably used way more in practice than putting data in the upper bits. Thanks @snej for the additional note - I actually came across some of the Mike Ash blog posts last night too. I’ll add a reference to them in.
For CHERI C, we found a lot of cases of storing things in the low bits (not least in LLVM’s compressed integer pointer pair template). This is a big part of why we support (on big CHERI) out of bounds representation beyond the end of a pointer’s range: so you can store data in the low bits of a pointer to the end of an array (you can’t dereference it without bringing it back).
Most (not quite all) of the places where things stored data in the top bits, they were creating tagged unions of a pointer and some other data, and rarely storing anything in the top bits other than a ‘this is not a pointer’ pattern, which can be replaced by the tag bit on CHERI (giving a full 128 bits of space for other data, something I started to explore but didn’t pursue very far and which I expect to be very useful for dynamic languages). A small number of things stored a pointee type in the high bits.
Morello separates the value field of the capability into address and flags. With TBI enabled, you can store data in the top 8 bits. I believe this was mostly done to allow MTE composition later. It has some fun implications on arithmetic.
That it is not common practice to ship debug symbols with production binaries is, in my opinion, quite sad and insensible. One argument is obfuscation. That never stopped any reverse engineer. Another is binary sizes. My bin directories are a few piddling gigabytes; they can bear to grow a bit in exchange for a better user experience.
Agreed.
Unfortunately this doesn’t apply to everyone. At $WORK, stripping all of the binaries contained in the package we ship to customers results in saving almost a gigabyte. Our customers sometimes try to download our product from behind awful corporate firewalls/antivirii that will randomly interrupt downloads, so shipping the tiniest package we can is very important. A couple of my remote colleagues are also on terribly slow networks where that extra gigabyte would be a real pain when attempting to bisect the whole package itself.
Another example was Debian’s firefox-dbgsym package, which IIRC was also a multi-gigabyte monstrosity. Clearly not something that should be shipped to everyone.
Was coming here to say basically this. Storage is cheap, but transfer can be expensive, and time can be very expensive. We made our life much better at $WORK last year when we split our software into data and code and turned a 2.5 GB download into a 2 GB download that seldom changes and a 0.5 GB download that changes all the time.
You may want to experiment with stripping debug sections and debug symbols but keeping other symbol names. In my experience, that results in a binary that’s almost as small as a stripped binary, but you get function names rather than byte offsets in stack traces. It’s not as nice to work with in a debugger as a binary with full debug info, but it’s much nicer than a completely stripped binary.
With the strip tool, use --strip-debug to strip debug symbols and debug sections but keep other symbols.
In what way is the user experience better? At best, users get better stack traces when things crash, but if you’re recording those stack traces for debugging and know the package version then you can symbolise them later. Unless you’re a developer, you are unlikely to have any use for debug symbols. If you are a developer then you can install the debug symbols separately.
Debug symbols are usually 2-10 times larger than the whole of the rest of the binaries. This doesn’t just impact disk size (though it has a huge impact on container start times); it also impacts downloads. You may have FTTP, but for a lot of users, downloading an extra 10-100 MiBs of debug symbols for each package when they update is a big impact.
I just had a look at a FreeBSD system (FreeBSD ships with separate debug info). The /usr/lib directory on this system contains 9.9 MiBs of shared libraries, 27 MiBs of debug info. That’s three times as much disk space and network bandwidth consumed to distribute the debug symbols as the libraries. I’m very happy that that’s an optional install so I can install it on dev systems but not production ones.
I don’t really care about debug symbols, and I wouldn’t want to need +700% of additional disk space used for something I will never use, and without which everything works perfectly fine:
In rare cases where something crashes, I can download debug symbols later (if they’re available), in case I need them.
This is a protection based on a cost-to-reward ratio. It weeds out less determined reverse engineers, but doesn’t affect more determined ones. So it does stop some reverse engineers, just not everyone. It increases the cost of reversing to the point where some reversers bail out, because the cost is higher than the reward.
How often do users need the debug symbols? My educated guess is “almost never”. We complain about bloat all the time, why bloat things further with unnecessary things, then?
Making debug symbols easily and readily available separately benefits both the end-user (who doesn’t need nor want them) and the developer (who does, and can install them trivially); in my experience it is a much nicer practice, with almost the same benefits, at a fraction of the cost.
I don’t. And I don’t find that it is ‘trivial’ to install debug symbols (case in point, else-thread: ‘it seems that packed is currently unusable on Linux’). Something I have noticed other people say they value is open-source contributions; wouldn’t it be a great help if it were as easy as possible to debug, modify, and contribute to open source software?
If it were a simple matter of changing a setting in my system package manager’s configuration file, to get debug symbols, source code, and dirty build artifacts—with the default being as it currently is—that would be one thing. But this is not a feature anyone implements, as far as I know.
Then that should be fixed, rather than shipping debug symbols by default, which would be useless 99% of the time.
Debian has a debuginfod server, which gives you trivial access to debug symbols: all you have to do is either have elfutils installed, or export a single environment variable (more information). While making source code similarly easy to access isn’t a thing yet, that’s also simple to automate, and it’s usually an apt-get source away.
It’s a bit more work, yes, but you won’t get gigabytes of near useless stuff by default, and that’s a big win in my book.
Don’t forget about apt-get install somepackage-dbg! (which might require enabling the debug symbols repo? can’t recall.) Debugging software from Debian repos is a pretty painless experience. I’ve been much more willing to debug minor bugs in random software since switching to it.
I agree with the sentiment. It would probably make more sense to have the common practice be that package maintainers release software exactly as is described in the article: the stripped binary, and an addon containing all of the debug symbols. Best of both worlds!
I wouldn’t normally post video content featuring myself but this video was particularly well received.
Since it was published, people pointed out two mistakes I made:
Go has been able for a while now to export dynamic libraries. My knowledge was from before that time and I also got confused, thinking that you could not export C ABI functions at all, while in fact you can. That said, having a runtime still makes Go not a viable C replacement in the most direct sense of the expression.
Zig used to only support pointer arithmetic by converting the pointer to an int, applying the operation, and then converting it back to a pointer. Since a few months ago, [*]T (and related) started supporting arithmetic. That’s a pointer type that you don’t touch directly often, as you normally would use a slice (ptr + len).
What you mean is that Go has a garbage collector.
C has a runtime. It is called “the C runtime” and is traditionally abbreviated as “crt”. On Linux systems with GCC installed, there are files named “crt*.o” somewhere under /usr/lib that are part of the C runtime. This is distinct from and in addition to the standard C library (libc). If I compile the C program “int main() { return 0; }” using GCC, then I get about 2K of code and data, even though I’m not calling any library functions. This 2K of stuff comes from the C runtime. [However, note that I’m producing a dynamically linked executable. If I try using ‘gcc -static’ then I get an executable with 780K of code (it looks like glibc), and I don’t know how to make that smaller.]
Rust also has a runtime, even though the Rust-lang.org home page claims that it does not! If I compile the rust program “fn main() {}” (which references no library functions) then I get a static executable that is over 300K, and that’s due to the Rust runtime. Supposedly most of this is due to the standard panic handler. Here is some documentation about the Rust runtime: https://doc.rust-lang.org/reference/runtime.html, which says that the panic handler is part of the Rust runtime.
Zig seems like the best choice if you want to build static executables with a minimal runtime. I compiled “pub fn main() !void {}”, and got a static executable with 660K of code and data. Twice the size of the corresponding Rust executable. A lot of this runtime code seems to involve runtime safety checks and a panic handler. If I rebuild using ReleaseFast then I get 190K of code, which again includes a panic handler. If I rebuild with “zig build -Doptimize=ReleaseSmall” then I get a much smaller static executable with only 6K of code. I don’t know how to make C static executables this small (on Linux).
The greatest trick Unix ever pulled was convincing its programmers C doesn’t have a runtime.
I wasn’t sure you were THE Calvin, until now.
“The”?
There was a famous Calvin who spent a lot of time worrying about what tricks God was pulling.
I believe he was more interested in whatever tricks the Catholic Church was pulling to keep the Word of God from everyone. So quite apropos to the comment from our calvin.
Wait, does that mean Hobbes is here too?!
It also has green threads and a multiplexed IO library.
C only really lacks a runtime when it is used in freestanding mode, when there’s no stdio nor stdlib: no allocator, no signals, no main(), no exit().
yeah, I really don’t understand people that think it makes sense to downplay the fact that a language like Go can pause execution, realloc an entire green stack somewhere else and fixup all pointers, while being really fixated on crt.
This answer is like “dry cleaning actually uses liquids”. You’re correct in the strict sense, but also ignoring everything people mean by “having a runtime” in the common-yet-imprecise sense.
Runtimes of C and Rust (and probably Zig’s too, although I’m unsure about their async) are relatively small, non-invasive, and play nicely with other runtimes in the same process. These languages can produce static libraries that are easily usable in programs written in other languages. That’s generally not the case in languages that are said to “have a runtime”, in the sense that the runtime is substantially larger, more involved in execution of the program, and may cause problems if it’s not the only runtime in the process (e.g. if it needs to control all I/O, or track every pointer).
That’s due to the std library, which is linked by default if you’re compiling for a hosted target. It’s not part of the Rust language, which is why people say Rust doesn’t have a runtime.
A Rust program that just prints hello world is about 9K:
The Rust Runtime Environment is entirely optional. You can in fact compile a Rust program that does not reference any of std, alloc or core. You will be stuck with a very restricted environment (similar to what happens if you do this in C).
It should also be noted that when you simply compile a Rust program, the stdlib isn’t LTO optimized or otherwise shrunk down (loadbearing * here). You can disable that and only bring what you need. You can also disable the startup wrapper which handles some early init stuff, you can remove the default panic handler entirely and even disable the OOM handler.
Additionally, running in no-core mode will require you to implement a few core constructs the compiler is looking for yourself, since you’ll be missing quite literally everything that holds rust together (such as operators).
tcc is pretty good at producing small executables from C code
It sounds like you are describing gcc, not C in general.
Windows also has a C runtime – and worse, it was not distributed with the operating system!
It was called msvcrt.dll as far as I remember – Microsoft Visual Studio C runtime. I remember you had to copy it around to get some programs to work. This was over 15 years ago – not sure what the situation is like today.
edit: To clarify, C does have a runtime, but you don’t have to use it. Kernels and Windows user space don’t, but Linux user space does.
printf() is part of the C runtime, but the kernel doesn’t use it. It has its own string formatting routines.
Another consideration is that modern malloc()s have a lot in common with garbage collectors. They have some unpredictable performance characteristics, similar to a GC runtime.
That’s a fairly typical way for C compilers to link in the startup and exit code.
Recognizing that this is a tangent, how many C compilers are in your dataset to judge what’s “typical”? Do clang, tcc, kenc, cproc, lacc, and scc do this too?
I don’t know about hobbyist C compilers, but what do you think calls main? A crt0.o that contains things like _start is pretty standard for mainstream Unix C compilers.
Well there are only 2 mainstream Unix C compilers. So I guess by “pretty standard” you mean “universal”?
I’m not sure what you’re trying to say here. Are you implying that e.g. tcc, kenc, cproc don’t have any startup or exit code?
C programs expect to get argv, have stdout, and atexit working. These things are part of C and its standard library, and compilers need to insert code that makes them work.
I’m asking if @calvin is referring to the two mainstream Unix C compilers, gcc and clang, in which case his statement could be strengthened from “pretty standard” to “universal.”
Well, to be more precise, the crt (aka csu, C startup) usually belongs to the platform (libc) rather than the compiler.
So what were you saying is fairly typical?
I’m describing the situation when you use C to write programs that run under an operating system like DOS, Linux, Windows, MacOS, etc. If your C program has a function called int main(int argc, char **argv), then there is a C runtime that provides the operating system entry point and calls main. The ISO C standard calls this a “hosted execution environment”. The situation where C might not have a runtime is called a “freestanding environment”.
Thanks for clarifying the meaning of “runtime”; I was not aware it included things like startup and malloc.
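To make the hosted case described above concrete in Rust terms, here is a hedged sketch (details vary by platform): even if you suppress the language’s own main wrapper, the platform’s C startup code still provides _start, calls an exported main with argc/argv, and hands the return value to exit.

```rust
// Sketch: with #![no_main], rustc emits no Rust main wrapper, but on a typical
// hosted target the C runtime (crt0) still provides _start and calls this
// exported C-ABI main with argc/argv.
#![no_main]

#[no_mangle]
pub extern "C" fn main(_argc: i32, _argv: *const *const u8) -> i32 {
    // The C runtime passes our return value to exit().
    0
}
```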
Does Zig not model pointer provenance, then? There has been a lot of discussion in the Rust community about how pointer/int casts break the compiler’s ability to reason about provenance. As I understand it, you can have either pointer/int casts or pointer provenance, but not both.
That is not quite the case. There are several proposed provenance models for Rust, C and C++, all of which have to have some way to deal with integer/pointer casts.
This paper gives a good overview: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
tl;dr there are two categories of approaches: “PNVI” (provenance not via integers) and “PVI” (provenance via integers).
In PVI, integers carry provenance. So if you cast a pointer to an int and then back, you retain the entire chain of custody and the resulting pointer works exactly as the original did. However you now run into some tricky questions like “What is the provenance of a + b? Or a ^ b?”. This is (from what I can tell, but I’m no expert on this) what CHERI C does: their uintptr_t retains provenance, and they make some choices about what that means for integer math.
In PNVI, integers do not carry provenance, so you need another way to make pointer->int->pointer casts work. The current favorite seems to be “PNVI-ae”, for “address exposed”, where casting a pointer to an int leaks its provenance to “the environment”. Casting back to a pointer will look for an exposed provenance for that address and if there is one, you get that (and an invalid pointer otherwise). This avoids the tricky questions of PVI and allows a bunch of patterns that PVI doesn’t, such as the infamous XOR-linked-list. However, it is also extremely painful and slow to implement on platforms that physically manifest provenance such as CHERI. For regular optimizing compilers however, it’s not a big problem: their aliasing/provenance analysis already has to be able to cope with pointers that “escape” their analysis horizon, be it due to FFI, inline asm or (and yes, this is literally called out as allowed by the C standard) round-tripping pointers through the filesystem via fprintf("%p") and fscanf("%p"). PNVI-ae at worst inhibits their ability to optimize code that does a bunch of int/pointer casts.
Now for Zig, if they adopt these rules as-is, this might mean a more substantial pessimization if int/pointer casts are more idiomatic and used in many more places. If more or less every pointer’s address is escaped, you do lose most of the optimizations allowed by provenance.
There is no “Zig Memory Model” yet from what I can tell, but a bunch of discussion: https://github.com/ziglang/zig/issues/6396
Note for those unfamiliar with CHERI: this means uintptr_t is 128 bits large when holding a 64 bit address.
Also, for languages that don’t need to be backward-compatible with code that casts freely between ints and pointers, there is the option to disallow int->ptr casts completely, and instead require supplying provenance explicitly when you want to construct a pointer from an integer. E.g. new_ptr_with_derived_provenance = old_ptr.new_from_address(some_int).
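For comparison, a hedged sketch of how this style looks with Rust’s strict-provenance pointer methods (method names from memory; treat the exact API as an assumption):

```rust
// Sketch: derive an integer address and rebuild a pointer that explicitly
// reuses the original pointer's provenance, instead of a bare int->ptr cast.
fn rebuild(old_ptr: *const u8, offset: usize) -> *const u8 {
    let some_int = old_ptr.addr().wrapping_add(offset); // plain integer math
    old_ptr.with_addr(some_int) // provenance comes from old_ptr, not the int
}
```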
‘When all you have is category theory, everything…’
“…is an object in some category,” yes. Also, when all you have is a deductive system, you have a category. When all you have is a directed graph, you have a category. When all you have is an FSM, you have a category.
The original insight that led to this note was from Zelda 3. To traverse a superroom in that game, we traverse some of its rooms. To traverse a room, we start at a cutscene called the door animation, and end at another door. Rooms naturally form a graph and speedrunners rely heavily on the topology of that graph, particularly when playing randomized room layouts. So, one day when I was playing a randomized copy of Zelda 3, I thought about how a TAS script might automatically discover routes through the randomized graph, and realized that there is a complex interplay between the room graph and the item layout.
An odd comment. On aarch64, a xor-selfie is architecturally not allowed to be dependency-breaking. (Not technically true, but close enough, and unlike x86 there’s no reason for the microarchitecture to treat it specially.)
Another odd comment. Because:
More generally, I found the paper somewhat shallow. I would have liked to see an exploration of where the differences come from. The applications they considered are also quite domain-specific.
I completely agree with the four kinds of optimisation, though the first and second are often intertwined. Great discussion on the difference between best case and common case on your data - that’s often critical and is largely overlooked in university classes.
Option 4 is the most interesting one. Apple’s FSEvents mechanism scales a lot better than Linux’s inotify because it tells you only the directories that have changed, not the individual files. You can then look at the modification time of the files to see which have changed since your last epoch for that directory. This also has the advantage that it does some coalescing for free. I’d be inclined to subdivide these into two categories:
FSEvents is an example of the first case. You don’t care which files are modified because traversing a directory and looking at the modification times is cheap (or, at least, less likely to cause performance problems than blocking writes until an event notification consumer has caught up). In contrast, using lower-precision floating point for coordinates in a game because users won’t notice if an object is less than one pixel away from its ‘correct’ position is an example of the latter. The first has some overlap with choosing a better algorithm.
I think there’s a higher-level category, which is ‘reevaluate your requirements’. Some of the biggest performance gains that I’ve made have come from applying the techniques in the article to a system that was already carefully optimised. Between the time that it was written and the time that I’d looked at it, the requirements had changed significantly. Sometimes this applies to the input changing as it interfaces to a different system (you built a thing for processing SMS, now it’s used with WhatsApp, suddenly short 7-bit ASCII strings are replaced with longer UTF-8 sequences). Sometimes it’s that the hardware changed (globals got slower, passing arguments got cheaper, parallelism became the big win, cache misses got really expensive, branch predictor performance became critical).
Indeed. This is also, as I discuss here, an argument in favour of abstraction: maybe low-level requirements can be changed without affecting higher-level ones at all.
Thanks! And I agree that “allow some incorrectness” might better be thought of as “relax your requirements”.
If you’re increasing the bit depth of the single greyscale value 1234 5678, wouldn’t it be more accurate to turn it into 1234 5678 7878 7878 (repeating the last byte) rather than 1234 5678 1234 5678? That’s what my intuition says, but I don’t have a formal argument for it.
The three-value and four-value CSS hex color syntaxes are defined as repeating the first hex digit for each channel (RGBA). Color #fa0 is equivalent to #ffaa00, and #fa08 is equivalent to #ffaa0088. There is no CSS syntax that turns two digits for a channel into more digits, though, so we can’t compare CSS’s design to my idea above.
Why just the last byte, rather than, say, the last bit, or the last 4 bits? That seems quite arbitrary.
Consider what happens when we use your proposed mechanism to extend 3->6 digits (no meaningful difference between decimal and any other base here):
126 -> 126 666
127 -> 127 777; delta = 1111
128 -> 128 888; delta = 1111. so far so good
129 -> 129 999; delta = 1111
130 -> 130 000; delta = 1. uh oh
131 -> 131 111; delta = 1111
Now use the linked article’s mechanism:
126 -> 126 126
127 -> 127 127; delta = 1001
128 -> 128 128; delta = 1001
129 -> 129 129; delta = 1001
130 -> 130 130; delta = 1001
131 -> 131 131; delta = 1001
(Not so coincidentally, this mapping can be implemented as—or, rather, is rightfully defined as—a multiplication by 1001.)
1234 * 99999999 / 9999 = 12341234
In more detail: first of all, this is in decimal instead of binary/hex, for clarity. 1234 is a 4-digit decimal color, and we want to convert it to 8-digit decimal. Dividing by 9999 (the largest 4-digit decimal value) converts 1234 to a fraction between 0 and 1 inclusive. Multiplying by 99999999 (the largest 8-digit decimal value) converts that fraction to 8 digits. Though you need to do the multiplication before the division because of integer math.
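A hedged sketch of the same arithmetic in code (function names are mine, purely for illustration): the hex-digit case is just a multiply by 0x11, and the general case is value times new-max over old-max, with the multiplication done first to avoid integer truncation.

```rust
// Expand a 4-bit channel value to 8 bits by digit repetition: x * 0x11,
// i.e. (x << 4) | x, so 0xF -> 0xFF, 0xA -> 0xAA, 0x0 -> 0x00.
fn expand_4_to_8(x: u8) -> u8 {
    debug_assert!(x <= 0xF);
    (x << 4) | x
}

// General form: value * new_max / old_max, multiplying before dividing.
fn expand(value: u64, old_max: u64, new_max: u64) -> u64 {
    value * new_max / old_max
}

fn main() {
    assert_eq!(expand_4_to_8(0xA), 0xAA);
    // The decimal example from above: 1234 * 99999999 / 9999 = 12341234.
    assert_eq!(expand(1234, 9999, 99999999), 12341234);
}
```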
In CSS syntax, for RGBA, each digit in “#ffaa0088” is a hexadecimal digit (4 bits). The last byte is 2 of those digits.
In the article, for greyscale, each digit in “12345678” is a binary digit (1 bit). The last byte is all 8 digits. Repeating the last (and only) byte would be “12345678 12345678”.
Moonchild’s sibling comment is a good answer to the accuracy question. I wouldn’t have known myself! CSS colors are actually a good example of the proposed technique in action since each 4-bit hex digit gets expanded to an 8-bit channel intensity by copying. That could’ve been a nice lead to my article.
No comparison to copy elision / return value optimization, as far as I can tell. Is this really any better?
It supports dynamically-sized values like closures, vectors, strings…
The drawback I see is that the caller keeps the callee’s entire stack frame (not just the returned value) until it returns. And if it too returns a variable-size value, it doesn’t pop the stack either.
This seems like it could cause major stack bloat in some circumstances (especially with recursion…)
Can those handle variable-sized data?
No, but they can handle forwarding. If I have a deep call stack that returns an object from the leaf to near the root, this can be done without copies in C++ with guaranteed copy elision. The root caller allocates space, passes the pointer down the stack, and each return uses the same space. This cannot support variable-sized structures, but supporting variable-sized structures via any non-heap mechanism would be impossible for anything other than a one-deep call. This is an unfortunate programming abstraction because outlining a piece of code changes the semantics. In particular, if your language exposes object identity then this will make a bunch of optimisations visible to the programmer.
Putting variable-sized objects on the stack increases the amount of metadata that needs to be tracked. If you’re passing them up the stack with a bunch of copies and a load of metadata then you may well find that heap allocating is faster (if the object’s address never escapes, you can delete it without adding it to the GC).
See: lazy allocation (by baker, referenced by the linked paper) and stack-based allocation.
That Azul deck is a fun counterpoint to Appel’s “garbage collection can be faster than stack allocation”.
The stack-based allocation shown in the Azul slides doesn’t apply here — we’re talking about objects being returned from a function. Azul’s escape analysis (just like Go’s) would detect that the object’s lifespan exceeds the caller’s, so it would be forced to heap-allocate it.
Escape analysis is very simple and everybody does it. The slide deck describes a more sophisticated mechanism which can in fact cope with escaping objects.
Ah, you’re right. It just requires custom hardware to implement the necessary write barrier. That was in 2006 — I assume it never took off, at least not in any mainstream sense?
It didn’t take off, indeed. That’s not so much a technical indictment of it as an indication that external factors did not work out favourably. For instance, Cliff Click was also responsible for the C4 garbage collector, which Azul uses successfully to this day, but which did not start to enter the mainstream until Oracle decided to fund an OSS clone of it a few years ago (ZGC). There are various sociotechnical factors, pretty much all of them artificial, which disincentivise production of hardware GC support.
Like Haskell! Or Prolog!
Like Postscript!
Why not? Apart from IO, there are guards, which depend on ordering. The compiler can’t in general check if guards overlap.
If this is not execution order, what would you call it?
Well, the point of my comment is that SQL isn’t as special a language as the author seemed to think. In that sense, a couple exceptions don’t invalidate my point. I bet there are a couple places in SQL where subexpressions are guaranteed to be executed in some order too.
There’s a way of viewing Haskell where even do and guards don’t have “a notion of execution order”, though. The guard above is equivalent to:
And do is defined to be equivalent to (not sure if I’m getting this exactly right, but you get the idea):
In both cases, the required “execution order” can actually be viewed as purely functional code with nesting.
Oh ok, I understand. I suppose that the assertion is that if expressions don’t have side effects, and most things are expressions, then the language mostly doesn’t have side effects. I guess I misunderstood it because I think of expression evaluation as embedding an order of operations. Which is apparently where I went wrong.
Another way to think of it is in terms of equivalences. The blog post gave some equivalences between SQL expressions. In Haskell, these expressions are always equivalent, for any subexpressions X and Y:
This is unusual! Very few languages have these equivalences.
On the other hand, Haskell loses a lot of equivalences because of laziness and in particular the use of syntactic dependencies. (Fast and loose reasoning isn’t actually correct.)
I didn’t know about this. Could you give an example of an equivalence lost to laziness? And what’s a “syntactic dependency”?
Suppose we define:
In a strict language, this commutes (assuming acyclicality), but in Haskell, times Z bot is Z, but times bot Z is bot.
Ooh, interesting! Thanks for the example.
As someone interested in fast compilers, what are some good register allocation strategies for one pass compilers? I’ve been tinkering on an RV64 backend, but my register allocation is pretty primitive, as I don’t know the lifetime of variables ahead of time.
http://brrt-to-the-future.blogspot.com/2019/03/reverse-linear-scan-allocation-is.html
Ok, we can blame random() instead of the optimizer or the SQL language. But then what behavior should we expect from random()?
It wouldn’t be useful to permit all of these, so how do we define it in a way that’s both useful and permits optimizations?
I think the argument in this post is dubious at best. Obviously the optimizations that a database query engine performs should be semantics-preserving in the same way that compiler optimizers are. In the case of an ON clause in a join, the semantics are that the expression is evaluated on the cartesian product of the rows from the tables that are being joined. In the previous article, SQLite and CockroachDB optimize incorrectly because they assume that a function that doesn’t depend on the contents of one row is effectively constant for any row from that table. This isn’t true for random(), but I don’t think there are many functions that are weird in a way similar to random(), so it’s not a good example from which to generalize like this article does. The date_part() example is not an example of a side effect: it’s a straightforward dependency on the contents of a row, which is the thing that random() lacked, so it won’t break the push-down optimization in the same way.
If you push down date_part you might evaluate it on rows that it would not get evaluated on if you do not push it down because the other side of the join might be empty.
Oh right, now I get the point about empty joins - I thought it was performance, I missed the effect on correctness. There are lots of partial functions … I am dismayed that the Postgres docs for mathematical functions don’t appear to say what happens if you divide by zero or take the logarithm of a negative number (maybe I am looking in the wrong place?). Anyway, I was wondering if they raise an error or return NULL. In the context of a join, it would make the planner’s job easier if errors became NULL; maybe there’s a wrapper function to NULLify errors.
They raise errors. You could easily write your own safe_div and safe_log functions, and more importantly declare them immutable, the same as would be for the PG-native versions, but in a where context you could even more easily check in another predicate before proceeding to divide or take the log, since ((x <> 0) and (n / x > 1)) only needs to fail the first branch.
If the semantics aren’t defined (different databases don’t agree on them), what’s there to preserve? This is essentially undefined behaviour. It’s not as harmful as UB in C, but very similar in that the optimizer can produce different results, because the results you got the first time around weren’t guaranteed anyway.
Similarly, if you use LIMIT 1 without an explicit ordering, you might get different results at different times, even with the same query and the same data, because the optimizer might select a different plan (after a VACUUM or an ANALYZE, for example).
One could say these are cracks in the declarative facade that databases try to put up. The abstraction leaks and the underlying implementation/execution strategy shines through.
You can have a nondeterministic semantics. That is a perfectly fine thing to do. The specification describes a range of legal behaviours and intermediate states, instead of a single legal behaviour; and a well-formed program should ensure that, under all legal behaviours, it ends up in the same final state (or, rather, that all allowable final states maintain application invariants; for example, two newly added rows could get added in different orders and so have different ids, but that probably doesn’t matter).
I genuinely think it’s clearly defined as-is, per my long post elsewhere in the thread.
I guess my point there, that I should have made more explicitly, was that I think answering this question in a coherent way is incompatible with a declarative language. Or at least the ones we are stuck with today. I could imagine a language designed to have a sane answer to this question and still have a lot of the good properties that a language like SQL has, but I don’t know what it would look like and I’m not sure SQL is it.
Yeah, I agree random() doesn’t fit well into the language. :(
But I bet you could rescue it! You could find a definition that explains the behaviors people expect, but is still declarative.
For example here are some things I expect about random():
So you could define random() as a function that takes a whole bag of rows, and marks each row with a value such that the overall distribution comes out right. Or you could think of it as a function that maps each unique row ID to an independent random value–even though these dependencies aren’t visible in the user syntax.
Pretty wild read. It’s not often a GPU benchmark leads into electronics diagrams.
Most benchmarks are not very good!
My vague recollection is that Windows divides cores into groups of 64 so that it can represent an affinity set with a 64-bit integer, so if you want to use more than 64 cores at a time, you have to do something special. If that’s true, then it is sort of the OS’s fault.
Does anyone have a strong feeling about this actually improving security? I’m seriously skeptical, but I haven’t given it a ton of thought. Any exploit devs care to comment?
Well, the obvious answer is that it would lower the surface area a little bit, right? Instead of worrying about libc and syscall(2) you just worry about libc. Whether that improves security I don’t know but considering the resources OpenBSD has, it is one less thing to worry about or deal with.
I feel the security benefits are theatre, but it enables better backwards compatibility in the long run - Windows and Solaris for example, have had good dynamic library based compatibility for years. Of course, OpenBSD doesn’t care about that part…
If you can’t run software written in one of the most popular memory safe languages anymore, then maybe that would be bad for security.
The Go port will be fixed as it was before. How do you draw the conclusion that Go isn’t going to be supported? You didn’t read the post.
Right but I’m trying to understand if this mitigation would actually make exploitation of an existing vulnerability difficult. It feels like a mitigation without a threat model.
Go read up on ROP and stack pivots, especially on amd64 and variable-length instruction architectures that make it impossible to completely remove ROP gadgets. There are very clear threat models already defined based on arbitrary code execution, especially remotely. Reducing the syscall surface area as much as possible minimizes the success probability.
I’m surprised no one has yet decided to use a separate stack for data, especially on x86-64 with more registers, a register parameter-passing paradigm and a larger memory space: leave RSP for CALL/RET and use another segment of memory for the “stack frame”. That way, overwrites of the data stack won’t affect the return address stack at all. Given how fast 32-bit systems have largely disappeared on the Internet, I think such an approach would be easier (and faster) than all the address randomization, relinking, stack canaries, et al.
Or (as painful as this is for me to say), just stop using C! Has anyone managed to successfully exploit a program in Rust? Go?
A similar feature is called “shadow stacks” - return addresses get pushed both to the standard stack and a separate return stack, and the addresses are checked to match in the function epilogue. It’s supported in all the big C compilers. I can’t speak to how often it’s actually used.
Further afield, Forth also exposes fully separate data and return stacks. So it’s been done.
As far as performance goes, you’re losing an extra register for the other stack, which can be significant in some cases, and also memory locality. Cost varies but has been measured around 10%.
https://clang.llvm.org/docs/SafeStack.html
In addition to the safe stack / shadow stack work, it’s worth pointing out SPARC. SPARC had a model of register windows arranged in a circle. You had 8 private registers, 8 shared with the caller and 8 shared with the callee (you could reuse any caller-shared one you weren’t using for return, and all callee-shared ones between calls). The first S in SPARC stood for ‘scalable’ because the number of windows was not architecturally specified. When you ran out, you’d trap and spill the oldest one (you should do this asynchronously, but I don’t believe the implementations that shipped ever did). This meant that the register spill region had to be separate from the stack. This gave complete protection from stack buffer overflows turning into ROP gadgets.
Pure software variants have been tricky to adopt because they’re incredibly ABI disruptive. Anything that creates stacks needs to allocate space. Anything that generates code needs to preserve an extra register (not just across calls but also when calling other functions). Anything that does stack unwinding needs to know about them.
If you’re compiling everything together and static linking, it’s feasible.
I know about the register windows on the SPARC, but I never really dove into how it interacted with the operating system with regards to context switches (process or threads)—it seems like it could be expensive.
Switching threads was very expensive. A large SPARC core could have 8+ register windows, so needed to save at least 64 registers. That’s fairly small in comparison with a modern vector extension, but still large.
On later superscalar designs, it actually wasn’t that bad. Modern processors allocate L1 lines on store, so spilling a full cache line is quite cheap. If you’re doing this often, you can even skip the store buffer and just write directly from registers to the cache line. I think switching from a thread required spilling all used register windows, but resuming a thread just required reading back the top and then the others could be faulted in later. SPARC had very lightweight traps for this kind of thing (and TLB fills - 32-bit SPARC had a software-managed TLB, though later SPARCs were spending 50% of total CPU time in that trap handler so they added some hardware assist with 64-bit versions).
I think the biggest mistake that SPARC made was making the register window spill synchronous. When you ran out of windows, you took a synchronous fault and spilled the oldest one(s). They should have made this fully asynchronous. Even on the microarchitectures of the early SPARCs, spilling could have reused unused cycles in the load-store pipeline. On newer ones with register renaming, you can shunt values directly from the rename unit to a spill FIFO and reduce rename register pressure. I think that’s what Rock did, but it was cancelled.
ROP isn’t a statistical attack, so this talk of probability is confusing.
Have a look at Blind ROP: https://en.wikipedia.org/wiki/Blind_return_oriented_programming
When you don’t have complete information of the running program, these automated techniques will operate with a probability of success or failure.
But nothing about this mitigation is unknown or randomized further, as far as I can tell. I don’t see how brop is important here or how it would be impacted by this. Maybe the attacker needs to be a bit pickier with their gadgets?
Any ROP technique needs to find and assemble gadgets. This would remove one possible type of gadget, making it harder to achieve arbitrary syscall execution, especially in light of other mitigations like pledge(2) or pinsyscall(2).
Assuming they do this to all interesting syscalls, it would make shellcode writing a bit more painful, as now you actually have to deal with ASLR to find the libc versions. That said, ASLR isn’t a significant barrier in 99% of cases, so it’s not going to stop targeted or skilled attackers. However, it seems it would also disallow statically linking libc, which is a huge con for such minor gain IMO.
Disclaimer: It’s been almost a decade since I’ve attacked an OpenBSD machine on the job, so there may be additional protections I’m not aware of that make this change a more valuable protection.
FWIW, OpenBSD relinks libc (and OpenSSH, and the kernel) on each boot. So defeating ASLR on OpenBSD may require more than finding one offset.
This might be true in the browser (I have no idea), but in general it is not; you can run full-on C and C++ on the GPU, and fine-grained concurrency and coordination are an important part of some GPU algorithms (such as scans).
Related: Linux on GPU shaders https://github.com/mildsunrise/cursed_gpu_linux
ugh, so many footguns. I’m not a Go user, and the more I read about it the more likely I’ll stay away from it.
This didn’t even list my favourite Go footgun. Updates to fields of slice types are not guaranteed to be atomic and it is undefined behaviour to write to them concurrently from two threads (goroutines that are scheduled in parallel). If you do, a reader in another thread may observe the base of one slice and the bounds of another. If you are in an environment where you’re allowed to run arbitrary Go code but not the unsafe package, you can do this intentionally to escape from the Go memory safety model. If you’re not writing malicious code, the fact that Go’s type system has no notion of linearity means that you have to be very careful to not accidentally alias objects between threads and trigger this kind of thing intermittently, in a fashion that’s almost impossible to reproduce, let alone debug.
It’s a deeply special kind of language that manages to have a global garbage collector and still not be memory safe.
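For what it’s worth, the two-word fat-pointer layout behind that tearing hazard is easy to see from Rust (a hedged aside; the thread is about Go, but the layout point is the same):

```rust
use std::mem::size_of;

fn main() {
    // A slice reference is a (data pointer, length) pair: two machine words.
    // An unsynchronized concurrent write to such a value could in principle
    // mix the pointer of one slice with the length of another, which is why
    // Rust simply refuses to share unsynchronized mutable state across threads.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());
}
```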
Holy crap that’s dysfunctional! Why don’t they wrap it in a critical section or something to avoid this kind of bug?
Because it would be expensive and they can’t be arsed, the Go people are fundamentally C people, it’s the fault of the developer if they fall into the language’s traps.
Slices are not the only affected type either, interfaces and maps also suffer from data races for sure. Generally speaking data races undermine memory safety and there’s a fair number of possible ways to get data races in Go: https://www.uber.com/blog/data-race-patterns-in-go/
Interestingly, just like java which has race conditions that can break the String class: https://wouter.coekaerts.be/2023/breaking-string
The Go memory model is that everything is thread unsafe unless you put it behind a mutex or a channel. Why would slices be different from everything else?
That something is not thread safe does not mean it breaks memory safety.
For instance, as far as I understand, while Java won’t protect the program from data races, the JMM guarantees memory safety in the face of data races.
Okay, you can read a crap ton of writing by rsc if you want his take on this. https://research.swtch.com/gomm is the relevant part of a 3 part series.
That does not contradict my comment in any way. And the only thing of interest in that essay is that rsc is self-servingly contradictory:
is immediately followed by
which makes it completely moot.
I mean, it’s in the middle. It’s not classic UB nasal demons, but it could be a cause of data corruption. Maybe it’s a bad choice, but that’s what he chose and you can read his blog series to try to work out why he chose it.
I guess I’m just saying, it was a deliberate engineering tradeoff not “dysfunctional” because they didn’t “wrap it in a critical section or something”.
It’s funny how he first pays extensive lip service to Tony Hoare’s philosophy, especially this one:
only to then cheerfully update the docs to explain that you can get arbitrary memory corruption if you “misuse” the program (well, language).
Arbitrary memory corruption is waaay into classic UB nasal demon territory.
This is a nasal demon:
All loops must terminate. Therefore we can assume this loop must terminate. Therefore we can rewrite the loop to access *p or *q before the loop happens as an optimization. (But what if it’s an infinite loop? Well, that’s UB, so we can assume it won’t.)
Go is not any less safe than C/C++ and it specifically rules out some of the UB “optimizations” in C/C++ that give UB a bad name. So, slightly safer than C/C++, less safe than other languages.
I also think “wrap it in a critical section or something” is really breezing past how difficult this would be. Every slice/interface/map would need some kind of mutex or the type system would have to be radically different to prevent aliasing. You’re either talking about a huge GIL style performance hit or a totally different language with a much stronger type system.
I doubt it would be a “huge GIL style” performance impact - it’d be a mutex per slice, not a global mutex over all slices. There shouldn’t be much contention on these mutexes if you’re using it like “you’re supposed to”, anyway!
It seems even these days “it’s not fast enough” is still sufficient argument to avoid important safety features. Which is strange, because runtime bounds checking is part of Go. That’s also quite a big performance impact.
I guess it’s just a matter of time before someone opens a can of CVEs on some large Go codebases, and then we can have this whole discussion again.
Performance. Assigning a slice-typed variable is a common operation. If you had to acquire some kind of lock every time that you couldn’t prove non-aliasing then it would slow Go code down a lot. As @masklinn says interface-typed fields in Go are a pair of a pointer to the object and a pointer to the type, so it’s possible to get type confusion in these by racing and reading one type and the other value.
For maps it’s somewhat more excusable. A map is a complex data structure and updating a complex data structure concurrently is a bad idea unless it’s explicitly designed for concurrency. I think the map implementation used to be in C (special, Plan 9-flavoured C), but it might be pure Go now. If it is, then races there should just leave it in a broken state (just as updating any other non-concurrent complex data structure with data races can), rather than break the fundamental guarantees of the environment.
It’s far less excusable for interface pointers and slices because these are value types that are carefully designed to look like they are primitive values. You pass them around just as you would an integer. If two threads write to the same integer variable at the same time, one will win the race and you’ll see a value that makes sense. This is not the case with other Go types.
The depressing thing is that a type system that understands isolation can address this. When I write parallel code, there’s one rule that I want to follow: no object is both mutable and aliased between threads. Go provides absolutely nothing in the type system to help you spot when you’ve broken this rule. For a language that was designed from the ground up for concurrency, this is inexcusable. This is probably why most of the successful Go programs that I’ve seen use it as statically compiled Python and don’t use goroutines at all (or in a handful of very special cases).
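As a hedged illustration of that rule (“no object is both mutable and aliased between threads”) expressed in a type system, this is roughly how Rust forces the sharing to be explicit:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared *and* mutable is only allowed behind a synchronization type.
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));

    let handle = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            shared.lock().unwrap().push(4);
        })
    };
    handle.join().unwrap();

    // Handing a plain `&mut Vec<i32>` to the spawned thread instead would be
    // rejected at compile time; that's the aliasing check the comment above
    // says Go lacks.
    println!("{:?}", *shared.lock().unwrap());
}
```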
I learned Go from the Go Tour back in ~2011 or so; IIRC, slices and interfaces were explained as being fat pointers or tuples, so I’ve always thought of them as such rather than thinking of them as integers. As a result, I’ve never really run into these problems. I’m very curious how often people are running into this? One of the things I like about Go is it’s pretty straightforward how things work, so you can intuit about stuff like this. I suppose if someone was writing Go like many people write Python or JavaScript–with no idea about the underlying machinery–this might get people into trouble, but on the other hand I don’t know how you can write Go without understanding some basics about memory layout, pointer traversal, etc. Maybe I’ve just been doing this for too long to empathize well with beginners…
How often is that? Go should be in a pretty good position to reason about aliasing.
I agree with you about the way Go should have… the thing Go should have done. But it would probably be more on-brand for them to fix this by designing atomic slices that avoid atomic operations until they are actually contended. Do we know if they’ve tried that?
Why? The type system does not give the compiler any information that it can use to make that kind of decision. If the slice is a local and has not been address taken, it can be assumed to be safe. In pretty much any other situation, the compiler has to assume that it can have escaped to a concurrent context.
I think they were very reluctant to introduce atomics at all, they certainly don’t want more. They want you to design code where objects are held by a single goroutine and you never do racy updates.
TBF in most cases slices are passed by value, in which case there is aliasing on the backing buffer (and there can be data races on that depending what it stores), but there’s no aliasing on the slice itself. Most issues would occur with slices getting captured by a closure or go statement in which case they essentially “had their address taken”.
A bigger issue, I would think, is that you’d need to add a tripleword pseudo-atomic which pretty much means you need a lock (interfaces are only a doubleword so it’s a bit better). And while in theory you could use the low bits of the pointer as your lock flag I’m not sure there’s such a thing as a masked compare exchange not to mention a sub-byte futex / mutex?
Why would you need such a thing? You can implement arbitrary RMW ops (on a single word) with cmpxchg.
Because if you want to smuggle the lock in a pointer you need to test and (un)set a single bit inside a value you don’t know.
Cmpxchg would require changing the structure of the slice to add a new member, at which point you might as well have a normal lock.
Sorry, I still don’t understand. I have used tagged pointers with CAS many times and I don’t see the problem. Find an unused high or low bit in the pointer, save the initial value of the pointer, mask and test the bit you care about, and if the test passed then set/clear that bit and CAS the old value to this new value. Depending on the context, if the CAS fails then either abort or keep retrying (maybe with backoff) until it succeeds.
Recent revisions of x86 and arm have fast two-word atomic reads and writes (avx and armv8.1 respectively). But more obscure architectures do not, so there are tradeoffs w.r.t. performance portability.
Sorry, for posterity, it is armv8.4.
Because accessing anything across threads is already undefined behavior, and your idea would murder the performance of correct code for no real reason. Writing correct code is in no way difficult, and if you do happen to slip up, that’s why your app has a comprehensive test suite, and why you run go test -race in CI, which puts everything into “I want to be painfully slow” mode, but bombs as soon as you have a single cross-thread access without a synchronization edge.
If I want “arbitrary memory corruption”, I already know where to go for that. Do you really want memory-unsafety in a language that is marketed for externally-facing web servers? Java demonstrates that you can allow data races without compromising the integrity of the runtime itself.
I’ve been working with my current company and doing Go for about 9 years now. We’ve written several nontrivial services, a number of which handle more than 10k RPS. We’ve had zero production issues caused by data races, and… maybe two times in those nine years that someone wrote a race bug, which was caught by automated testing before it made it out the door. It’s not high on my list of concerns. The kinds of access patterns that could even theoretically run into this problem just don’t exist in our code, because people who understand the language have no inclination to write them.
I don’t know the specifics of this issue, but I do know that you’re not supposed to share a variable between go routines like that. If two go routines must work with the same data, you’re supposed to let them communicate it through a channel.
Whether that means it is OK to leave in a footgun like that is a different matter. But this is one of the many “problems with Go” that I somehow magically never encounter in real life.
I don’t see “so many footguns”. I see two things. The bit about append is something you learn in the first 30 minutes of using the language. And the other thing… I don’t even know. I can’t figure out what he thinks he’s doing there, or trying to prove, or what it might have to do with anything that might happen to anyone in the real world, because the presentation of the concept is basically nonexistent. And the “advice” about constructing slices is pure nonsense; there’s no justification given for it because none is possible.