I am fascinated at people who are otherwise security conscious thinking that Google would ever call them. All of these stories, the step 1 is always just such a red flag for me.
I don’t get how you fall for this yet also write up a whole thing like this with all the “right” details. Being able to do the latter seems like something that would preclude you doing the former
It’s like that Mike Tyson quote: “Everybody has a plan until they get punched in the face. Then, like a rat, they stop in fear and freeze.”. It’s just different when it’s actually happening to you, instead of you calmly reading someone else’s experience.
People get flustered, or they start panicking about the “crisis”, or their instinct to not be a nuisance kicks in, or they can’t effectively split their attention between the safety checklist in their head and what’s being said to them (which they fear might be a genuine problem).
A well-executed scam & scammer will also take advantage of anything they can to build trust. We see in this story that the author ran a number of checks, which the scammer passed to the victim’s satisfaction. That leaves the victim uncertain and vulnerable to the scammer steering them through key points where the victim could break off: the author thought about hanging up and calling back, and chose not to.
This is not the first time I’ve read somebody who should know better get (almost) taken in. A scammer can weaponize your intelligence and education against you, because the real vulnerability is the human factor.
It’s just different when it’s actually happening to you, instead of you calmly reading someone else’s experience.
Another thing that could happen to you is a temporary health-related effect on your cognition. I was successfully led on by the unfortunate combination of “almost recovered from a flu” and “creating an account on a new service that handles financial transactions”, but thankfully my bank’s anti-fraud system blocked the $1000+ transaction. The next day I wasn’t even sure what I had been thinking; it was so f- clear.
You can also have side effects from some medicines, not to mention alcohol.
Smart people can still be manipulated. I think these attacks get a lot of mileage out of the initial adrenaline rush you get when you think you might be being hacked.
Something that stuck out to me was that the author looked for reasons to buy the scammers’ story, rather than reasons not to. There’s a case ID, it’s from a google.com domain, the number is listed on a google.com page—but if you operate in that direction you stay on the scammers’ happy path, so to speak. I think we need to do a better job here teaching people to be suspicious.
But here’s the thing. If it had been FedEx calling you asking for import duty payment, they’d have behaved in exactly the same way, down to the obscure help pages and the dodgy URL you’re expected to type sensitive information into. I think any reasonable level of suspicion would force you to classify their communications as scam attempts.
The author mentions the “best practice” of verifying the phone number, but gets it wrong. You don’t verify the phone number. You call the company using a phone number you looked up yourself. (If you’re on a landline, call someone else first). The point here is not to verify anyone’s identity—the FedExes of the world impersonate scammers so effectively as to make that impossible—but to establish a new line of communication that you can trust from first principles.
Yes, caller ID is meaningless. The author actually got to the point of doing the right thing (calling back) but didn’t. It’s a good example of how everyone has blind spots — someone who knows how to check DKIM but trusts caller ID!
I think we need to do a better job here teaching people to be suspicious.
Which is kinda funny, because in modern tech orgs the Security team is often seen sending out test phishing attempts to see if employees fall for them, and then rolling out mandatory security training courses for all employees, and this approach is apparently widely hated.
It’s hard to put it all in one category. That kind of training can vary widely in quality. It also depends on what follows: is it a finding for IT to work on, or a scapegoating opportunity for management?
You probably (and hopefully) don’t intend this as hurtful or shaming, but reactions like this are precisely why stories like these often remain untold, allowing scammers to stay under the radar longer.
Almost no one is inherently stupid but everyone does stupid things occasionally.
There’s a lot of psychological bias at play here. When it happens to you, it’s bad luck, or maybe you weren’t thinking clearly for some reason. When it happens to someone else, it’s easy to think they’re stupid. But I don’t think that’s always valid. I consider myself pretty intelligent, but I’ve done some pretty stupid shit – usually when tired, emotional, or distracted. Scammers are aware of this and exploit it by creating distractions. Telling someone their Google account has been compromised can push them into a panicked “I need to act now” state.
Sorry, I was too mean in the original post. Appreciate the message on this.
My thought was a bit more about how the specific knowledge that leads you to do the whole “look up the phone number and ask them questions” thing … it feels like that same knowledge should include knowing that Google would never call you unprompted, like ever. And perhaps “I almost fell for it” is stylized, given how much suspicion they seemed to have from the get-go.
And hey, “Google will never call you”… but …. maybe this one time is true! It’s an easy thing to think.
I am fascinated that I have on several occasions been phoned by Microsoft Support. The real Microsoft Support, no scam involved. I sent in a support question using their form, asking for how to do something with appsource, and I thought it annoying that they required my phone number and then not much later some guy with an Indian accent actually called me and told me that sorry we don’t have that feature and I apologize so very much but here are some workarounds. You’d think Microsoft would stop doing this kind of thing to avoid people getting tricked into thinking it’s them when it’s the scammers.
Our default human instinct is to trust that people are telling us the truth. Unless we are already primed, thoughts of deception are usually a secondary follow-on thought.
It’s a lot like cults. Everyone thinks “I’d never join a cult”, but the truth is that the potential exists in everyone. This is an always-open attack surface, for even the most security savvy tech person, and if we’re honest, the sophistication of these attacks has really ramped up in the last few years. The “herd immunity” has not really caught up.
The part I take exception to is the line “[I followed the best practices] of verifying the phone number”. The best practice is not to verify the phone number. The best practice is to not take action on any contact you yourself have not initiated, and this difference is why this measure failed to protect him. If he had ended the call, called the verified number*, and then gotten agents to look up his ‘case’, that would have revealed the deception.
*(yes, I know that getting a hold of google is difficult)
One thing I’ve never understood about 4NF and I can’t find any good explanation: How is it not slower than using an array-typed column? When you want to find out what skills a given employee has, with an array-typed cell you can just look at the value, whereas with 4NF, you have to do a linear search over all pairs of (employee, skill). Tutorials always say “don’t worry about performance”, but don’t explain why.
The explanation is very simple: indexes. In the example above, the actual SQL table would look like this:
CREATE TABLE instructor_skills (
    instructor_id INTEGER NOT NULL,
    skill_id VARCHAR(24) NOT NULL,
    PRIMARY KEY (instructor_id, skill_id),
    KEY (skill_id)
);
This sets up two indexes (one underlying the PK, and an explicit one on skill_id) that together handle all the necessary searches/joins, and the primary key additionally enforces uniqueness of each pair.
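To make that concrete, here is a rough sketch of the two lookups against the table above (the values are made up); both become index seeks rather than scans:

-- "what skills does instructor 42 have?"
-- served by the (instructor_id, skill_id) primary key index
SELECT skill_id FROM instructor_skills WHERE instructor_id = 42;

-- "which instructors have skill 'welding'?"
-- served by the secondary index on skill_id
SELECT instructor_id FROM instructor_skills WHERE skill_id = 'welding';

So the “linear search over all pairs” never actually happens at the storage level; the B-tree walks straight to the matching range of rows.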
I hope @lproven will arrive with references but I am inclined to share my impressions.
The two earlier database models were hierarchical and network. I don’t have a good example of a “network” style but I believe COBOL’s records or the Mumps data structure are decent representatives of the hierarchical model. Even once relational databases started to exist, at the dawn of the PC era they were too heavy for practical use on small PC machines, so early consumer databases like dBase were heavily record-oriented. As a result, it wasn’t uncommon for “database” or “file” or “table” to be somewhat synonymous, and having structured or semistructured data embedded in a field happened frequently. The usual way of working with these early databases was to iterate records and do something with each of them. This is also the era in which the meme “joins are expensive” gets started.
I wouldn’t be surprised if a lot of weird non-relational ideas, such as those presented in the article as “non-4NF” relations, were carryovers from this era. The assumption wasn’t that you were writing SQL queries, the assumption was that you were iterating records and building a report by hand. A new file/table may have been very expensive or computationally difficult to work with, and even when that ceased to be true, the notion carried forward.
One of the first things we learned in RDBMS class was that tables are not inherently expensive, and joins can be cheap. You need to start with this mentality or you’ll produce really weird non-relational designs.
I wouldn’t be surprised if a lot of weird non-relational ideas, such as those presented in the article as “non-4NF” relations, were carryovers from this era.
There definitely seems to be some kind of “The Beatles are overrated” effect going on, where 4NF was so successful that it became the default, and it’s really hard to imagine the things it was a reaction to, and why anybody ever thought they were a good idea.
I’m not old enough to remember the time before relational databases, but I suspect there was a long time when a “database” meant a single table, and even when a database supported multiple tables, there might have been a time when having an entire table with just two columns felt frivolously small. I’ve definitely worked with CSV files that had almost identical data on consecutive rows with just one field changing, to represent a list of values. And I’ve worked with databases that put a comma-separated list of values in a VARCHAR field. Those structures must have come from somewhere.
I can’t substantiate this, but I believe they come from a culture created by databases starting out as spreadsheets (literal paper ones, or digital). Shoehorn that into a file, and you now have historical reasons.
I am not sure that’s entirely true. The way this weird composite design is supposed to work would not even look useful as a spreadsheet (or as a hand-written announcement on a piece of paper, or whatever). It is deeply weird.
I was thinking about how much effort it would take to just write the correct code that kept the data coherent in Kent’s five variations (mentioned in the article), and the number of cases to handle is impressive.
This is of course a purely intellectual game now, like Brainfuck programming, but I’m very much interested in dismantling this, because students’ time should be put to better use…
I agree with the skepticism but will note how very many little extra data-coherency-preserving steps accumulate over the years in paper/simple-digital processes, the kind where a new hire spends weeks learning all the little process intricacies specific to that role.
To me that “convoluted design” looks exactly like what you get when people computerize an index-card-based physical archive. The physical archives that I have seen (granted, only one, but the people maintaining it had experience with several) were index cards with a set number of columns, some of which contained potentially repeated values (think key-value pairs with repeated keys). If a card became full, a new one was created which continued the “record”. When all of the data was stored in physical archives and digital databases were a new and unknown thing, it does indeed make sense that the first idea was to just replicate the existing data as it was.
Now, digitizing that kind of archive is a surprisingly difficult task. The data entered, and the digitized version of it (via OCR or by hand), will contain errors, because of the human operators and because there are no hard database constraints. So going straight to a schema with strict constraints is practically impossible. And when you haven’t seen or heard of strict-schema databases, imagining such a thing requires you to be quite a visionary, and even then the practicality of such a thing sounds dubious at best.
I believe that the common advice of “always store time in UTC” is a bit overplayed. There are many cases where you actually need to store local time (AND a corresponding timezone name), basically whenever you handle any time in the future. Only rarely do you actually need “now + X * 86400 seconds”; most of the time it’s “October 10, 2030 at 20:00 German time”.
I would say that storing this as an integer, 20301010200000, would even make some sense, because it discards any implicit time-related assumptions.
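A POSIX-flavoured sketch of the “local wall time plus IANA zone name” approach (the zone name and date are just the example from above): the point is to resolve the absolute instant as late as possible, so later changes to the zone’s rules are picked up automatically.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* What gets stored: the wall-clock fields and the zone name. */
    const char *zone = "Europe/Berlin";
    struct tm event = {0};
    event.tm_year  = 2030 - 1900;   /* October 10, 2030, 20:00 local time */
    event.tm_mon   = 10 - 1;
    event.tm_mday  = 10;
    event.tm_hour  = 20;
    event.tm_isdst = -1;            /* let the tz database decide DST */

    /* Resolution happens only when an absolute instant is needed. */
    setenv("TZ", zone, 1);
    tzset();
    time_t instant = mktime(&event);
    printf("currently resolves to UNIX time %lld\n", (long long)instant);
    return 0;
}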
I rarely store timestamps in the future. I cannot recall the last time I needed that. And for times in the future (like calendar events) I agree, but needing to store such a timestamp is a super rare case.
Leap seconds are another reason not to store future times in UTC, though it seems those might be finally going away. UTC always for past times is probably very reasonable, though.
UTC is for timestamps. When something actually happened.
Rarely for things like “expire in 24 hours”.
I mean, we all understand how many times people have been burned by timestamps stored in local time, so this somewhat overzealous advice seems like a valid precaution.
You can safely ignore leap seconds in basically all cases, e.g. UNIX time does not include them. A day in UNIX time is always 86400 seconds, and in practice clocks are adjusted around leap seconds with leap smearing, or just setting the clock back 1 second at the arranged time. If you actually care about highly precise timing, you have to use specialized libraries that are aware of leap seconds. Most things ignore them, and mostly 1 extra second of real time here or there doesn’t matter.
That 1 second matters even less when you consider that all clocks you have access to through conventional APIs are some degree of wrong—either non-monotonic, unsteady, or both.
That said, I agree UNIX time shouldn’t be used for future dates/times that have any calendar semantics.
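A tiny illustration of the 86400-seconds-per-day point (the specific timestamps are just well-known values around the 2016-12-31 leap second):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* A leap second (23:59:60) was inserted at the end of 2016-12-31,
     * but UNIX time pretends it never existed: consecutive midnights
     * are still exactly 86400 apart. */
    time_t dec31 = 1483142400;  /* 2016-12-31T00:00:00Z */
    time_t jan01 = 1483228800;  /* 2017-01-01T00:00:00Z */
    printf("%.0f seconds\n", difftime(jan01, dec31));  /* prints 86400 */
    return 0;
}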
The only way to avoid leap seconds is to use TAI internally, and keep a separate table for the UTC offset (kind of like timezones). Not sure what applications need second-level accuracy in the mid-to-long future, though.
Epoch time is just another representation of UTC, as far as I’m aware. So if you’re storing future dates then you have the downside described in the OP.
I worked on some open source package because I wanted some functionality for my own project, which is not open source but also not commercial. But as I started contributing, I found a lot of friction in setting up the development environment. So much, in fact, that I felt that just powering through would be too stressful.
So I spent several months fixing the onboarding experience: upgrading to a newer generation of packaging tools, adding proper diagnostics that are only relevant on a clean machine, fixing READMEs, etc., etc.
In parallel I submitted my own changes and I think that both activities were reinforcing each other: I’m not sure if just my own changes would be accepted as readily as when I already demonstrated proof of effort.
My point is that IMO it’s a very good focus point for an experienced dev who is not bound by deadlines etc. Work as a sort of icebreaker ship: reduce the barrier to entry somewhat, submit incremental changes. Anyone can add new features, lol; and fixing bugs is usually driven by the people affected by the bug. Fixing DX is often the activity left in deficit.
Everyone who’s a big fan of these things talks about how you should spend a few hours (8-10 I think?), so I’ve spent about that much time really trying to get a feel for them and I think they’re mostly bad.
Occasionally there’s been some utility in using them to generate shell scripts or simple fragments of Python programs, but there has never been a net benefit in terms of time saved for me when you account for hallucinations sending me off on wild goose chases.
One thing I did realize recently, though, is that the hallucinations end up being way worse than wrong StackOverflow answers simply because of how much confidently incorrect supporting documentation gets generated alongside them.
On occasion I’ve pasted each version in sequence into a Gist and clicked save just so I can use the “revisions” tab in the Gist interface to get cheap diffs against them. I wish Artifacts had that as a built in option (same request for the new OpenAI Canvas feature, which also lacks visual diffs).
I think it basically comes down to having tools/technology that makes adding a new service very low-friction. And by that I mean a new service just gets a huge amount for free: database/queues are available, observability systems are available, tooling for deploying/scaling are available, etc. Scaffolding service boilerplate is just a single command.
There are pros and cons of this architecture of course, but I think the “complexity” argument that is commonly levelled against this architecture is heavily mitigated by using highly consistent technologies (and something we are able to retain by running migrations in the way I describe, where we step from one consistent state to another).
Are those 2800 different microservices or just instances of the same few hundred microservices? A simple search tells me that Monzo has about ~300 engineers, so that’d be roughly 10-to-1 services-to-engineer. I imagine not all of those 300 engineers are service maintainers either so it’s probably even more? How sustainable has it been in your experience?
Are those 2800 different microservices or just instances of the same few hundred microservices?
2800 distinct services.
10-to-1 services-to-engineer
I think that’s roughly correct.
How sustainable has it been in your experience?
I haven’t really found it to be a significant problem in practice. I’ve worked at a different company where I was responsible for fewer services (maybe ~10 per team), but it was significantly harder to maintain because we lacked the high degree of consistency that Monzo services have.
Probably the biggest downside of this architecture is the fact that implementing transactions/locking is done at the application layer. But we have good off-the-shelf infra/tools that make this a lot less painful than I’ve seen elsewhere. On the flip side, product teams don’t have to worry about provisioning or maintaining any of the infra/storage themselves.
How do you approach testing flows that cross services? E.g.: “Given that I change service C, which is used in a business flow where service A calls service B, which calls service C, ensure the end-to-end functionality is not broken by my change.”
The main technique we use is what we call “acceptance tests”: essentially a test that tests business logic against a whole system. These are run periodically in our staging environment as well as being triggered by deployments.
A technique we have for testing in production (and staging) is “shadow testing”: before enabling a feature in production, we’ll often run it in “shadow mode”, automatically comparing its responses with the live responses and alerting if there is a discrepancy. We have libraries to make this easier to implement.
These are definitely all clunkier to implement than unit tests, but they’re also not something we have nearly as many of as unit tests (they sit higher up the testing pyramid), so we haven’t (yet) needed to improve the DX around this.
Also worth noting is that client code imports the generated protobuf server code, so breaking API changes are caught at compile time.
How do you organize your OpenAPI with respect to readability? How many endpoints do you have per microservice?
Do you have a unified OpenAPI UI that I can read through, or do I need to read it per microservice?
I’d guess the median number of endpoints per service is around 10, but it’s a long tail distribution.
We use protobuf to specify the API. Clients of a service actually import the generated protobuf go code, this means we get compile time checking and also some nice developer experience like being able to “jump to definition” for RPC calls.
I read the title and remarked to myself, when I was at an investment bank they had more than 4500 services tied together with centralized message queuing. Not knowing the “monzo” brand I thought, what are these people up to that they have as many as a bank? stfw. Oh, they are a bank.
I’ve chatted to a few Monzo engs in the past and it basically boils down to blurring the lines between what is a unit of business logic - as willsewell mentioned, their tooling essentially abstracts away a lot of the complexities of inter-service communication in such a way that writing a package that imports another package and calls a function feels remarkably similar to writing a service that calls a function over the network.
Which is, given they have the resource to do that, pretty neat. And I use Monzo daily as my primary bank, it’s rarely let me down!
This is outside my area of expertise so maybe I missed something, but it kinda seems that this whole post is based on a flawed premise?
That is, the post seems to be assigning blame for timing leaks on C compilers, when AFAIU both the C language and mainstream CPU architecture documentation make no promises about whether the usual language constructs and CPU instructions have any guaranteed timing?
In other words, it seems to imply that timing leaks occur because C compilers introduce branches when AFAIU timing leaks can in theory occur for any reason whatsoever with almost any language construct and CPU instruction.
From a quick search, I found this document which seems to imply that Intel CPUs have a special mode for executing some CPU instructions with timing guarantees, but from what I understood, the approach outlined in the post doesn’t even attempt to use such guaranteed timing modes, instead trying to achieve the same thing with portable code, which seems like a flawed proposition to me…
To make a bad analogy, the approach in the post seems to me like trying to prevent a leaking boat from sinking by relying only on navigation…
when AFAIU both the C language and mainstream CPU architecture documentation make no promises about whether the usual language constructs and CPU instructions have any guaranteed timing?
IF(!) this is accepted as a given, natural and non-negotiable, then the question arises: how are we, as an industry, supposed to write cryptographic code in practice?
We have two major choices: first, accept that it’s impossible to write crypto code using the commonly available C compilers. Second, attempt to renegotiate with compiler writers their approach to introducing compiler changes.
An argument in favor of that is that “make no promises” part is not a law of physics but a decision made by people. Maybe they could reconsider that decision along the lines that the cryptography people suggest.
We have two major choices: first, accept that it’s impossible to write crypto code using the commonly available C compilers. Second, attempt to renegotiate with compiler writers their approach to introducing compiler changes.
I think that’s a false dilemma and that there is a third choice: introduce an implementation-specific (or even standardize it at the C language level) new special mode of compilation for certain C functions (say, enabled with a function attribute), where only certain language constructs are allowed, and which forces C compilers to use a special mode of compilation that is guaranteed to use the special data-independent timing CPU instructions such as those in the Intel document I linked to.
I think this is the most reasonable architecture for achieving the goals that the blog post implies they want to achieve.
It would avoid crippling C compilers in all the rest of (non-cryptographic) C code, which doesn’t need those guarantees, but would still allow writing reasonably portable cryptographic C code that is guaranteed to be free of timing issues.
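As a purely hypothetical sketch of what that could look like at the source level (no such attribute exists in GCC or Clang today; the name is made up), the attribute would mark functions that the compiler must translate using only data-operand-independent instructions, without inventing branches or data-dependent memory accesses:

/* Hypothetical: nothing like this exists today, so it expands to nothing.
 * Under the proposed mode it would be a real attribute, e.g.
 * __attribute__((data_independent_timing)). */
#define CT_FUNC

/* Classic constant-time comparison: accumulate differences instead of
 * returning at the first mismatch, so the timing does not reveal where
 * (or whether) the inputs differ. */
CT_FUNC int ct_compare16(const unsigned char a[16], const unsigned char b[16])
{
    unsigned int diff = 0;
    for (int i = 0; i < 16; i++) {
        diff |= (unsigned int)(a[i] ^ b[i]);
    }
    /* 0 if equal, -1 otherwise, computed without branching on diff. */
    return (int)(1 & ((diff - 1) >> 8)) - 1;
}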
At least in LLVM I think this would require some substantial new cross-language machinery beyond just the C frontend. The backend doesn’t keep track of constant time properties / data-dependent timing, and many of the issues aren’t C specific (the optimizations can also apply to C++, Rust, etc). For example somebody noticed a few years ago that the x86 backend decides when to emit branch instructions vs cmov based on a cost model, but that can introduce data-dependent timing if it picks the “wrong” one.
I think that’s a false dilemma and that there is a third choice: introduce an implementation-specific (or even standardize it at the C language level) new special mode of compilation
That route would mean one of three things that I can think of:
1. Users must activate the relevant compilation flag. Sorry, nope: as a library writer who ships C code I cannot rely on my users to use any specific compiler flag.
2. Library writers must use the relevant attributes, and of course I cannot rely on my users sticking to a compiler that supports those attributes (okay, I can, but then I can kiss portability goodbye).
3. We must all wait for the standard to get its wits about it and give us what we actually need here. Let’s say 2 years for the standard to come out (2026), ten more years for it to become widely adopted (2036), and ten more years to finally achieve hegemony (2046).
This is pretty bad all around.
It would avoid crippling C compilers
Define “crippling”. How much performance would actual programs be leaving on the table? Can you, or anyone else, say confidently enough that the amount is anything beyond utterly negligible? It’s easy enough to imagine measurable performance increases in some micro-benchmarks, but it’s not clear to me that the authors of those “optimisations” have realised that they were increasing the load on the branch predictor, and could very well be increasing branch misprediction as a result, not to mention the possible increase in energy consumption. A tiny effect for sure, but one that might very well offset the benefits.
What’s actually crippling is letting questionable optimisations significantly reduce the applicability of the language. And no, you don’t get to blame the victims with “but but but the standard never said you actually could”, when in reality they successfully did, and compilers started to take that away.
In the end, that’s probably what will drive me away from C. There’s only so much undefined behaviour I can tolerate, and so few guarantees I can survive on, before I decide that Clang and GCC have gone crazy and I start to run away from anything they touch — which pretty much means all of computing these days, but I’m already tempted to bootstrap everything from FPGA & open source synthesis.
I think that’s a false dilemma and that there is a third choice
I just want to point out that maybe this was even a “false dilemma”, and there is a third choice (though I said “two major choices”, not the “two only choices”), but it does not make the first two choices non-existing. We can still choose to accept or choose to renegotiate (and we can also choose to standardize new pragma, as you point out).
There is a related debate: can the compiler remove range checks due to Undefined Behaviour? Compiler people insist that it “can do anything including rm -rf / yadda yadda yadda”. Security people are kindly asking to stop doing idiotic things and introduce security vulnerabilities.
The logic is I believe the same: some people decided that they can remove range checks explicitly present in the code due to some wording in language standard. They can undecide that.
I think the current (c/c++ compiler) interpretation of what is permissible when “discovering” UB is not reasonable, in any circumstance.
However there’s a much bigger problem in that C and C++ grossly overuse UB - the majority of UB, and certainly the UB whose removal would improve security and correctness (w.r.t. the host platform, not today’s language spec), has no business being UB - the C and C++ specs both have “implementation defined” and “unspecified” as mechanisms for describing behaviours that are not consistent across all platforms while also not being UB/“error”.
In WG21/C++ we’ve recently introduced “erroneous behaviour”, which is basically a compromise solution - there are people who want things like integer overflow to be an error, but if it’s unspecified or implementation defined, then overflow is “correct” and a conforming implementation could not report an error. Hence EB: “this is unspecified/ID, or you can report an error”.
Speculating from my understanding of the C standard, there are three major obstacles to defining behaviour:
When a supported platform goes bananas if the UB occurs (say the CPU reaches an incoherent state if it ever tries an unaligned load), then the UB extends to all platforms. There’s no such thing as “whether the behaviour is undefined or not is implementation defined”. If it’s undefined for someone, it’s undefined for everyone.
Traps, interruptions, signals… aren’t in the realm of “(implementation) defined”, not even “unspecified”. So you can’t say that division by zero is supposed to trap, the only way to express it is to leave it undefined.
Benchmarks are a measurable tally which it is extremely hard to argue against: if you tell compiler writers to stop their bullshit and enable -fwrapv by default, many will point out the reduced performance in this or that benchmark from SPECINT, however small.
Not all those obstacles will apply to all UB, but overall it’s a fairly tall order.
When a supported platform goes bananas if the UB occurs (say the CPU reaches an incoherent state if it ever tries an unaligned load), then the UB extends to all platforms. There’s no such thing as “whether the behaviour is undefined or not is implementation defined”. If it’s undefined for someone, it’s undefined for everyone.
Undefined Behavior is “you cannot define any behaviour”; the existence of a single platform where the outcome cannot be defined is not sufficient for the behavior to be UB on all platforms. It’s for “you cannot in any reasonable model guarantee an outcome”. For example OOB access, dereferencing (int*)random(), etc. have no coherent model. If one piece of hardware has “this operation randomises the CPU state” but every other platform is able to produce a self-consistent outcome, there’s no reason to say “this must be UB everywhere”.
Traps, interruptions, signals… aren’t in the realm of “(implementation) defined”, not even “unspecified”. So you can’t say that division by zero is supposed to trap, the only way to express it is to leave it undefined.
These are literally the exact reasons implementation defined and unspecified behavior exist. The requirement for ID or unspecified behavior is not “every platform behaves the same way”, it is that a platform has a consistent behavior e.g. trapping on division by 0 is acceptable, producing a signal value is acceptable, the requirement is just the division by 0 always behaves the same way. ID and unspecified behavior exists specifically for the purpose of describing behavior that differs between platforms but is not undefinable.
Benchmarks are a measurable tally which it is extremely hard to argue against: if you tell compiler writers to stop their bullshit and enable -fwrapv by default, many will point out the reduced performance in this or that benchmark from SPECINT, however small.
I mean if you introduce a new behaviour definition that drastically changes the language complexity, or introduces widespread security bugs in existing code, in exchange for optimizations that only meaningfully improve performance for a set of benchmarks or small group of very specialized pieces of code, is that a reasonable choice?
Okay, first, I mostly agree with your objections. In fact I’d support them myself: it’s kind of ridiculous that just because something is undefined in ARM it can’t be defined in x86. It’s kind of ridiculous that trapping behaviour isn’t brought into the realm of, if not defined, at least implementation defined or unspecified. And it is utterly ridiculous that (runtime) performance ends up being prioritised over correctness to such an extent.
Now a couple details.
Traps, interruptions, signals… aren’t in the realm of “(implementation) defined”, not even “unspecified”. So you can’t say that division by zero is supposed to trap, the only way to express it is to leave it undefined.
These are literally the exact reasons implementation defined and unspecified behavior exist.
The standard doesn’t seem to use them that way, currently. Any time anything traps, it is put in the undefined realm, for whatever reason. The moment trapping becomes a behaviour they accept to define is the moment it can be used for implementation defined or unspecified behaviour. I would love that to happen. Some work on the exact wording would likely be needed, but I would love for stuff like division by zero to stop being undefined like it currently is.
I mean if you introduce a new behaviour definition that drastically changes the language complexity, or introduces widespread security bugs in existing code, in exchange for optimizations that only meaningfully improve performance for a set of benchmarks or small group of very specialized pieces of code, is that a reasonable choice?
As I said, utterly ridiculous indeed. I can see two biases leading to this anyway: the first is that benchmarks are right there for all of us to see. Performance is easily measured and quantified in very precise ways. It’s even a bit addictive how it plays out, I’ve seen it happen in my own practice when writing my cryptographic library: optimising EdDSA was quite the power trip.
Now we can see the vulnerabilities as well, but then the second bias kicks in: status quo in the letter of the law. See, compiler writers are respecting the standard when they introduce insane exploitation of obscure UB. It’s written in the standard that they’re within their rights to do so, and they won’t hesitate to blame the programmers for relying on UB almost nobody knew about. It’s victim blaming, pure and simple, but with a twist: the law, written by compiler writers, is siding with compiler writers. It feels like “considering this exemption from November 1936 stating that short skirts mean consent, and since we’ve established yours were showing your calves, the court decides that (1) all charges are dropped and the five accused citizens are free, (2) the court’s fees are on you, and (3) we are now pressing charges for indecency.”
I mean you’re right: sometimes the law is bad and needs to change.
At first I kinda agreed with “people can undecide that”, but I think the problem is that for timing bugs it might be equivalent to “we can’t do any optimizations on any code”
Compiler developers are probably simply unaware of when they’re introducing timing vulnerabilities … it can happen anywhere, and new kinds of vulnerabilities are found on a regular basis
So it’s more like a cost-benefit analysis, and yeah probably some people take it overboard and chase speed at any cost
There probably need to be new compiler modes / new spec language introduced
On most platforms, the arithmetic operations cryptography cares about are all constant time. On some platforms multiplication might not be, but overall it’s not so bad. Thus, we’re asking only one thing from compilers: don’t introduce branches where the original code didn’t have any. Conditionals, boolean logic, even comparisons are fair game, but numeric shenanigans should stay numeric shenanigans.
Compiler developers are probably simply unaware of when they’re introducing timing vulnerabilities
But they are aware when they introduce branches where there was none.
it can happen anywhere,
Not quite. Timing is very specific: just don’t let information flow from secrets to timings. This means don’t let the timing of any operation depend on any secret. In practice, on a modern desktop CPU, this means avoid secret-dependent branches and avoid secret-dependent indices. That’s about it.
Now in theory, compilers could indeed rework the computation of indices such that they depend on different sources of information than they did in the source code. My understanding is that those would not optimise anything at all, the fastest way to compute something is not to introduce spurious dependencies. Branches are basically the same: existing branches have existing dependencies, and compilers generally have a vested interest in keeping those dependencies to a minimum, just so the branch could be computed most efficiently.
Really, all compilers have to do here is stop introducing branches where there was none.
new kinds of vulnerabilities are found on a regular basis
Not that regular. And almost all involve a leak from secrets to timings: like speculative execution of code the attacker is not supposed to observe influencing the timing of a process the attacker has access to.
There probably need to be new compiler modes / new spec language introduced
Those would certainly help. Writing constant-time selection, for instance, is a hassle; I’d rather just write my lookup more directly, something like CONSTANT_TIME_LOOKUP(array, index). The bigger problem here is that something, some important thing that used to be possible, is becoming less and less so.
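For reference, a rough sketch of what a hand-written version of such a lookup tends to look like today (CONSTANT_TIME_LOOKUP itself is just the hypothetical macro name from the comment above): touch every element and mask-select the wanted one, so the memory access pattern never depends on the secret index. Of course, nothing in the standard stops a compiler from “optimising” this back into a plain indexed load or a branch, which is exactly the complaint.

#include <stddef.h>
#include <stdint.h>

uint32_t ct_lookup(const uint32_t *table, size_t len, size_t secret_index)
{
    uint32_t result = 0;
    for (size_t i = 0; i < len; i++) {
        /* mask is all-ones when i == secret_index, all-zeros otherwise. */
        uint32_t mask = (uint32_t)0 - (uint32_t)(i == secret_index);
        result |= table[i] & mask;
    }
    return result;
}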
On most platforms, the arithmetic operations cryptography cares about are all constant time.
But they’re not, you just think they are.
This document, for example, makes it clear that actually having data operand independent timing would require disabling certain CPU features such as data-dependent prefetching and other such features which have a performance impact (and in future processors, increasingly so).
If you are not disabling these features in your code (and you are not, because you can’t, unless you use DOIT on newer Intel CPUs which probably is not even possible to use in userspace yet) then the instructions that you think are constant time, aren’t.
Even then, when using instructions with data-independent timing you might still be vulnerable to other data-dependent side channels such as power, thermal or frequency-based.
It looks like you misunderstand what cryptographic code actually requires. When we write constant-time code, we don’t just avoid secret-dependent branches, we also avoid secret-dependent indices.
The idea was originally popularised by attacks on AES, which had been accepted on the bogus assumption that memory accesses were constant time. It turned out that caches are a thing, and cache-timing attacks could recover an AES key in milliseconds or less.
One might think the cache was the problem, but that’s not quite it: the problem is that the memory access patterns of a naive AES implementation ultimately depend on the secret key. To run securely, an implementation needs to stick to data access patterns that can be guessed from public information only: the length of the text is fair game, the ciphertext is fair game, but the secret key and plaintext are not. And in practice, modern algorithms use data access patterns that only depend on the length of the input text. Off the top of my head I can cite Salsa20, ChaCha20, BLAKE (all versions), SHA-2 and up, Argon2i…
So yeah, the CPU will indeed do prefetching magic and whatnot, and it will indeed use the value of the index, or anything this value depends on, to fetch that data. But if the value of the index (or pointer) doesn’t depend on any secret, then the resulting timings won’t either. They’ll vary for sure, but the only info they’ll leak would have been public to begin with.
Even if the CPU doesn’t depend on the actual value of the index, but on the values it might take on speculatively executed branches, in actual cryptographic code such indices are still secret independent, and the timing information would still only reflect public data.
The prefetching system would need to use a seriously overreaching heuristic to start depending on actual secrets even though there is no data flow from those secrets to the index regardless of branching. In fact, I currently seriously doubt the usefulness of such an overreaching heuristic: whatever it does has to run reasonably fast to even prefetch in time, so it probably makes sense to stick to values the index takes in actually (and speculatively) executed branches.
Most importantly though, there’s no way in hell CPU designers aren’t already aware of this problem. They have to know that letting such a surprising timing leak through will inevitably lead to a stream of hard-to-mitigate vulnerabilities. They can’t just hide behind “timings aren’t part of the ISA, not our fault cryptographers don’t know how to read a data sheet”. Well, I guess they can, but then they really deserve to be sued.
Even then, when using instructions with data-independent timing you might still be vulnerable to other data-dependent side channels such as power, thermal or frequency-based.
Let’s get real, outside of smart cards or similar, fancy side channels such as temperature, power, even electromagnetic emissions, only matter to the extent they actually influence timings. The attacker has to access those side channels one way or another. If it can read your sensors, I guess it might exploit them.
Frequency throttling based on temperature however, that’s a hard one. Hopefully it’s low-resolution enough that the power variations resulting from the repeated arithmetic operations of a cryptographic primitive are too low-resolution to leak any meaningful information, but I do confess I am slightly worried about that.
It looks like you misunderstand what cryptographic code actually requires.
I don’t think I misunderstand, I’ve known about what you wrote for many years already.
What you wrote is only a subset of what is written in this document, for example.
That document also contains other details which you have to get right so as to not leak secret information in your code.
Most importantly though there’s no way in hell CPU designers aren’t already aware of this problem.
They are, hence why the document I linked to, as well as the Intel DOIT document and other related documents (e.g. frequency side channel mitigations, etc) were created.
They have to know that letting such a surprising timing leak through will inevitably lead to a stream of hard-to-mitigate vulnerabilities.
They do know. Why do you think Intel DOIT was created in the first place?
Note that what you wrote is one of the prerequisites for using Intel DOIT. That is, on top of avoiding secret-dependent branches and indices, you also have to use something similar to Intel DOIT to avoid secret-dependent timing leaks. Think about it. If that were not true, why would Intel DOIT exist in the first place?
I think you are underestimating how complex modern CPUs are.
Frequency throttling based on temperature however, that’s a hard one.
Yes, and the temperature is correlated with power usage, so if certain secret keys cause more power usage than others (even if all the code is constant-time), you still have the same problem…
What you wrote is only a subset of what is written in this document, for example.
I’ve just read it from start to finish. Excellent document, I learned absolutely nothing. These are just the classics I had to go through writing my cryptographic library. I did like the formalisation in “Secret Independent Runtime”, “Secret Independent Code access”, and “Secret Independent Data access”.
Think about it. If that were not true, why would Intel DOIT exist in the first place?
My first guess is, they made questionable trade-offs before they became actually aware of security problems. Hence stuff like SPECTRE. If now they’re doing stuff that is screwing code that is already SIR/SIC/SID, I would suggest they stop. Maybe extend the notions of SIC and SID to speculative execution as well (as far as I know my code and libsodium’s are both okay on that front — apart of course from being written in C). But beyond that? Please stop.
Or at least wait for cryptographic accelerators to become ubiquitous, even if it means waiting for 20 years.
If you follow this link, you’ll realize it’s a link to the Intel DOIT document. That is, the “instructions whose latency is invariant to the data values” is not what you thought it was, based on your previous comments.
Many software developers think that instructions such as ADD have data operand-independent timing, but the actual hardware vendor of the CPU is telling you that this is not true (in normal operation), i.e. you have to set a special bit in the CPU that disables certain CPU features, in order to actually access such constant-time instructions.
My first guess is, they made questionable trade-offs before they became actually aware of security problems.
I agree. I understand that historically (when CPUs were simpler), what you wrote was true. But right now I’m only pointing out the current status of this situation, taking into account the complexity of modern CPUs.
but beyond that? Please stop.
Well, I understand this sentiment, but I don’t think the need for speed will end anytime soon, especially since CPU frequency increases have slowed down considerably (compared to historical trends)…
It’s a basic fact of life that faster / more efficient CPU = way more CPU sales.
Edit: also, this part of the document validates some of the points I’ve brought up and it also offers… well, an interesting suggestion at the end:
Another important observation is that a change to compiler flags might undo all our efforts to create data independent execution! Until compiled languages create primitives for data-independent execution, this will always be a problem. Although it is inconvenient, the safest solution to this is to write the secret-dependent code in assembly language. Another option is to validate the compiled code as part of the build process and create an automated trigger if the compiler decides to change the output code.
Many software developers think that instructions such as ADD have data operand-independent timing, but the actual hardware vendor of the CPU is telling you that this is not true (in normal operation), i.e. you have to set a special bit in the CPU that disables certain CPU features, in order to actually access such constant-time instructions.
They’re insane.
64-bit multiply on a 32-bit CPU I can maybe understand, but addition? They’re insane.
Edit: also, this part of the document validates some of the points I’ve brought up and it also offers… well, an interesting suggestion at the end:
Compiler writers are insane too.
I mean, sorry about the curiosity stopper, but this is all going way, way too far. We should rethink it all from the ground up at this point. Thanks for your time.
The logic is I believe the same: some people decided that they can remove range checks explicitly present in the code due to some wording in language standard. They can undecide that.
This is a common complaint but I think this is already easy to fix if you’re so inclined. The trick is to realize that you don’t have to rely on the default optimization options of C compilers. There are already C compiler flags that you can enable not only to preserve those checks (e.g. by converting signed integer overflows into defined behavior, say with -fwrapv), but also to automatically introduce checks whenever (say) signed integers overflow (e.g. with -ftrapv) or whenever you try to access buffers out-of-bounds.
The reason you don’t see Linux distributions enabling these compiler flags for building C software is because these checks cripple performance, which is also why C compilers behave this way by default in the first place. But again, if you’re concerned about this, you can just enable these options when compiling. This is even easier in distributions such as Gentoo or NixOS, in which you can easily choose what compiler flags to use for compilation (at the cost of having to build software locally instead of relying on distro build servers).
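For a concrete (and well-worn) example of the kind of check involved, assuming GCC or Clang: with the default flags the compiler may fold this test to “false”, because signed overflow is UB and therefore assumed never to happen, while with -fwrapv the wraparound is defined and the check survives.

#include <limits.h>

/* Intended post-hoc overflow check. Built with -O2, GCC/Clang may
 * reduce this whole function to "return 0;"; built with -O2 -fwrapv
 * it keeps the comparison and reports the wraparound. */
int add_would_wrap(int x)
{
    int y = x + 100;
    return y < x;
}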
The problem is that if you write code that is only “correct” if you use compiler arguments then your code is not conformant C/C++, and on compilers that don’t have that flag your “correct” code may become incorrect.
The real kicker though is the interpretation of what a compiler is allowed to do in the event of UB - because as you migrate to a different compiler you suddenly have entire code paths removed
The problem is that if you write code that is only “correct” if you use compiler arguments then your code is not conformant C/C++, and on compilers that don’t have that flag your “correct” code may become incorrect.
If you want your code to be portable then you can still write standards-conformant C/C++, and if you want, you can even do signed integer overflow with defined semantics, since it’s possible to do it in portable C/C++ (using compiler intrinsics for speed when they’re available; I think the newest version of C also has these as built-in operations).
Then you can additionally use -fwrapv if available, just in case.
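A sketch of the portable route being described, assuming a compiler new enough for the GCC/Clang builtin (C23’s ckd_add in <stdckdint.h> has the same shape):

#include <stdint.h>

/* Wrapping signed add with defined semantics on every conforming compiler:
 * do the arithmetic in unsigned, where wraparound is defined, and convert
 * back (two's complement in practice, implementation-defined pre-C23). */
int32_t wrap_add_i32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Checked add: returns 1 on success, 0 on overflow. Uses the fast
 * intrinsic where available, a portable pre-check otherwise. */
int checked_add_i32(int32_t a, int32_t b, int32_t *out)
{
#if defined(__GNUC__) || defined(__clang__)
    return !__builtin_add_overflow(a, b, out);
#else
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b))
        return 0;
    *out = a + b;
    return 1;
#endif
}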
No, you cannot write standards conformant C/C++ and then use -fwrapv if it’s available, just in case - the whole point of this discussion is that if fwrapv changes the behavior of your program at all, your code is non-conforming. The problem is the broken interpretation of what “UB” means, especially given the compiler can only actually do any of these optimizations for the specific cases of UB that have no business being UB in the first place. The modern compiler interpretation of UB means that any path, no matter where, after inlining (including LTO) that leads to “UB” - even if the behavior is well defined on the target - can be arbitrarily removed.
e.g.
int f(int i) {
    if (i == 0) {
        printf("zero!\n");
    }
    return i % 0;
}
Can be “correctly” compiled to remove the printf. Moreover, if you replace the printf with, say, “myErrorHandler()”, that could be removed anyway - only in C11, pretty much a decade after the introduction of “the compiler can say UB never happens”, did an actual standard attribute get introduced to make it possible to guarantee the compiler would not be able to remove it.
In this case it is super obvious that it has happened, but it does not need to be, and frequently is not - C’s copious overuse of UB, coupled with an adversarial interpretation of UB, means you cannot reasonably say any large codebase is free of UB.
The only way to remove this nonsense interpretation is per-compiler, non-standard, mode flags - the linux kernel already does this specifically because of the endless security vulnerabilities this interpretation results in.
But let’s get to the other part of this: when do these “optimizations” actually occur? Only in cases where the UB can be proved, in which case, by the definition the compiler is using (the removal of code paths leading to UB is only valid under the interpretation “any code that causes UB is erroneous, and so can be assumed cannot happen”), it should be producing an error during compilation.
The only place they consistently do not result in security bugs, and do produce beneficial performance, is in for loops, and that’s the scenario that actually introduced this nonsense interpretation of what could be done in the face of UB. Specifically, a bunch of major compiler/codegen benchmarks use for(int ….) loops, where you can ostensibly optimize in useful ways if you pretend overflow is impossible (constant folding, etc). The fun thing is that this is why C++ at least eventually made only unsigned overflow stop being UB: you can make that change without hurting benchmarks. I’m not sure if C has done this yet.
No, you cannot write standards conformant C/C++ and then use -fwrapv if it’s available, just in case - the whole point of this discussion is that if fwrapv changes the behavior of your program at all, your code is non-conforming.
Yes, in this scenario the point was to try to write conformant C/C++ and the -fwrapv is only there to catch accidentally non-conformant code. This does not mean that your code has to rely on the behavior of -fwrapv, it only means that the -fwrapv is there to catch unforeseen bugs (when this option is available).
So in this scenario, your accidentally non-conformant code is incorrect with all compilers, it’s just that in those with -fwrapv the consequences would be less serious (if you are to believe what most people say about UB, that is).
the definition the compiler is using (…) should be producing an error during compilation.
Well, consider this slight variation of your code:
int f(int i) {
    if (i == CONFIG_VALUE) {
        printf("some message!\n");
    }
    return 100 % i;
}
In this function, when CONFIG_VALUE is zero the compiler is free to remove the printf() but there’s no reason why this should result in a compiler error, because this code may well be 100% correct and never call f() with zero as an argument.
If code such as this would result in a compiler error, it would just produce a bunch of false positives (especially when using macros and inlined code, I suppose) and I don’t think you (or any other developer) would like that.
The fun thing is that this is why C++ at least eventually made only unsigned overflow stop being UB: you can make that change without hurting benchmarks. I’m not sure if C has done this yet.
Unsigned overflow is not UB in C and as far as I know, it never was…
In this function, when CONFIG_VALUE is zero the compiler is free to remove the printf() but there’s no reason why this should result in a compiler error
This is literally the absurd compiler interpretation of UB we’re complaining about: a person has written code explicitly to check for a value, and the compiler is saying that path cannot happen, not because it has proved the value cannot be 0, but because later on the value being 0 would be UB, and therefore an error. So the correct thing to do would be to diagnose (and for most serious projects any diagnostic becomes an error) that this code path is statically determined to lead to an error. If the compiler is unwilling to report an error, it should not pretend that the code cannot happen.
If code such as this would result in a compiler error, it would just produce a bunch of false positives (especially when using macros and inlined code, I suppose) and I don’t think you (or any other developer) would like that.
I would 100% take any amount of compiler errors over the compiler introducing vulnerabilities and errors based on a known-false heuristic. More or less every C/C++ project I have worked on compiles with warnings-as-errors, and with more or less every possible warning enabled.
That said, you’re right, that much noise would be annoying, and maybe it would get compilers to stop trying to use this absurd definition of UB. The issue is not “the above code should produce an error”; the issue is that the compiler has decided to say UB is an error, and therefore any code path that implies later logic would be UB is not possible, with no logic to support the claim.
Unsigned overflow is not UB in C and as far as I know, it never was…
Ah so this was just an obnoxious C++-ism, and we just have to continue to deal with the gross obsession with retaining it for signed integers in both.
the compiler is saying that path cannot happen, not because it has proved the value cannot be 0, but because later on the value being 0 would be UB
The compiler has to prove that the code it emits (for whatever target) satisfies the semantics of the language that the source code is written in. If that code would have undefined behaviour then the proof for that case is trivial, since any target code would satisfy the semantics.
“The value being 0 would be UB” is not a thing. If there is UB, there is no value after that point. (Sorry, that was badly put). It’s correct that the compiler has proved that the value being 0 would have resulted in UB and therefore there are no semantics in that case.
The compiler has to prove that the code it emits (for whatever target) satisfies the semantics of the language that the source code is written in. If that code would have undefined behaviour then the proof for that case is trivial, since any target code would satisfy the semantics.
No. Prior to the 2000s, something being UB meant “you don’t get guarantees of what happens when you trigger UB”, after the 2000s the behaviour became “UB is an error, and if a path implies UB later on that path cannot happen”.
The semantics of the language say you don’t get a guaranteed outcome after triggering UB; time-traveling UB was an invention long after C and C++ had been standardized, when the semantics changed from “you don’t get a guaranteed outcome” to “the compiler can claim the existence of UB proves this code path cannot occur”. If a compiler can prove UB occurs, and the compiler is making codegen decisions on the basis that UB is an error, then it should be diagnosed.
“The value being 0 would be UB” is not a thing. If there is UB, there is no value.
You’re just restating the definition I’m arguing is wrong, not justifying it.
No. Prior to the 2000s, something being UB meant […]
The standard was clear that UB imposed no requirements since ’89.
after the 2000s the behaviour became “UB is an error, and if a path implies UB later on that path cannot happen”
After about that time compilers did start to make more liberal use of transformations whose soundness depended on the fact code exhibiting UB has no semantics. That’s not quite the same thing. The distinction is subtle but, since you seem to be implying that you think compilers are behaving incorrectly, it is important.
The semantics of the language say you don’t get a guaranteed outcome after triggering UB, time traveling UB was an invention long after C and C++ had been standardized
The C standard was never particularly clear about “time travelling” UB and perhaps never allowed it. I recently saw a proposal to forbid it (via a non-normative footnote), which I believe was accepted for C ‘26. There may be some cases where GCC/Clang get this wrong although I’ve found it quite difficult to trigger them with current versions. I would be interested to see a counterexample, if you do have any.
In C++ the standard (at least since ‘11, I’m not sure about ‘98/‘03) clearly specifies that there are no requirements for an implementation on an execution of a program which exhibits UB at any point. That’s an explicit allowance for “time travelling UB”. I don’t know if compilers started implementing this before it became part of the standard, but in any case I have also found it very difficult to produce a case showing, with recent versions of available compilers, an instance of this “time travelling” UB.
Edit: sorry, I missed that the example under discussion is actually about “time travel UB”. As I said above, this is explicitly allowed in C++, but at least three compilers in common use (GCC, Clang, MSVC) do not make use of it, at least for this specific example. So far from an “absurd compiler interpretation of UB” it is if anything an example of compilers very clearly not taking full advantage of leeway that is very clearly provided to them.
the semantics changed from “you don’t get a guaranteed outcome” to “the compiler can claim the existence of UB proves this code path cannot occur”
Seems to me that a large part of the problem is that that is true but so crazy that a lot of people have trouble believing it. Being true but crazy and therefore implausible is common in some of the other fields I’ve worked in but maybe not so common in programming.
If a compiler can prove UB occurs, and the compiler is making codegen on the basis that UB is an error, then it should be diagnosed.
When the compiler can prove that UB will occur, it will show you an error.
But this almost never happens (except in very limited circumstances like when the values are constant) because the compiler almost never can prove that UB occurs.
This example code in my comment shows what I mean.
Here, the compiler cannot prove that f() will be called with zero as an argument. But it will still remove that “if” check when CONFIG_VALUE is zero because it knows that this will have no effect on the program, since if i was zero, the program would crash [1].
UB would only be performed if f() is called with zero as an argument. But the surrounding code may never call f() with zero as an argument, so the code may well be 100% correct, which means it doesn’t make sense to emit a diagnostic (this would lead to very annoying false positives in all sorts of code, including almost all code containing signed arithmetic).
[1] Strictly speaking, it won’t remove the check because the C compiler cannot prove that printf() won’t have an effect on the program, but for argument’s sake, imagine that printf() was a side-effect-free function which the compiler could observe.
So the correct thing to do would be to diagnose (and for most serious projects any diagnosis becomes an error) that this code path is statically determined to lead to an error. If the compiler is unwilling to report an error, it should not pretend that the code cannot happen.
But the code path does not lead to an error. Remember, the surrounding code never calls f() with zero as an argument, so the code is correct. In other words, i != 0 is a precondition for the f() function.
The exact same logic that allows the compiler to remove the 0 check earlier allows it to remove the >= 1000 check in this code.
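(The “>= 1000” example being discussed isn’t quoted in this part of the thread; it presumably has roughly the following shape. Every name below is a placeholder, not the original code.)

    /* Placeholder reconstruction -- NOT the original snippet from the thread. */
    extern void doSomething(void);
    static int buffer[1000];

    int g(int n) {
        if (n >= 1000) {
            doSomething();   /* only runs when n is out of range...       */
        }
        return buffer[n];    /* ...and this read is UB whenever n >= 1000 */
    }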
In general, that’s not true. doSomething() might contain an infinite loop (that does I/O), or it might call exit(), or abort(), etc. – so C compilers cannot remove the >= 1000 check in this code because it would affect the behavior of the code.
It would only be true if doSomething() does something “inconsequential” (e.g. if it’s a pure function) and the compiler chooses to inline its code, including any functions it calls (recursively).
Although, now that I realize it, the same issue exists with the example I gave, but I guess my point would still be valid if you replace printf() with some other code that the compiler can fully observe and that doesn’t affect the control flow.
Do you think this is reasonable?
Let’s assume that doSomething() does something “inconsequential”, i.e. that would not affect the control flow, etc, and that the C compiler chooses to inline it. Or replace doSomething() by such inconsequential code.
Then the answer is “yes”, I think it’s reasonable to remove the check because it would lead to a crash anyway and therefore preserving the check would not affect the behavior of the code.
doSomething() might contain an infinite loop that does I/O,
Infinite loops are UB, so the compiler is free to assume all functions complete (and return, prior to the introduction of _Noreturn in C11); this is why it can remove the (i == 0) branch in the % i example. Which is my point: if the code can be removed in the % 0 case, it is just as conforming to remove it in the >= 1000 example.
It’s not a matter of “does the function have side effects” or “can it be inlined”. The definition of what is permissible with UB (according to compilers) is that the fact that 100 % i is undefined when i is 0, or that the indexing is undefined when the index is >= 1000, means the compiler can claim it has proved the condition cannot occur. It is not a matter of “does the function do anything, or have side effects”, because the existence of the UB “proves” the value cannot be 0, or >= 1000 (depending on the example).
Let’s assume that doSomething()
Again we don’t need to, because it cannot be called, because for it to be called requires that the index has a value that leads to UB, which “proves” it cannot have that value.
That’s literally the problem we’re complaining about: the compiler implementations decided in the 2000s (the earliest example of this interpretation of UB is 2005) that rather than saying “we cannot guarantee what will happen if you trigger UB” they could switch to “if we discover UB we can claim that it proves a path does not happen”.
No, they’re not. Infinite loops are well-defined behavior in C.
There is a clause introduced in C11 which states that a loop which does not do I/O, does not access volatile objects and does not perform synchronization or atomic operations in its body may be assumed to terminate, but this is still well-defined behavior.
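A minimal sketch of the distinction being drawn here (the names are made up; the rule paraphrased above is C11 6.8.5p6):

    #include <stdbool.h>

    extern volatile bool device_ready;

    unsigned collatz_steps(unsigned n) {
        unsigned steps = 0;
        /* No I/O, no volatile access, no atomics, and the controlling
           expression is not a constant expression: per C11 6.8.5p6 the
           implementation may assume this loop terminates (even though
           nobody has proved that it always does). */
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            steps++;
        }
        return steps;
    }

    void wait_for_device(void) {
        /* Reads a volatile object, so the termination assumption does not
           apply; the loop must be preserved as written. */
        while (!device_ready) { }
    }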
Again we don’t need to, because it cannot be called, because for it to be called requires that the index has a value that leads to UB, which “proves” it cannot have that value.
No, there is no undefined behavior if doSomething() calls exit(), for example. The return buffer[n] statement would never be reached.
So the compiler is not allowed to remove that check.
To be honest, I think you are very confused about what UB is and how it works… while UB can sometimes have surprising consequences, it’s not as big of a demon as it’s been made out to be, lately.
To be honest, I think you are very confused about what UB is and how it works… while UB can sometimes have surprising consequences, it’s not as big of a demon as it’s been made out to be, lately.
You’re right, I don’t know anything about UB, just the fallout of the invention of proof-by-UB.
You could argue that the real issue is the gross abuse/overuse of UB in C/C++ which exposes more or less all non-trivial code to UB, but the reality is that none of the proof-by-UB driven optimizations apply to any of the cases of legitimate use of UB in C/C++.
But I also accept that at this point your opinion is that proof-by-UB is legitimate, and we’re never going to agree. My opinion is that the compiler should not be an adversary, and proof-by-UB, especially when UB is so trivial, is adversarial.
but the reality is that none of the proof-by-UB driven optimizations apply to any of the cases of legitimate use of UB in C/C++.
I’m sorry but I don’t even understand what you’re saying. What the heck is “legitimate use of UB in C/C++”?
If your code performs UB, then it doesn’t have defined semantics. It is a bug and as far as the C language is concerned, it is basically equivalent to a crash. You’re asking the computer to do something invalid.
So there is no such thing as “legitimate use of UB in C/C++”.
My opinion is that the compiler should not be an adversary
Sure, I agree.
and proof-by-UB, especially when UB is so trivial, is adversarial.
The compiler is not doing anything other than optimize your code. If you think the compiler is being adversarial because it removes code that would have no significant effect on your program, then I don’t know what to tell you…
I’m sorry but I don’t even understand what you’re saying. What the heck is “legitimate use of UB in C/C++”?
It is a stunning example of unbelievably terrible/confusing grammar/writing :D
What I was meaning was legitimate use of “undefined behavior” in the language specification, not “legitimate use of UB in code”, because as you say that would be nonsense :D
i.e. the C and C++ spec both use UB in a wide array of cases where the behavior should be either unspecified or implementation defined, and has no legitimate reason to use UB.
Then I have a solution for you: pretend that the C language specification says “unspecified” or “implementation defined” instead of UB.
As long as you’re not looking to rely on implementation-defined behavior (which would make your code non-portable), the effect is the same from the developer’s point of view.
Literally the whole point of this entire thread is that you can’t do that specifically because of the “compilers can optimize on the assumption the UB cannot happen”.
Take the canonical example (note that this is not a time-traveling-UB scenario, it is just a UB vs. implementation-defined/unspecified example):
if (a + b < a) { ... }
The only reason a compiler can remove this check is explicitly because overflow is UB, you cannot “pretend it is unspecified/ID”. If overflow is unspecified or implementation defined that compiler cannot just pretend it does not happen.
The semantics of an operation being specified as UB vs. unspecified or implementation-defined are completely different.
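As a concrete illustration of that difference (a sketch; exact behaviour varies by compiler and optimization level):

    int overflows(int a, int b) {
        /* Intended as an overflow check, but with signed overflow being UB
           the compiler may legally rewrite "a + b < a" as "b < 0", which
           never reports an overflow; with -fwrapv (or with unsigned
           operands) the wraparound comparison has to be kept. */
        return a + b < a;
    }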
In this function, when CONFIG_VALUE is zero the compiler is free to remove the printf()
I think that is true for C++, but not necessarily for C. In any case, it’s worth noting that neither GCC nor Clang (14.1/18.1.0 respectively) remove the call to printf() in this example, for either language, at -O3 or at -Os.
Do you realize that this would produce a warning for any code that contains a division by a variable?
Extend the same thing to the other arithmetic operators and every time you use a variable in signed arithmetic (addition, multiplication, etc), you’d have a warning due to possible overflow.
I think that seems a bit too much especially when you can’t prove to the compiler that no mistake would actually happen.
It should only produce a warning when I have not proved to the compiler that the divisor is nonzero with a test or an assertion.
It’s fine if you think it might be too annoying for you (so don’t turn it on) and it’s fine if you think it might be too annoying for other arithmetic operators (I probably agree) but nevertheless I think it would have helped me avoid bugs.
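A minimal sketch of the kind of “proved to the compiler” code meant above (names are illustrative; whether a given compiler or static analyzer actually uses the assertion to suppress such a warning is another matter):

    #include <assert.h>

    int per_item_cost(int total, int items) {
        assert(items != 0);       /* or: if (items == 0) return 0; */
        return total / items;     /* divisor is known non-zero on this path */
    }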
when do these “optimizations” actually occur? Only in cases where the UB can be proved
As I understand it, the logic in the compiler is more like the optimizations assume UB never happens. This is obviously false, therefore the optimizations are unsound, therefore they easily produce complete nonsense.
My uncharitable view is that UB is used as a shortcut so that optimizations are easier to write because they don’t have to consider the tricky cases. If the compiler writers are aiming to make optimization easier, they aren’t going to bother proving that UB might happen before exploiting the consequences. The nonsense compilations are capricious because the compiler deliberately fails to understand the program.
As I understand it, the logic in the compiler is more like the optimizations assume UB never happens. This is obviously false, therefore the optimizations are unsound, therefore they easily produce complete nonsense.
Your understanding is not quite correct. The compiler acts according to the principle (which is explicitly stated in the language standard, so it’s more than an assumption) that, if UB has happened, there are no requirements that any particular behaviour should be observed.
One consequence of this is that a compiler can eliminate from consideration, when deciding what target code to emit, any code paths that necessarily exhibit UB. People do often describe this as “assuming the UB never happens” but technically that is wrong. The compiler does not assume that UB never happens, it just applies the fact that there are no requirements for behaviour to cases where it can prove that UB does happen.
The nonsense compilations are capricious because the compiler deliberately fails to understand the program.
Compilers do not deliberately misunderstand programs. The claim that they do implies that you think the compiler should have an understanding of the language that is different from what the very document that describes the language’s semantics says it should have. Different compilers trying to guess at what a programmer might have intended is a path to madness and is exactly why the standard exists.
I don’t think the compiler writers can claim that they are not deliberately misunderstanding the program when they wrote the specification that says they are allowed to misunderstand the program.
For optimizations around things like integer ranges, it may be doable in many cases for the compiler to prove things rather than assume things. But this view really starts to break down when you consider UB from stray pointers.
For example, let’s say you’re writing a function that takes a couple of pointers as parameters. One of them points to an input array, and the other is to an output array. The optimizer would like to be able to perform common subexpression elimination, hoist expressions out of loops, re-order things across loop boundaries for vectorization, split or fuse loops, cache values from the input array in registers, etc.
These are all totally doable, highly profitable optimizations, as long as the input and output pointers do not alias. But if writes into the output array might change the input array, suddenly the optimizer’s hands are tied behind its back. For example, those writes might be in between two otherwise-equivalent expressions, which must both be computed just in case. In a loop that writes to the output array, expressions that would ordinarily be identical across every iteration must now be performed repeatedly, just in case. Many values will have to be reloaded from memory over and over, just in case.
Things get even worse when you go beyond bare arrays- for example, if you took two std::vector<int>&s, can the data pointer of one alias with the length or capacity of the other? Of itself?
C and C++ try to address this problem with type-based alias analysis aka “strict aliasing,” where pointers of one type are assumed not to alias with pointers of incompatible type. This is annoying in the tiny fraction of code where you’re trying to mess with byte-level representations, but there are workarounds there, and it’s otherwise an easy rule to understand and follow. Expecting the compiler to prove this is unreasonable- it would require intractable analysis and whole-program compilation.
And it’s still not enough- Fortran is famously better than C at this kind of array crunching because it instead has a rule that function parameters shouldn’t alias. This means even if the input and output arrays have compatible types, the optimizer can still perform these transformations.
(Of course C also has restrict to cover the cases that strict aliasing does not. But this is merely a question of default syntax: the compiler can’t verify that restrict is used correctly any more than it can verify that strict aliasing rules are followed. If you want the compiler to verify this kind of thing you need to add something like Rust’s lifetimes to the type system.)
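For what it’s worth, a minimal sketch of the kind of promise restrict lets the programmer make (the function and names are illustrative):

    /* The restrict qualifiers promise the compiler that out and in never
       alias, so in[i] can stay in a register across the store to out[i]
       and the loop is straightforward to vectorize. */
    void scale(float *restrict out, const float *restrict in, int n) {
        for (int i = 0; i < n; i++)
            out[i] = 2.0f * in[i];
    }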
I think it would be safer for compiler writers to provide tools to the programmer so the programmer can explicitly allow the compiler to perform optimizations that the compiler cannot otherwise prove to be sound.
Yes, I also prefer this approach in a vacuum- you can get some nice properties out of it, as seen in e.g. Rust. It is quite a deep change to the language, though, and compiler writers still need to do something useful with all the C and C++ code we’re still using.
The modern compiler interpretation of UB means that any path, no matter where, after inlining (including LTO) that leads to “UB” - even if the behavior is well defined on the target - can be arbitrarily removed.
C code that exhibits UB does not have well-defined behaviour on any target. The behaviour of C is defined quite specifically in terms of an abstract machine. The implementation (compiler) generates code on a real target to satisfy the semantics of the program on that abstract machine, of which there are none if UB is invoked.
The fun thing is this is why C++ at least eventually made only unsigned overflow stop being UB: you can make that change without hurting benchmarks. I’m not sure if C has done this yet.
Unsigned overflow is well defined in C since at least C99. I’m not sure unsigned overflow was ever undefined, but I haven’t checked the oldest standards.
Yeah, I had to look this up; it turns out C++ just had what is arguably a spec error in not matching C with respect to unsigned overflow. However, it doesn’t negate that there’s no reason for signed overflow to be undefined except for it helping benchmarks.
If you want your code to be portable then you can still write standards conformant C/C++
But you can’t rely on it being constant time, while 10 years ago you mostly could. The standard didn’t consider it, and now compiler writers are introducing regressions left and right because the “law” says they can.
The trick is to realize that you don’t have to rely on the default optimization options of C compilers.
I do. I ship C libraries (okay, just the one), and as such can’t rely on my users to activate those flags.
Even if I tried and wrote in big bold blinking letters that the compilation flags used in my build system must absolutely be reproduced, this would still be a violation of the least astonishment principle. If I say I wrote my thing in C, users will likely expect that it is actually written in C, and treat any deviation from strict conformance as a bug, as well as shout on all their forums about how irresponsible I am for shipping code that doesn’t follow the standard. I mean, I already got some shit for relying on implementation behaviour, even though (i) no one could name a platform that behaved differently, and (ii) the exact same choice had already been made by DJB himself.
If I say I wrote my thing in C, users will likely expect that it is actually written in C, and treat any deviation from strict conformance as a bug
Then write standard C and treat any deviation from strict conformance as a bug.
Then, additionally, use -fwrapv in your build system (well, you can use -ftrapv, UBSAN, etc, for running tests) and recommend in your README/INSTALL documents that users also use that flag. The point in this case is not to rely on -fwrapv, but just use it when it’s available to diminish the consequences of bugs in your code.
Judging by the amount of debate that has been happening around this, there seem to be different assessments of the overall situation. I’m pretty sure that everyone involved is actually aware of -fwrapv, so apparently there is more to that.
I’m pretty sure that everyone involved is actually aware of -fwrapv, so apparently there is more to that.
When searching for comments in that thread that mention -fwrapv, the only remaining complaint I see is whether -fwrapv should be enabled by default or not. To be clear, there is one comment reporting that this option makes code 2x slower in one instance that they checked, so this option is not free: the cost vs benefit of enabling this option by default is subjective, and depends on how much you value security over performance.
But again, you can create your own Linux distro or variant that enables this option. As I mentioned, some Linux distros already allow you to enable this option globally (at the cost of having to build software locally). And judging from one comment in that thread, certain software such as the Linux kernel already enables this option by default and there are open bugs for Firefox and Chrome to enable this option as well (at least 1 year ago, apparently). If you’re concerned about this, it might be worthwhile to check those bugs to see whether they have enabled this option (or why they haven’t).
Yes! Or you could convince compiler writers to rethink their choices.
Good luck with that :) As I said, the decision is subjective so not everybody agrees with you. Another suggestion for taking matters into your own hands instead of complaining about someone else’s projects: you can fork gcc and clang, change -fwrapv to be enabled by default, publish the forks with names such as gcc-secure / clang-secure (although, check for trademarks first) and otherwise keep them in sync with upstream.
If you really think your forks will be better than the original projects, people will adopt them and they will either supplant the original projects over time, or force the original projects to make the same change. That’s how free software works, right? It would be similar in spirit to certain aspects of what happened with the whole gcc, pgcc and EGCS saga, only it would be 100x easier to do since you’d only be making a trivial change.
I mean, technically of course I understand what you’re talking about, but I still believe that it’s not a very realistic and useful approach to treat gcc and clang like somebody’s pet projects. They are organizations deeply integrated into the industry, having their importance derived from the industry (including you and me) and we, as industry participants, have all rights to complain, demand, negotiate and convince members of those organizations to improve their practices and processes.
I agree, but at a certain point you have to accept that there will always be disagreements, especially around issues such as these, which are subjective.
This debate is not new by any metric, it’s certainly been going on for many years and I believe that both the decisions at the level of the C standard and at the level of the major compilers were taken quite consciously and deliberately, so I don’t think complaining about this indefinitely will do any good.
In fact, I think these issues are usually decided by measuring industry adoption and as far as I can see, the industry has decided to favor performance over security with regards to this specific issue. The only major outliers in this matter seem to be the Linux kernel and Rust, the latter of which being a different language altogether (with a different focus and priorities).
So I think it would be more productive, say, to convince browsers, Linux distros and other major C/C++ software projects to change their default compilation options than to convince compiler developers, who in general prefer to choose defaults that agree with the surrounding software ecosystem. Presumably, at some point the adoption would be enough to tip the scale into favoring -fwrapv by default, and C standard modifications usually follow compiler and industry practice.
This debate is not new by any metric, it’s certainly been going on for many years and I believe that both the decisions at the level of the C standard and at the level of the major compilers were taken quite consciously and deliberately, so I don’t think complaining about this indefinitely will do any good.
Perhaps it is time to take things to a more personal level.
Who specifically is blocking proposals that would define things like signed integer overflow?
Who specifically is letting patches in that makes it impossible to write constant time code, and therefore stops cryptography from being even possible?
I’m not talking about organisations, I want the name of the actual person who made the call. Given their impact, they should be public figures.
To be clear, there is one comment reporting that this option makes code 2x slower in one instance that they checked
I bet my hat this could easily be fixed at the source level. See the differences in generated code, and rework the corresponding source code so the compiler can more easily transform it.
Maybe they could reconsider that decision along the lines that the cryptography people suggest.
So you are asking compiler vendors to not remove code they can prove to be useless?
The compiler removes bounds checks in cases where, before the bounds check happens, something already happened that would be UB if the value were out of bounds… It’s a bug in the code, not in the compiler.
It would be great if the compilers helped to spot this bug instead of silently removing the code though.
Some CPU instructions are known to be variable-time, but most simple instructions are constant-time in CPUs in actual use. One example of this is the recent KyberSlash attack [1], which exploited the variable timing of the x86 DIV instruction for key recovery.
most simple instructions are constant-time in CPUs in actual use
My understanding is that in general, software developers tend to rely on such observed timings but mainstream CPU vendors make no guarantees that these instructions are actually constant-time (outside of certain exceptions like the document I linked to, which requires enabling a special CPU mode, or perhaps older or simpler CPUs).
One of my points is that it’s not safe to make this assumption. There are many reasons for why an instruction of a modern CPU can be faster or slower to execute than expected and this may depend on many non-trivial factors. The other point is related to guarantees in the C language:
One example of this is the recent KyberSlash attack [1], which exploited the variable-time of the x86 DIV instruction for key recovery.
This is a brilliant demonstration of my points: two flawed assumptions were being made here: 1) that compilers would always translate integer division operations into the DIV instruction on x86 (which is not necessarily true), and 2) that the DIV instruction was constant-time (which doesn’t seem to be true either).
It’s important to realize that even if the DIV instruction was constant-time, the C language makes absolutely no guarantee that integer division is constant-time, so C compilers are free to use whatever CPU instructions they want to, including variable-time instructions. The actual CPU instructions that compilers use may even change depending on the surrounding (and even distant) code, compiler versions, compiler platform, available libraries, optimization levels, and even other unrelated compiler flags.
This is even mentioned in the FAQ of that issue:
Does a division in C, C++, Go, Rust, etc. always turn into a division instruction in machine language?
Not necessarily. (…)
And this is not limited to division, it applies to virtually any language construct.
There are many reasons for why an instruction of a modern CPU can be faster or slower to execute than expected and this may depend on many non-trivial factors.
What matters is whether that is data-dependent or not. Division speed is often data-dependent; multiplication has been occasionally data-dependent; variable rotation/shift will be on chips without a barrel shifter. I haven’t heard of a chip where boolean instructions or single-word add/sub are data-dependent, though it’s not impossible.
This is a brilliant demonstration of my points: two flawed assumptions were being made here: 1) that compilers would always translate integer division operations into the DIV instruction on x86 (which is not necessarily true), and 2) that the DIV instruction was constant-time (which doesn’t seem to be true either).
No, that was not the assumption. The assumption was that division by a known constant would be turned into constant-time multiplication plus shift, but when optimizing for size compilers will sometimes choose DIV instead. DIV has been known to be variable-time for a long time. The fix was to do the multiplication dance by hand [1] (though nothing stops the compiler from recognizing the division and still outputting DIV).
Of course the core issue here is the lack of predictability and/or control over what compiler optimizations are performed.
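The “multiplication dance” mentioned above is, roughly, replacing division by a known constant with a multiply by a precomputed reciprocal followed by a shift. A sketch with made-up numbers (dividing 16-bit values by 10; these are not the constants from the Kyber fix, and whether the multiply itself is constant-time is exactly the caveat raised below):

    #include <stdint.h>

    /* Computes x / 10 for any 16-bit x without a division instruction:
       52429 = ceil(2^19 / 10), and the rounding error stays below 1 for
       every x <= 65535, so the truncated result equals x / 10 exactly. */
    static inline uint16_t div10(uint16_t x) {
        return (uint16_t)(((uint32_t)x * 52429u) >> 19);
    }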
No, that was not the assumption. The assumption was that division by a known constant would be turned into constant-time multiplication plus shift, but when optimizing for size compilers will sometimes choose DIV instead.
This doesn’t matter. The timing of multiplication instructions is also known to be data-dependent on certain CPUs (like certain ARM CPUs) and platforms. This is even mentioned in the FAQ.
Relying on multiplication instructions being constant-time is making exactly the same mistake as relying on division instructions being constant-time. The timing of these instructions can vary depending on the exact values of the operands and many other non-trivial factors. There is a reason why CPU vendors don’t make promises about these instructions having data-independent timing.
The fix was to do the multiplication dance by hand [1] (though nothing stops the compiler from recognizing the division and still outputting DIV).
Exactly. And furthermore, even despite compiler optimizations, the timing of multiplication instructions can also be data-dependent, as demonstrated in this security flaw. So that “fix” is not really a guaranteed fix for the issue, in principle it’s just papering over it.
Of course the core issue here is the lack of predictability and/or control over what compiler optimizations are performed.
Indeed, but not just that: it’s also the lack of timing guarantees in modern CPUs. This document by Intel indicates they seem to be trying to address this issue, at least.
But this also needs to be addressed at the compiler or C language level, i.e. some construct needs to be created that forces compilers to use such special CPU timing modes / instructions (and error out if they’re not available), otherwise the only way to use this reliably is by coding in assembly.
Intel’s DOITM only makes things more complicated, seeing that it can only be enabled by writing to an MSR which requires kernel-level access, leaving it to the operating system to choose whether to enable, disable, or expose it to the user. But even in the case where one CPU maker gave guarantees about anything, that still leaves you with every other maker that doesn’t.
But anyway, CPUs are a hell of a lot more stable than compilers when it comes to surprising behavior. You can find some CPUs where multiplications are not constant-time, such as the M3 (and the Intel 486, and the PowerPC 7450, and …), but that’s the exception rather than the rule. Whereas even the same compiler will surprise you across versions, optimization levels, etc.
Relying on multiplication instructions being constant-time is making exactly the same mistake as relying on division instructions being constant-time. The timing of these instructions can vary depending on the exact values of the operands and many other non-trivial factors. There is a reason why CPU vendors don’t make promises about these instructions having data-independent timing.
Okay, great, CPU makes no timing guarantees at all, and we’d be irresponsible to rely on them on this one. Now I’m at the hotel with no other connectivity than their public WiFi, I need to make a purchase, have a private conversation, whatever, I need a secure channel.
Okay, great, CPU makes no timing guarantees at all, and we’d be irresponsible to rely on them on this one.
Good, the first step to fixing a problem is admitting you have one :)
What do you propose?
I propose three things:
1) That modern CPU vendors implement a feature similar to Intel DOIT. But in a way that’s usable in userspace, in a more fine-grained manner.
This feature would allow you to make cryptography safe in assembly.
2) That C compilers implement a function attribute that would force code in that function to be compiled in a way that exclusively uses such instructions (or error out if it’s not possible). This would probably require limiting which language features can be used in such functions.
This feature would allow you to make cryptography safe in implementation-defined C. (A rough sketch of what such an attribute might look like follows below.)
3) Eventually, the feature in 2) is standardized to be an official C language feature.
This feature would allow you to make cryptography safe in standard, portable C.
It would take many years to implement these steps, but I don’t see any other feasible way to achieve the same goals.
Edit: While 1) is not available, you can still implement 2) using normal CPU instructions that are usually constant-time. While not necessarily perfectly safe, it would still be a big improvement compared to the status quo.
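Purely as a sketch of what the attribute proposed in 2) might look like; no such attribute exists in any current compiler and the name is invented:

    #include <stdint.h>

    /* Hypothetical, NOT real GCC/Clang syntax: the idea is that everything
       in this function would have to be lowered to instructions with
       data-independent timing, or compilation would fail. */
    __attribute__((data_independent_timing))
    uint32_t ct_select(uint32_t mask, uint32_t a, uint32_t b) {
        return (a & mask) | (b & ~mask);
    }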
And when 2) is not available, please please do not introduce any change in the compiler toolchain that causes it to insert any branches in code that previously didn’t have any. Also, scour for patches that made those kind of transformation, and revert them.
In fact, I would urge compiler writers to keep those changes reverted whenever 2) does not apply, so that code that is compiled against old standards doesn’t suddenly start breaking.
a) “1188 Haskell packages from LTS Stackage. […] Furthermore, 10,899 (28.0%) were fully desugared into Applicative and Functor combinators and thus would not require a Monad constraint.” b) “The Haxl codebase at Facebook. […] and 7,600 (26.9%) were fully desugared into Applicative and/or Functor.”
Yes. My theory is that the relational model was being implemented (as SQL) at a uniquely unlucky time: before sum types and newtypes got traction in programming languages.
Lack of sum types has led to the entire NULL confusion. Lack of newtypes causes mistakes related to comparing IDs of different entities. There is the idea of a domain in the SQL standard, but somehow it never really made it into production use.
But sadly they suffer from limitations which often make them unsuitable (e.g. no foreign keys). And no sum types, though it does have basic enumerations (which also have issues; IIRC they’re always 4 bytes).
Also using them in practice is a pain. Even basic enums are not well supported by many clients. I have no hope for custom types. Unfortunately many libraries stick to the bare minimum implementation.
The null problem in C/C++ pointers and Java and the JRE is different from NULL in databases. In the former, null is always possible and checking for null is usually not comprehensive because it would clutter the code too much with unnecessary checks. In SQL, columns can be NOT NULL. So NULL in SQL is like the simplest sum type: Option<T>. It’s a design decision in the model to make something nullable.
It’s a design decision in the model to make something nullable.
It does feel like SQL just missed the mark on this, though, because actually it’s a design decision to make something not nullable. The fact that non-nullable is an option is great, but the default feels backwards.
I’m not sure I agree with this. The requirement is “stable numbering for comments…If the comment is removed, we want the numbering for the remaining comments to stay intact.” So sequential numbers on comments aren’t guaranteed to be valid anyway, so why not accept the comment id as your numbering? Later comments are going to have higher comment ids due to how they’re being allocated.
Actually, a better example is very easy, thank you for nudging me to think about that a bit more.
Imagine Github. It has repositories, and in each repository the numbering of issues is expected to be sequential per-repo. You certainly want a stable issue number. You certainly can expect the issues to be deleted.
So sequential numbers on comments aren’t guaranteed to be valid anyway, so why not accept the comment id as your numbering?
The idea is that you have the evidence that the comment was there. And by “sequential” I did not mean “without gaps”, I just don’t know the right words. Like if you have page numbers in the book, and if a page is torn out you know that there was a page. Like a proto-blockchain.
Anyway, the exact business requirement is not much important, it’s an example of path dependence, and most examples are probably debatable per se. Maybe there is a better example for that pattern, I just did not think it up yet.
And by “sequential” I did not mean “without gaps”, I just don’t know the right words.
If you mean that, for any two comments A and B, if B’s time of posting is higher (more recent) than A’s, then B’s number must be higher than A’s, I would say that the comment number is increasing (more specifically, non-decreasing) with respect to the time of posting.
It’s great to point out usability issues with open source software, especially issues that arise in the interactions between multiple projects. However, this article contains a number of somewhat odd observations.
Outside of Element, there are very few clients, most of them very old and not being updated within the last few years.
This is simply false. I currently use Nheko on desktop and FluffyChat on Android, and both are updated regularly. A variety of other options exist.
When I close any client, I have to re-login, re-compare the Emojis, and all chat history is gone.
This is very strange. While it’s obviously bad that this is a possible failure mode, I’ve never heard anyone else report this; I’m curious if others here had the same experience.
With the unique mail-address used to sign up to Matrix, I am unfindable because only Matrix uses this specific mail address.
What does this mean? I have never used an e-mail to look up a Matrix account; everyone I know who uses Matrix just lists their Matrix ID on their social media profiles or website, or communicates it directly. I often have success simply searching for people’s habitual usernames, if we happen to be in the same large rooms.
I very much hope that Matrix and Element eventually turn into something useful.
Statements like this are unnecessarily negative. I have run a Matrix server for a long time - about four years now - and have “non-technical” friends and family members messaging me via Matrix regularly. Several communities I’m in use it heavily as well. It is useful, however much improvement it requires.
I’m glad this post is urging the Matrix team to tighten their ship, but I feel we, as a community, can make that kind of change without being intentionally and excessively negative.
I am unfindable because only Matrix uses this specific mail address.
Matrix does have support for discovery based on third party IDs. Basically you upload some hashed IDs to some service and it tells you related Matrix IDs. I have found like 2 contacts this way.
But this complaint seems a little asinine because:
1. They chose to use a unique ID.
2. You can verify as many emails as you like, so you can just add your “public” email.
I have been operating a couple of Matrix home servers for about five years as well, for personal use and for our work chat at Oxide. There have definitely been issues, it’s not perfect! But while I don’t know this person (thank goodness) I am familiar with the sort of tone in their writing: you can tell when somebody is trying not to enjoy something. I did enjoy the mixture of effusive and even obsequious responses from Matrix folks – anybody who has had customers or a community to manage can immediately sense the exhaustion behind those carefully cheerful words.
you can tell when somebody is trying not to enjoy something.
That may be true, but I think it reflects a very common attitude people are likely to have when signing up. Early on all your users will be there because they’re enthusiastic about distributed chat networks or whatever, but if your service is the slightest bit successful, most signups will be people thinking “oh great, I have to sign up for YET ANOTHER account just to talk to X”. They aren’t going to be in the mood to cut you any slack the way the first group did.
Okay, and if someone’s reaction, while in that mood, is to go on the internet and loudly proclaim “trashfire”, they should be punched in the mouth once for every word they write.
I guess I’ve never experienced this confusion, but I do find that explaining SQL in imperative terms can cause confusion. And some of those quoted tutorials are just sad.
I would explain JOIN in a non-theoretical way as: You get all the rows from the left table, each combined with the matching rows from the right table. If you say LEFT, the nonmatching left table rows will appear once, combined with NULLs. If you don’t, the nonmatching rows won’t appear at all.
Explaining this as an imperative algorithm just invites bad assumptions. The process may or may not be done by running a nested loop — in fact, if that’s what the query plan says, you may have a performance issue to fix — and those rows don’t have to come out in any particular order unless you ask for it.
I would explain JOIN as: You get all the rows from the left table
For me, “all the rows” implies that there are roughly the same number of rows in the output as there were in the first table. In the case where a cross product is triggered (such as ON 1 = 1), there are 3 rows in the left table and 15 rows in the output table.
I wouldn’t say “all the rows” in this case, because that’s five times as many rows as I had.
Ah, I see the problem now! I guess I’m thinking of it in a more theoretical way than I realized. When I say “rows” I really mean “relations”, which don’t have unique identity, they’re just combinations of values. So it doesn’t imply a 1:1 relationship with the output relation. But that “multiplication” behavior is what I meant by “each combined with the matching rows”.
Nit: the rows are called tuples, with a relation being the equivalent of an entire table (a set of tuples (the body) along with the heading (that all the tuples in the body must match)).
I guess I’m thinking of it in a more theoretical way than I realized.
I think that what happens here is that you actually have a correct mental model that you picked up from a text that carefully explained it. So you’re fine.
But maybe when you read another explanation, you don’t really notice that it’s not saying what you know but something completely different. Unless you’re in the proofreader mode, it may be hard to see such things. Here you need to deduce the mental model that the text is trying to establish, but you overcorrect it to your own.
But when somebody who is actually learning this stuff reads the “wrong” text, they read exactly what was said, and that’s the mental model they will have. Maybe at some point they would realize that it’s wrong, but I can see why you can spend MANY years and not really get the fully correct picture. Because most of the queries that they would encounter would actually work! And most of their queries would actually work! It would only break sometimes, but we’re so used to things sometimes mysteriously breaking and then mysteriously getting fixed. Nobody has the energy to dig down every time.
And it’s already hard for the novice: they learned from ChatGPT, then they wrote a query, it breaks, they ask for your advice, you change something in the query, and it works. What’s the chance the novice would start a “debugging session” trying to understand why their query was wrong, why your query is correct, what mental model you both had, etc.
Exactly, it’s really hard for an experienced person to recognize which wrong turn a novice has taken, based on external evidence. One reason teaching is a very different skill from doing! And a reason having a teacher (even occasionally) is better than going it entirely alone if you can manage it. (I just started learning the cello and am in that state right now!)
I’m in a database class right now and kinda know something about this.
A JOIN is a cross product followed by a select. Importantly, a select is not about adding rows to the output but removing ones that shouldn’t be there. It sounds like unimportant trivia but is meaningful if you want a truly generalized explanation that doesn’t fail on edge cases.
For example: if you have a natural join where no column names match: the result is a cross product (because nothing was removed because the select had no condition).
It’s extremely unintuitive and seems like a collection of trivia, but the root of a lot of the confusing behavior comes from join being a cross product followed by a select.
So this statement:
For each row of the second table we call the match function. If it returns true, we add a row to the output. Data from all columns of both tables is available for output.
Is backwards: it wouldn’t add the data, as it would already be in the output (due to the cross product); it would instead not remove it, since it matched the condition. Anything that doesn’t match the condition is removed.
It seems like the same thing, but it violates intuition when you start looking at a JOIN that is missing conditions.
Is backwards: it wouldn’t add the data, as it would already be in the output (due to the cross product); it would instead not remove it, since it matched the condition. Anything that doesn’t match the condition is removed.
Unfortunately, this intuition breaks for left join. Imagine a single-row first table, and empty second table. Their cross product is empty, but the result of any left join will have one row.
Great callout. I’ll poke my instructors about it. One thing they mention is that most RDBMS have some functionality that diverges from the theoretically correct outcome; this might be one excuse they throw out.
I hate how I can’t get to a single unified “why things work the way they do with no edge cases”. I’m quite good at building and retaining frameworks but quite bad at memorizing a series of random trivia.
Unfortunately the Georgia Tech database course and the textbook The Fundamentals of Database Systems are all over the place.
Doesn’t the null row only show up when there are no matches? That means outer join is not monotonic: adding a row to the input can remove a row from the output. And that means it can’t be made of cross product and select, because those are both monotonic.
The definition in @isra17’s comment uses set subtraction, which is non-monotonic. Adding rows to S can add rows to (R join S), which can remove rows from R - (R join S).
No, I’m just extending your model to fit the standard behavior of an outer join.
Although perhaps a better definition would be that it is the inner join (the cross product followed by the select) unioned with the set of tuples from the outer table that don’t appear after the select.
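In relational-algebra notation, that definition of the left outer join reads roughly as follows (hand-waving the NULL padding of the right-hand attributes):

    R ⟕_θ S = (R ⋈_θ S) ∪ ( (R − π_attrs(R)(R ⋈_θ S)) × {(NULL, …, NULL)} )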
Oh, cool! Can we use you as an informant to try and debug how such things happen in an actual university? :)
most RDBMS have some functionality that diverges from the theoretically correct outcome
I wonder what is this about exactly. One thing that comes to mind is maybe NULL semantics? Like, depending on how the request environment is configured. This sounds pretty weaseling out’ish.
why things work the way they do with no edge cases
I think that the description in the post has no edge cases. And INNER JOIN is even easier (there are some other problems with it, mental model-wise, but that’s a separate story).
I think the most commonly cited example is that the relational model specifies that relations should have set semantics, i.e. each tuple (row) can occur at most once in the relation, whereas SQL allows duplicate rows.
Codd himself wrote about this, and you can look up practically any text or video by Chris Date to get (good) rants about what SQL got wrong and why it matters.
I think it’s because LEFT JOIN is a shorthand for LEFT OUTER JOIN, whereas the “a join is a cross product followed by a select” description is for a JOIN, i.e. an inner join. Which is confusing in and of itself (that you can have an outer or an inner without specifying the type exactly).
From the textbook (p264):
A set of operations, called outer joins, were developed for the case where the user wants to keep all the tuples in R, or all those in S, or all those in both relations in the result of the JOIN, regardless of whether or not they have matching tuples in the other relation. This satisfies the need of queries in which tuples from two tables are to be combined by matching corresponding rows, but without losing any tuples for lack of matching values. For example, suppose that we want a list of all employee names as well as the name of the departments they manage if they happen to manage a department; if they do not manage one, we can indicate it
The LEFT OUTER JOIN operation keeps every tuple in the first, or left, relation R in R ⟕ S; if no matching tuple is found in S, then the attributes of S in the join result are filled or padded with NULL values. The result of these operations is shown in Figure 8.12.
So yes, it’s an edge case, but intentionally so (in that they created a new concept for it).
Frustratingly it doesn’t reference this cross product conundrum. I.e. beyond the property that “keeps every tuple in the first, or left,” how does it actually do that behind the scenes?
Frustratingly it doesn’t reference this cross product conundrum. I.e. beyond the property that “keeps every tuple in the first, or left,” how does it actually do that behind the scenes?
Well, wouldn’t you say now that the algorithm description in the post makes it all clear?
This is a very interesting statement; I did not think about it from that angle.
I’m thankful the OP took the time to write this all up afterwards and share.
Just in case: I didn’t write this, I just saw it being shared and felt it could be useful here.
Which is kinda funny, because in modern tech orgs the Security team is often seen sending out test phishing attempts to see if employees fall for them, and then rolling out mandatory security training courses for all employees, and this approach is apparently widely hated.
It’s hard to put it all in one category. That kind of training can vary widely in quality. It also depends on what follows - is it a result for the IT to work on, or a scapegoating opportunity for management.
You probably (and hopefully) don’t intend this as hurtful or shaming, but reactions like this are precisely why stories like these often remain untold, allowing scammers to stay under the radar longer.
Almost no one is inherently stupid but everyone does stupid things occasionally.
There’s a lot of psychological bias at play here. When it happens to you, it’s bad luck, or maybe you weren’t thinking clearly for some reason. When it happens to someone else, it’s easy to think they’re stupid. But I don’t think that’s always valid. I consider myself pretty intelligent, but I’ve done some pretty stupid shit – usually when tired, emotional, or distracted. Scammers are aware of this and exploit it by creating distractions. Telling someone their Google account has been compromised can push them into a panicked “I need to act now” state.
Sorry, I was too mean in the original post. Appreciate the message on this.
My thought was a bit more about how the specific knowledge that leads you to do the whole "look up the phone number and ask them questions" thing … it feels like you would also have the knowledge that Google would not call you in an unprompted manner, like ever. And perhaps "I almost fell for it" is stylized, given how they seemed to have so much suspicion from the get go.
And hey, "Google will never call you"… but maybe this one time is true! It's an easy thing to think.
I am fascinated that I have on several occasions been phoned by Microsoft Support. The real Microsoft Support, no scam involved. I sent in a support question using their form, asking how to do something with AppSource, and I thought it annoying that they required my phone number. Then, not much later, some guy with an Indian accent actually called me and told me that sorry, we don't have that feature, and I apologize so very much, but here are some workarounds. You'd think Microsoft would stop doing this kind of thing to avoid people getting tricked into thinking it's them when it's the scammers.
Another reason I just never answer my phone.
Step 1 was obviously a red flag for them too, but the scammers managed to evade all reasonable attempts at exposing the origin of the call as a scam.
Our default human instinct is to trust that people are telling us the truth. Unless we are already primed, thoughts of deception are usually a secondary follow-on thought.
It’s a lot like cults. Everyone thinks “I’d never join a cult”, but the truth is that the potential exists in everyone. This is an always-open attack surface, for even the most security savvy tech person, and if we’re honest, the sophistication of these attacks has really ramped up in the last few years. The “herd immunity” has not really caught up.
The part I take exception to is the line "[I followed the best practices] of verifying the phone number". The best practice is not to verify the phone number. The best practice is to not take action on any contact you yourself have not initiated, and this difference is why this measure failed to protect him. If he had ended the call, called the verified number*, and then gotten agents to look up his 'case', that would have revealed the deception.
*(yes, I know that getting a hold of google is difficult)
Isn’t CPS basically just generator functions without syntax?
I'm not sure what exactly a generator function is, but for me CPS is just multiple stacks made explicit.
That’s an interesting way of putting it. I think the same description could apply equally well to generators in JS and Python
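To make the analogy a bit more concrete, here is a tiny sketch of continuation-passing style (mine, not from the thread, and in C since that's the language used elsewhere in this discussion): every step hands its result to an explicit continuation instead of returning, which is the "stack made explicit" idea above. Generators add suspension/resumption on top of the same mechanism.

```c
#include <stdio.h>

/* A continuation: "the rest of the computation", waiting for an int. */
typedef void (*cont)(int result);

/* Direct style would be: int add(int a, int b) { return a + b; }
 * In CPS nothing returns; each step passes its result onward. */
static void add_cps(int a, int b, cont k) { k(a + b); }
static void square_cps(int x, cont k)     { k(x * x); }

static void print_result(int r)  { printf("%d\n", r); }
static void then_square(int sum) { square_cps(sum, print_result); }

int main(void)
{
    /* Computes (2 + 3)^2; every intermediate "return" is an explicit call. */
    add_cps(2, 3, then_square);   /* prints 25 */
    return 0;
}
```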
One thing I’ve never understood about 4NF and I can’t find any good explanation: How is it not slower than using an array-typed column? When you want to find out what skills a given employee has, with an array-typed cell you can just look at the value, whereas with 4NF, you have to do a linear search over all pairs of (employee, skill). Tutorials always say “don’t worry about performance”, but don’t explain why.
The explanation is very simple: indexes. In the example above, the actual SQL table would look like this:
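(The snippet this comment refers to isn't preserved in this copy of the thread; below is a hedged sketch of what such a pair table plus index could look like, shown through SQLite's C API so it is runnable. The names employee_skill, employee_id and skill_id are illustrative, not taken from the original comment.)

```c
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db = NULL;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;

    const char *ddl =
        "CREATE TABLE employee_skill ("
        "  employee_id INTEGER NOT NULL,"
        "  skill_id    INTEGER NOT NULL,"
        "  PRIMARY KEY (employee_id, skill_id)"   /* backed by an index; also enforces pair uniqueness */
        ");"
        "CREATE INDEX employee_skill_by_skill"
        "  ON employee_skill (skill_id);";        /* explicit index for skill -> employees lookups */

    char *err = NULL;
    if (sqlite3_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "schema error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}
```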
This sets up two indexes (one index underlying the PK, and another explicit index on skill_id) that together handle all the necessary searches/joins, and the primary key additionally handles uniqueness of each pair.
Here is the best book on indexing: https://use-the-index-luke.com/
I hope @lproven will arrive with references but I am inclined to share my impressions.
The two earlier database models were hierarchical and network. I don’t have a good example of a “network” style but I believe COBOL’s records or the Mumps data structure are decent representatives of the hierarchical model. Even once relational databases started to exist, at the dawn of the PC era they were too heavy for practical use on small PC machines, so early consumer databases like dBase were heavily record-oriented. As a result, it wasn’t uncommon for “database” or “file” or “table” to be somewhat synonymous, and having structured or semistructured data embedded in a field happened frequently. The usual way of working with these early databases was to iterate records and do something with each of them. This is also the era in which the meme “joins are expensive” gets started.
I wouldn’t be surprised if a lot of weird non-relational ideas, such as those presented in the article as “non-4NF” relations, were carryovers from this era. The assumption wasn’t that you were writing SQL queries, the assumption was that you were iterating records and building a report by hand. A new file/table may have been very expensive or computationally difficult to work with, and even when that ceased to be true, the notion carried forward.
One of the first things we learned in RDBMS class was that tables are not inherently expensive, and joins can be cheap. You need to start with this mentality or you’ll produce really weird non-relational designs.
I’ve found a plausible explanation of how this process made sense back when people were migrating from hierarchical to relational databases: https://www.cargocultcode.com/normalization-is-not-a-process/ (“So where did it begin?”).
Update: haha, you and me were discussing this three years ago in https://lobste.rs/s/lliti0/making_sense_1nf_pt_ii And you mention Postgresql arrays too.
The explanation you provided is superb, thank you!
Lol, I guess some things never change :)
There definitely seems to be some kind of “The Beatles are overrated” effect going on, where 4NF was so successful that it became the default, and it’s really hard to imagine the things it was a reaction to, and why anybody ever thought they were a good idea.
I’m not old enough to remember the time before relational databases, but I suspect there was a long time when a “database” meant a single table, and even when a database supported multiple tables, there might have been a time when having an entire table with just two columns felt frivolously small. I’ve definitely worked with CSV files that had almost identical data on consecutive rows with just one field changing, to represent a list of values. And I’ve worked with databases that put a comma-separated list of values in a VARCHAR field. Those structures must have come from somewhere.
I can't substantiate this, but I believe they come from a culture created by databases starting out as spreadsheets (literal paper ones, or digital). Shoehorn that into a file, and you now have historical reasons.
I am not sure that's entirely true. The way this weird composite design is supposed to work would not even look useful as a spreadsheet (or as a hand-written announcement on a piece of paper, or whatever). It is deeply weird.
I was thinking about how much effort it would take to just write the correct code that kept the data coherent in Kent’s five variations (mentioned in the article), and the amount of cases to handle is impressive.
This is of course a purely intellectual game now, like Brainfuck programming, but I'm very much interested in dismantling this, because students' time should be put to a better use…
I agree with the skepticism but will note how very many little extra data-coherency-preserving steps accumulate over the years in paper/simple-digital processes, the kind where a new hire spends weeks learning all the little process intricacies specific to that role.
To me that "convoluted design" looks exactly like what you get when people computerize index-card-based physical archive storage. Physical archives that I have seen (granted, only one, but the people maintaining it were experienced with multiple) were index cards with a set number of columns, some of which contained potentially repeated values (think of key-value pairs with repeated keys). If a card became full, a new one was created which continued the "record". When all of the data was stored in physical archives and digital databases were a new and unknown thing, it does indeed make sense that the first idea was to just replicate the existing data as it is.
Now, digitizing that kind of archive is a surprisingly difficult task. The data entered, and the digitized version of it (via OCR or by hand), will contain errors because of the human operators and because there aren't hard database constraints. So going straight to a schema with strict constraints is practically impossible. And when you haven't seen or heard about strict-schema databases, imagining such a thing requires you to be quite a visionary, and even then the practicality of such a thing sounds dubious at best.
I believe that the common advice of "always store time in UTC" is a bit overplayed. There are many cases where you actually need to store local time (AND a corresponding timezone name), basically whenever you handle any time in the future. Only rarely do you actually need "now + X * 86400 seconds"; most of the time it's "October 10, 2030 at 20:00 German time".
I would say that storing this as an integer: 20301010200000 would even make some sense because it will discard the time-related assumptions.
I rarely store timestamps in the future. I cannot recall the last time I needed that. And for times in the future (like calendar events) I agree, but needing to store such a timestamp is a super rare case.
Leap seconds are another reason not to store future times in UTC, though it seems those might be finally going away. UTC always for past times is probably very reasonable, though.
UTC is for timestamps. When something actually happened.
Rarely for things like “expire in 24 hours”.
I mean, we all understand how many times people have been burned by timestamps stored in local time, so this overshooting advice seems to be a kind of valid precaution.
You can safely ignore leap seconds in basically all cases, e.g. UNIX time does not include them. A day in UNIX time is always 86400 seconds, and in practice clocks are adjusted around leap seconds with leap smearing, or just setting the clock back 1 second at the arranged time. If you actually care about highly precise timing, you have to use specialized libraries that are aware of leap seconds. Most things ignore them, and mostly 1 extra second of real time here or there doesn’t matter.
That 1 second matters even less when you consider that all clocks you have access to through conventional APIs are some degree of wrong—either non-monotonic, unsteady, or both.
That said, I agree UNIX time shouldn’t be used for future dates/times that have any calendar semantics.
The only way to avoid leap seconds is to use TAI internally, and keep a separate table for the UTC offset (kind of like timezones). Not sure what applications need second-level accuracy in the mid-to-long future, though.
I have always just stored times as epoch timestamps and found it infinitely easier. What is the downside?
Epoch time is just another representation of UTC, as far as I’m aware. So if you’re storing future dates then you have the downside described in the OP.
Oh, of course. Thanks.
An epoch timestamp still wouldn’t be able to recompute to an intended date if the underlying timezone/DST/similar assumption changes.
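A minimal POSIX C sketch of the point made in this sub-thread (the code is mine, not from the thread; the event and zone are just the "German time" example above): resolve the stored "wall clock + zone name" to an instant at read time, rather than baking today's UTC offset into the stored value.

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* Stored intent: "2030-10-10 20:00" plus the zone name "Europe/Berlin". */
    setenv("TZ", "Europe/Berlin", 1);
    tzset();

    struct tm event = {0};
    event.tm_year  = 2030 - 1900;
    event.tm_mon   = 10 - 1;
    event.tm_mday  = 10;
    event.tm_hour  = 20;
    event.tm_isdst = -1;            /* let the current tz rules decide DST */

    time_t when = mktime(&event);   /* resolved against today's tz database */
    printf("epoch: %lld\n", (long long)when);

    /* If Germany's offset/DST rules change before 2030, re-resolving the
     * stored wall clock + zone still yields the intended 20:00 local time,
     * whereas a stored epoch would silently drift away from it. */
    return 0;
}
```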
I worked on some open source package because I wanted some functionality for my own project, which is not open source but also not commercial. But as I started contributing, I found a lot of friction in setting up the development environment. So much, in fact, that I felt that just powering through might be too stressful.
So I spent several months fixing the onboarding experience: upgrading to a newer generation of packaging tools, adding proper diagnostics that are only relevant on a clean machine, fixing READMEs, etc., etc.
In parallel I submitted my own changes and I think that both activities were reinforcing each other: I’m not sure if just my own changes would be accepted as readily as when I already demonstrated proof of effort.
My point is that IMO it's a very good focus point for an experienced dev who is not bound by deadlines etc. Work as a sort of icebreaker ship, reduce the barrier to entry somewhat, submit incremental changes. Anyone can add new features, lol; and fixing bugs is usually motivated by people who are affected by that bug. Fixing DX is an activity that's usually in deficit.
No.
Everyone who’s a big fan of these things talks about how you should spend a few hours (8-10 I think?), so I’ve spent about that much time really trying to get a feel for them and I think they’re mostly bad.
Occasionally there’s been some utility in using them to generate shell scripts or simple fragments of Python programs, but there has never been a net benefit in terms of time saved for me when you account for hallucinations sending me off on wild goose chases.
One thing I did realize recently, though, is that the hallucinations end up being way worse than wrong StackOverflow answers simply because of how much confidently incorrect supporting documentation gets generated alongside them.
Here’s an example of something I built with Claude this morning: https://tools.simonwillison.net/svg-render
You can see the prompt sequence I used here: https://gist.github.com/simonw/b06fd62ad4e9f8762ad15cdf17e1be85
Does that help at all?
Not GP but thanks for posting this, it was very useful to see how others handle tasks like this!
I wonder if there is a way to quickly see the actual diff between each version.
On occasion I’ve pasted each version in sequence into a Gist and clicked save just so I can use the “revisions” tab in the Gist interface to get cheap diffs against them. I wish Artifacts had that as a built in option (same request for the new OpenAI Canvas feature, which also lacks visual diffs).
looking forward to the prequel blog post, “how we ended up with 2800 microservices”
I think it basically comes down to having tools/technology that makes adding a new service very low-friction. And by that I mean a new service just gets a huge amount for free: database/queues are available, observability systems are available, tooling for deploying/scaling are available, etc. Scaffolding service boilerplate is just a single command.
There are pros and cons of this architecture of course, but I think the “complexity” argument that is commonly levelled against this architecture is heavily mitigated by using highly consistent technologies (and something we are able to retain by running migrations in the way I describe, where we step from one consistent state to another).
Are those 2800 different microservices or just instances of the same few hundred microservices? A simple search tells me that Monzo has about ~300 engineers, so that’d be roughly 10-to-1 services-to-engineer. I imagine not all of those 300 engineers are service maintainers either so it’s probably even more? How sustainable has it been in your experience?
2800 distinct services.
I think that’s roughly correct.
I haven't really found it to be a significant problem in practice. I've worked at a different company where I was responsible for fewer services (maybe ~10 per team), but it was significantly harder to maintain because we lacked the high degree of consistency that Monzo services have.
Probably the biggest downside of this architecture is the fact that implementing transactions/locking is done at the application layer. But we have good off-the-shelf infra/tools that make this a lot less painful than I’ve seen elsewhere. On the flip side, product teams don’t have to worry about provisioning or maintaining any of the infra/storage themselves.
How do you approach testing flows that cross services? Eg: “Given that I change service C that is used in business flow where service A calls service B which calls service C; ensure the end-to-end functionality is not broken by my change”?
The main technique we use is what we call “acceptance tests”: essentially a test that tests business logic against a whole system. These are run periodically in our staging environment as well as being triggered by deployments.
A technique we have for testing in production (and staging) is "shadow testing": before enabling a feature in production, we'll often run it in "shadow mode", automatically comparing the shadow responses with the actual responses and alerting if there is a discrepancy. We have libraries to make this easier to implement.
These are definitely all clunkier to implement than unit tests, but they're also not something we have nearly as many of as unit tests (they sit higher up the testing pyramid), so we haven't (yet) needed to improve the DX around this.
Also worth noting is that client code imports the generated protobuf server code, so breaking API changes are caught at compile time.
how do you organize your OpenAPI wrt readability? How many endpoints do you have per microservice? Do you have a unified OpenAPI UI that I can read through, or do I need to read it per-microservice?
I’d guess the median number of endpoints per service is around 10, but it’s a long tail distribution.
We use protobuf to specify the API. Clients of a service actually import the generated protobuf Go code, which means we get compile-time checking and also some nice developer experience, like being able to "jump to definition" for RPC calls.
I am not sure I'd accept this answer as a legitimate justification, but I get that once you open that door there's no way back, for better or worse.
I read the title and remarked to myself, when I was at an investment bank they had more than 4500 services tied together with centralized message queuing. Not knowing the “monzo” brand I thought, what are these people up to that they have as many as a bank? stfw. Oh, they are a bank.
I’ve chatted to a few Monzo engs in the past and it basically boils down to blurring the lines between what is a unit of business logic - as willsewell mentioned, their tooling essentially abstracts away a lot of the complexities of inter-service communication in such a way that writing a package that imports another package and calls a function feels remarkably similar to writing a service that calls a function over the network.
Which is, given they have the resource to do that, pretty neat. And I use Monzo daily as my primary bank, it’s rarely let me down!
If you've not seen them, https://qconlondon.com/presentation/mar2023/banking-thousands-microservices and https://archive.qconlondon.com/london2020/presentation/modern-banking-1500-microservices are two talks they've done about it
at this rate, they will be running > 4000 microservices by 2028
This is outside my area of expertise so maybe I missed something, but it kinda seems that this whole post is based on a flawed premise?
That is, the post seems to be assigning blame for timing leaks on C compilers, when AFAIU both the C language and mainstream CPU architecture documentation make no promises about whether the usual language constructs and CPU instructions have any guaranteed timing?
In other words, it seems to imply that timing leaks occur because C compilers introduce branches when AFAIU timing leaks can in theory occur for any reason whatsoever with almost any language construct and CPU instruction.
From a quick search, I found this document which seems to imply that Intel CPUs have a special mode for executing some CPU instructions with timing guarantees, but from what I understood, the approach outlined in the post doesn’t even attempt to use such guaranteed timing modes, instead trying to achieve the same thing with portable code, which seems like a flawed proposition to me…
To make a bad analogy, the approach in the post seems to me like trying to prevent a leaking boat from sinking by relying only on navigation…
IF(!) this is accepted as a given, natural and non-negotiable, then the question arises: how are we, as an industry, supposed to write cryptographic code in practice?
We have two major choices: first, accept that it’s impossible to write crypto code using the commonly available C compilers. Second, attempt to renegotiate with compiler writers their approach to introducing compiler changes.
An argument in favor of that is that “make no promises” part is not a law of physics but a decision made by people. Maybe they could reconsider that decision along the lines that the cryptography people suggest.
I think that’s a false dilemma and that there is a third choice: introduce an implementation-specific (or even standardize it at the C language level) new special mode of compilation for certain C functions (say, enabled with a function attribute), where only certain language constructs are allowed, and which forces C compilers to use a special mode of compilation that is guaranteed to use the special data-independent timing CPU instructions such as those in the Intel document I linked to.
I think this is the most reasonable architecture for achieving the goals that the blog post implies they want to achieve.
It would avoid crippling C compilers in all the rest of (non-cryptographic) C code which doesn’t need those guarantees, but would still allow to write reasonably-portable cryptographic C code that is guaranteed to be free of timing issues.
yes, absolutely! I am actually surprised that this is not available (and I suspect that there is a big can of worms there). But I’m just a bystander.
At least in LLVM I think this would require some substantial new cross-language machinery beyond just the C frontend. The backend doesn’t keep track of constant time properties / data-dependent timing, and many of the issues aren’t C specific (the optimizations can also apply to C++, Rust, etc). For example somebody noticed a few years ago that the x86 backend decides when to emit branch instructions vs cmov based on a cost model, but that can introduce data-dependent timing if it picks the “wrong” one.
That route would mean one of three things that I can think of:
This is pretty bad all around.
Define “crippling”. How much performance would actual programs be leaving on the table? Can you, or anyone else, say confidently enough that the amount is anything beyond utterly negligible? It’s easy enough to imagine measurable performance increases in some micro-benchmarks, but it’s not clear to me that the authors of those “optimisations” have realised that they were increasing the load on the branch predictor, and could very well be increasing branch misprediction as a result, not to mention the possible increase in energy consumption. A tiny effect for sure, but one that might very well offset the benefits.
What’s actually crippling is letting questionable optimisations significantly reduce the applicability of the language. And no, you don’t get to blame the victims with “but but but the standard never said you actually could”, when in reality they successfully did, and compilers started to take that away.
In the end, that’s probably what will drive me away from C. There’s only so much undefined behaviour I can tolerate, only so few guarantees I can survive off on, before I decide that Clang and GCC have gone crazy and I start to run away from anything they touch — which pretty much means all of computing these days, but I’m already tempted to bootstrap everything from FPGA & open source synthesis.
I just want to point out that maybe this was even a "false dilemma", and there is a third choice (though I said "two major choices", not "the two only choices"), but that does not make the first two choices non-existent. We can still choose to accept or choose to renegotiate (and we can also choose to standardize a new pragma, as you point out).
There is a related debate: can the compiler remove range checks due to Undefined Behaviour? Compiler people insist that it “can do anything including rm -rf / yadda yadda yadda”. Security people are kindly asking to stop doing idiotic things and introduce security vulnerabilities.
The logic is I believe the same: some people decided that they can remove range checks explicitly present in the code due to some wording in language standard. They can undecide that.
I think the current (c/c++ compiler) interpretation of what is permissible when “discovering” UB is not reasonable, in any circumstance.
However, there's a much bigger problem in that C and C++ grossly overuse UB - the majority of UB, and certainly the UB where code removal leads to security and correctness problems (w.r.t. the host platform, not the language spec of today), has no business being UB - the C and C++ specs both have "implementation defined" and "unspecified" as mechanisms for describing behaviours that are not consistent across all platforms while also not being UB/"error".
In WG21/C++ we've recently introduced "erroneous behaviour", which is basically a compromise solution - there are people who want things like integer overflow to be an error, but if it's unspecified or implementation defined, then overflow is "correct" and a conforming implementation could not report an error. Hence EB: "this is unspecified/ID, or you can report an error".
I was pleased to see that C is also starting to remove gratuitous UB — “slay some earthly demons”
Speculating from my understanding of the C standard, there are three major obstacles to defining behaviour:
Not all those obstacles will apply to all UB, but overall it’s a fairly tall order.
Undefined Behavior is for "you cannot define any behaviour"; the existence of a single platform where the result of a behavior cannot be defined is not sufficient for the behavior to be UB on all platforms. It's for "you cannot in any reasonable model guarantee an outcome". For example OOB access, dereferencing (int*)random(), etc. has no coherent model. If one piece of hardware has "this operation randomises the cpu state" but every other platform is able to produce a self-consistent outcome, there's no reason to say "this must be UB everywhere".
These are literally the exact reasons implementation defined and unspecified behavior exist. The requirement for ID or unspecified behavior is not "every platform behaves the same way"; it is that a platform has a consistent behavior. E.g. trapping on division by 0 is acceptable, producing a signal value is acceptable - the requirement is just that the division by 0 always behaves the same way. ID and unspecified behavior exist specifically for the purpose of describing behavior that differs between platforms but is not undefinable.
I mean if you introduce a new behaviour definition that drastically changes the language complexity, or introduces widespread security bugs in existing code, in exchange for optimizations that only meaningfully improve performance for a set of benchmarks or small group of very specialized pieces of code, is that a reasonable choice?
Okay, first, I mostly agree with your objections. In fact I’d support them myself: it’s kind of ridiculous that just because something is undefined in ARM it can’t be defined in x86. It’s kind of ridiculous that trapping behaviour isn’t brought into the realm of, if not defined, at least implementation defined or unspecified. And it is utterly ridiculous that (runtime) performance ends up being prioritised against correctness to such an extent.
Now a couple details.
The standard doesn't seem to use them that way, currently. Anytime anything traps, it is put in the undefined realm for whatever reason. The second trapping is a behaviour that they accept to define is the second it can be used for implementation-defined or unspecified behaviour. I would love that to happen. Some work on the exact wording would likely be needed, but I would love stuff like division by zero to stop being undefined like it currently is.
As I said, utterly ridiculous indeed. I can see two biases leading to this anyway: the first is that benchmarks are right there for all of us to see. Performance is easily measured and quantified in very precise ways. It’s even a bit addictive how it plays out, I’ve seen it happen in my own practice when writing my cryptographic library: optimising EdDSA was quite the power trip.
Now we can see the vulnerabilities as well, but then the second bias kicks in: status quo in the letter of the law. See, compiler writers are respecting the standard when they introduce insane exploitation of obscure UB. But it's written in the standard that they're within their rights to do so, and they won't hesitate to blame the programmers for relying on UB almost nobody knew about. It's victim blaming, pure and simple, but with a twist: the law, written by compiler writers, is siding with compiler writers. It feels like "considering this exemption from November 1936 stating that short skirts mean consent, and since we've established yours were showing your calves, the court decides that (1) all charges are dropped and the five accused citizens are free, (2) the court's fee is on you, and (3) we are now pressing charges for indecency."
I mean you’re right: sometimes the law is bad and needs to change.
At first I kinda agreed with "people can undecide that", but I think the problem is that for timing bugs it might be equivalent to "we can't do any optimizations on any code"
Compiler developers are probably simply unaware of when they’re introducing timing vulnerabilities … it can happen anywhere, and new kinds of vulnerabilities are found on a regular basis
So it’s more like a cost-benefit analysis, and yeah probably some people take it overboard and chase speed at any cost
There probably need to be new compiler modes / new spec language introduced
On most platforms, the arithmetic operations cryptography cares about are all constant time. On some platforms multiplication might not be, but overall it's not so bad. Thus, we're asking only one thing from compilers: don't introduce branches where the original code didn't have any. Conditionals, boolean logic, even comparisons are fair game, but numeric shenanigans should stay numeric shenanigans.
But they are aware when they introduce branches where there was none.
Not quite. Timing is very specific: just don’t let information flow from secrets to timings. This means don’t let the timing of any operation depend on any secret. In practice in a modern desktop CPU, this means avoid secret dependant branches, and avoid secret dependant indices. That’s about it.
Now in theory, compilers could indeed rework the computation of indices such that they depend on different sources of information than they did in the source code. My understanding is that those would not optimise anything at all, the fastest way to compute something is not to introduce spurious dependencies. Branches are basically the same: existing branches have existing dependencies, and compilers generally have a vested interest in keeping those dependencies to a minimum, just so the branch could be computed most efficiently.
Really, all compilers have to do here is stop introducing branches where there was none.
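For concreteness, this is roughly the kind of branchless selection idiom being discussed (a generic sketch, not code from the thread or from any particular library): the source contains no secret-dependent branch, and the worry is that a compiler may turn the pattern back into one.

```c
#include <stdint.h>

/* Pick a when bit == 1, b when bit == 0, with no secret-dependent branch
 * in the source. The concern in this thread is that a compiler is free to
 * recognise this pattern and emit a branch or cmov as it sees fit. */
uint32_t ct_select(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (bit & 1);   /* all-ones or all-zeros */
    return (a & mask) | (b & ~mask);
}
```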
Not that regular. And almost all involve a leak from secrets to timings: like speculative execution of code the attacker is not supposed to observe influencing the timing of a process the attacker has access to.
Those would certainly help. Writing constant time selection for instance is a hassle, I'd rather just write my lookup more directly, such as CONSTANT_TIME_LOOKUP(array, index). The bigger problem here is that something, some important thing that used to be possible, is becoming less and less so.
But they're not, you just think they are.
This document, for example, makes it clear that actually having data operand independent timing would require disabling certain CPU features such as data-dependent prefetching and other such features which have a performance impact (and in future processors, increasingly so).
If you are not disabling these features in your code (and you are not, because you can’t, unless you use DOIT on newer Intel CPUs which probably is not even possible to use in userspace yet) then the instructions that you think are constant time, aren’t.
Even then, when using instructions with data-independent timing you might still be vulnerable to other data-dependent side channels such as power, thermal or frequency-based.
It looks like you misunderstand what cryptographic code actually requires. When we write constant time code, we don’t just avoid secret dependant branches, we also avoid secret dependent indices.
The notion was originally popularised by attacks on AES, which was accepted on the bogus notion that memory accesses were constant time. Turned out that caches are a thing, and cache timing attacks could recover an AES Key in milliseconds or less.
One might think the cache was the problem, but that's not quite it: the problem is that the memory access patterns of a naive AES implementation ultimately depend on the secret key. To run securely, an implementation needs to stick to data access patterns that can be guessed from public information only: the length of the text is fair game, the ciphertext is fair game, but the secret key and plaintext are not. And in practice, modern algorithms use data access patterns that only depend on the length of the input text. Off the top of my head I can cite Salsa20, ChaCha20, BLAKE (all versions), SHA-2 and up, Argon2i…
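A hedged illustration of the difference being described (not taken from any real AES implementation; the table contents are placeholder constants): indexing a table with a secret byte leaks through the cache, whereas touching every entry and masking does not.

```c
#include <stdint.h>

static const uint8_t sbox[256] = { 0x63, 0x7c, 0x77 /* ... public constants ... */ };

/* Leaky: which cache line gets touched depends on the secret byte. */
uint8_t lookup_leaky(uint8_t secret)
{
    return sbox[secret];
}

/* Secret-independent access pattern: read every entry, keep the wanted one
 * with a branchless mask. Slower, but the addresses touched no longer
 * depend on the secret. */
uint8_t lookup_ct(uint8_t secret)
{
    uint8_t result = 0;
    for (int i = 0; i < 256; i++) {
        /* mask is 0xFF only when i == secret */
        uint8_t mask = (uint8_t)(((uint32_t)(i ^ secret) - 1) >> 8);
        result |= (uint8_t)(sbox[i] & mask);
    }
    return result;
}
```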
So yeah, the CPU will indeed do prefetching magic and whatnot, and it will indeed use the value of the index, or anything this value depends on, to fetch that data. But if the value of the index (or pointer) doesn’t depend on any secret, then the resulting timings won’t either. They’ll vary for sure, but the only info they’ll leak would have been public to begin with.
Even if the CPU doesn’t depend on the actual value of the index, but on the values it might take on speculatively executed branches, in actual cryptographic code such indices are still secret independent, and the timing information would still only reflect public data.
The prefetching system would need to use a seriously overreaching heuristic to start depending on actual secrets even though there is no data flow from those secrets to the index regardless of branching. In fact, I currently seriously doubt the usefulness of such an overreaching heuristic: whatever it does has to run reasonably fast to even prefetch in time, so it probably makes sense to stick to values the index takes in actually (and speculatively) executed branches.
Most importantly though, there's no way in hell CPU designers aren't already aware of this problem. They have to know that letting such a surprising timing leak slip in that way will inevitably lead to a stream of hard-to-mitigate vulnerabilities. They can't just hide behind "timings aren't part of the ISA, not our fault cryptographers don't know how to read a data sheet". Well, I guess they can, but then they really deserve to be sued.
Let’s get real, outside of smart cards or similar, fancy side channels such as temperature, power, even electromagnetic emissions, only matter to the extent they actually influence timings. The attacker has to access those side channels one way or another. If it can read your sensors, I guess it might exploit them.
Frequency throttling based on temperature however, that’s a hard one. Hopefully it’s low-resolution enough that the power variations resulting from the repeated arithmetic operations of a cryptographic primitive are too low-resolution to leak any meaningful information, but I do confess I am slightly worried about that.
I don’t think I misunderstand, I’ve known about what you wrote for many years already.
What you wrote is only a subset of what is written in this document, for example.
That document also contains other details which you have to get right so as to not leak secret information in your code.
They are, hence why the document I linked to, as well as the Intel DOIT document and other related documents (e.g. frequency side channel mitigations, etc) were created.
They do know. Why do you think Intel DOIT was created in the first place?
Note that what you wrote is one of the prerequisites for using Intel DOIT. That is, on top of avoiding secret-dependent branches and indices, you also have to use something similar to Intel DOIT to avoid secret-dependent timing leaks. Think about it. If that were not true, why would Intel DOIT exist in the first place?
I think you are underestimating how complex modern CPUs are.
Yes, and the temperature is correlated with power usage, so if certain secret keys cause more power usage than others (even if all the code is constant-time), you still have the same problem…
I’ve just read it from start to finish. Excellent document, I learned absolutely nothing. These are just the classics I had to go through writing my cryptographic library. I did like the formalisation in “Secret Independent Runtime”, “Secret Independent Code access”, and “Secret Independent Data access”.
My first guess is, they made questionable trade-offs before they became actually aware of security problems. Hence stuff like Spectre. If now they're doing stuff that is screwing code that is already SIR/SIC/SID, I would suggest they stop. Maybe extend the notions of SIC and SID to speculative execution as well (as far as I know my code and libsodium's are both okay on that front — apart of course from being written in C). But beyond that? Please stop.
Or at least wait for cryptographic accelerators to become ubiquitous, even if it means waiting for 20 years.
Then perhaps you missed the link in the definition of the SIR principle:
If you follow this link, you’ll realize it’s a link to the Intel DOIT document. That is, the “instructions whose latency is invariant to the data values” is not what you thought it was, based on your previous comments.
Many software developers think that instructions such as ADD have data operand-independent timing, but the actual hardware vendor of the CPU is telling you that this is not true (in normal operation), i.e. you have to set a special bit in the CPU that disables certain CPU features, in order to actually access such constant-time instructions.
I agree. I understand that historically (when CPUs were simpler), what you wrote was true. But right now I’m only pointing out the current status of this situation, taking into account the complexity of modern CPUs.
Well, I understand this sentiment, but I don’t think the need for speed will end anytime soon, especially since CPU frequency increases have slowed down considerably (compared to historical trends)…
It’s a basic fact of life that faster / more efficient CPU = way more CPU sales.
Edit: also, this part of the document validates some of the points I’ve brought up and it also offers… well, an interesting suggestion at the end:
They’re insane.
64-bit multiply on a 32-bit CPU I can maybe understand, but addition? They’re insane.
Compiler writers are insane too.
I mean, sorry about the curiosity stopper, but this is all going way, way too far. We should rethink it all from the ground up at this point. Thanks for your time.
This is a common complaint but I think this is already easy to fix if you're so inclined. The trick is to realize that you don't have to rely on the default optimization options of C compilers. There are already C compiler flags that you can enable not only to preserve those checks (e.g. by converting signed integer overflows into defined behavior, say with -fwrapv), but also to automatically introduce checks whenever (say) signed integers overflow (e.g. with -ftrapv) or whenever you try to access buffers out-of-bounds.
The reason you don't see Linux distributions enabling these compiler flags for building C software is because these checks cripple performance, which is also why C compilers behave this way by default in the first place. But again, if you're concerned about this, you can just enable these options when compiling. This is even easier in distributions such as Gentoo or NixOS, in which you can easily choose what compiler flags to use for compilation (at the cost of having to build software locally instead of relying on distro build servers).
The problem is that if you write code that is only “correct” if you use compiler arguments then your code is not conformant C/C++, and on compilers that don’t have that flag your “correct” code may become incorrect.
The real kicker though is the interpretation of what a compiler is allowed to do in the event of UB - because as you migrate to a different compiler you suddenly have entire code paths removed
If you want your code to be portable then you can still write standards-conformant C/C++, and if you want, you can even do signed integer overflow with defined semantics, since it's possible to do it in portable C/C++ (using compiler intrinsics for speed when they're available, and I think the newest version of C also has these as built-in operations).
Then you can additionally use -fwrapv if available, just in case.
No, you cannot write standards conformant C/C++ and then use -fwrapv if it's available, just in case - the whole point of this discussion is that if -fwrapv changes the behavior of your program at all, your code is non-conforming. The problem is the broken interpretation of what "UB" means, especially given the compiler can only actually do any of these optimizations for the specific cases of UB that have no business being UB in the first place. The modern compiler interpretation of UB means that any path, no matter where, after inlining (including LTO), that leads to "UB" - even if the behavior is well defined on the target - can be arbitrarily removed. E.g.:
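The snippet under discussion isn't preserved in this copy of the thread; judging from the follow-up comments (a guard that only prints, followed by a modulo that is UB exactly when the guard fired), it was along these lines. The constant 100 and the message are my guesses, not the original code.

```c
#include <stdio.h>

int f(int i)
{
    if (i == 0)
        printf("i is zero\n");
    return 100 % i;   /* UB when i == 0, on the same path as the printf above */
}
```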
Can be "correctly" compiled to remove the printf. Moreover, if you replace the printf with, say, "myErrorHandler()", that could be removed anyway - only in C11, pretty much a decade after the introduction of "the compiler can say UB never happens", did an actual standard attribute get introduced to make it possible to guarantee the compiler would not be able to remove it.
In this case it is super obvious that it has happened, but it does not need to be, and frequently is not - C's copious overuse of UB, coupled with an adversarial interpretation of UB, means you cannot reasonably say any large codebase is free of UB.
The only way to remove this nonsense interpretation is per-compiler, non-standard, mode flags - the linux kernel already does this specifically because of the endless security vulnerabilities this interpretation results in.
But let's get to the other part of this: when do these "optimizations" actually occur? Only in cases where the UB can be proved, in which case, by the definition the compiler is using (the removal of code paths leading to UB is only valid under the interpretation "any code that causes UB is erroneous, and so can be assumed cannot happen"), it should be producing an error during compilation.
The only place they consistently do not result in security bugs, and produce beneficial performance, is in for loops, and that's the scenario that actually introduced this nonsense interpretation of what could be done in the face of UB. Specifically, a bunch of major compiler/codegen benchmarks use for(int ….) loops, where you can ostensibly optimize in useful ways if you pretend overflow is impossible (constant folding, etc). The fun thing is that this is why C++ at least eventually made only unsigned overflow stop being UB: you can make that change without hurting benchmarks. I'm not sure if C has done this yet.
Yes, in this scenario the point was to try to write conformant C/C++ and the -fwrapv is only there to catch accidentally non-conformant code. This does not mean that your code has to rely on the behavior of -fwrapv, it only means that -fwrapv is there to catch unforeseen bugs (when this option is available).
So in this scenario, your accidentally non-conformant code is incorrect with all compilers, it's just that in those with -fwrapv the consequences would be less serious (if you are to believe what most people say about UB, that is).
Well, consider this slight variation of your code:
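This varied snippet also isn't preserved; based on the reply below, it seems to have been along these lines, with CONFIG_VALUE standing in for whatever compile-time constant the commenter used.

```c
#include <stdio.h>

#define CONFIG_VALUE 0   /* imagine this comes from build-time configuration */

int f(int i)
{
    if (i == CONFIG_VALUE)
        printf("i hit the configured value\n");
    return 100 % i;      /* UB when i == 0 */
}
```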
In this function, when CONFIG_VALUE is zero the compiler is free to remove the printf(), but there's no reason why this should result in a compiler error, because this code may well be 100% correct and never call f() with zero as an argument.
If code such as this would result in a compiler error, it would just produce a bunch of false positives (especially when using macros and inlined code, I suppose) and I don't think you (or any other developer) would like that.
Unsigned overflow is not UB in C and as far as I know, it never was…
This is literally the absurd compiler interpretation of UB we're complaining about: a person has written code explicitly to check for a value, and the compiler is saying that path cannot happen, not because it has proved the value cannot be 0, but because later on the value being 0 would be UB, and therefore an error. So the correct thing to do would be to diagnose (and for most serious projects any diagnostic becomes an error) that this code path is statically determined to lead to an error. If the compiler is unwilling to report an error, it should not pretend that the code cannot happen.
I would 100% take any amount of compiler errors over the compiler introducing vulnerabilities and errors based off a known-false heuristic. More or less every C/C++ project I have worked on compiles with warnings as errors, and more or less every possible warning enabled.
That said, you're right, that much noise would be annoying, and maybe it would get compilers to stop trying to use this absurd definition of UB. The issue is not "that above code should produce an error"; the issue is the compiler has decided to say UB is an error, and therefore any code path that implies later logic would be UB is not possible, with no logic to support the claim.
Ah so this was just an obnoxious C++-ism, and we just have to continue to deal with the gross obsession with retaining it for signed integers in both.
The compiler has to prove that the code it emits (for whatever target) satisfies the semantics of the language that the source code is written in. If that code would have undefined behaviour then the proof for that case is trivial, since any target code would satisfy the semantics.
"The value being 0 would be UB" is not a thing. If there is UB, there is no value after that point. (Sorry, that was badly put). It's correct that the compiler has proved that the value being 0 would have resulted in UB and therefore there are no semantics in that case.
No. Prior to the 2000s, something being UB meant "you don't get guarantees of what happens when you trigger UB"; after the 2000s the behaviour became "UB is an error, and if a path implies UB later on, that path cannot happen".
The semantics of the language say you don’t get a guaranteed outcome after triggering UB, time traveling UB was an invention long after C and C++ had been standardized where the semantics changed from “you don’t get a guaranteed outcome” to “the compiler can claim the existence of UB proves this code path cannot occur”. If a compiler can prove UB occurs, and the compiler is making codegen on the basis that UB is an error, then it should be diagnosed.
You’re just restating the definition I’m arguing is wrong, not justifying it.
The standard was clear that UB imposed no requirements since ’89.
After about that time compilers did start to make more liberal use of transformations whose soundness depended on the fact code exhibiting UB has no semantics. That’s not quite the same thing. The distinction is subtle but, since you seem to be implying that you think compilers are behaving incorrectly, it is important.
The C standard was never particularly clear about “time travelling” UB and perhaps never allowed it. I recently saw a proposal to forbid it (via a non-normative footnote), which I believe was accepted for C ‘26. There may be some cases where GCC/Clang get this wrong although I’ve found it quite difficult to trigger them with current versions. I would be interested to see a counterexample, if you do have any.
In C++ the standard (at least since ‘11, I’m not sure about ‘98/‘03) clearly specifies that there are no requirements for an implementation on an execution of a program which exhibits UB at any point. That’s an explicit allowance for “time travelling UB”. I don’t know if compilers started implementing this before it became part of the standard, but in any case I have also found it very difficult to produce a case showing, with recent versions of available compilers, an instance of this “time travelling” UB.
Edit: sorry, I missed that the example under discussion is actually about “time travel UB”. As I said above, this is explicitly allowed in C++, but at least three compilers in common use (GCC, Clang, MSVC) do not make use of it, at least for this specific example. So far from an “absurd compiler interpretation of UB” it is if anything an example of compilers very clearly not taking full advantage of leeway that is very clearly provided to them.
Seems to me that a large part of the problem is that that is true but so crazy that a lot of people have trouble believing it. Being true but crazy and therefore implausible is common in some of the other fields I’ve worked in but maybe not so common in programming.
I’m sorry, but I don’t get it. Why is that crazy?
If “you don’t get a guaranteed outcome”, what did you expect? Some kind of guaranteed outcome?
If your code is performing UB, then isn’t the result essentially a lack of a guaranteed outcome?
When the compiler can prove that UB will occur, it will show you an error.
But this almost never happens (except in very limited circumstances like when the values are constant) because the compiler almost never can prove that UB occurs.
This example code in my comment shows what I mean.
Here, the compiler cannot prove that f() will be called with zero as an argument. But it will still remove that "if" check when CONFIG_VALUE is zero, because it knows that this will have no effect on the program, since if i was zero, the program would crash [1].
UB would only be performed if f() is called with zero as an argument. But the surrounding code may never call f() with zero as an argument, so the code may well be 100% correct, which means it doesn't make sense to emit a diagnostic (this would lead to very annoying false positives in all sorts of code, including almost all code containing signed arithmetic).
[1] Strictly speaking, it won't remove the check because the C compiler cannot prove that printf() won't have an effect on the program, but for argument's sake, imagine that printf() was a side-effect-free function which the compiler could observe.
But the code path does not lead to an error. Remember, the surrounding code never calls f() with zero as an argument, so the code is correct. In other words, i != 0 is a precondition for the f() function.
Ok, here's a different example that is possibly easier to consider:
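This example also isn't preserved; reconstructed from the surrounding comments (a 1000-element buffer, a >= 1000 guard that calls doSomething(), then an indexing expression), it was presumably something like the following. The function and array names come from the discussion; the rest is my guess.

```c
void doSomething(void);      /* defined elsewhere; opaque to this translation unit */

static int buffer[1000];

int lookup(int n)
{
    if (n >= 1000)
        doSomething();       /* e.g. report the out-of-range index */
    return buffer[n];        /* UB whenever n >= 1000 (or n < 0) */
}
```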
The exact same logic that allows the compiler to remove the 0 check earlier allows it to remove the >= 1000 check in this code.
Do you think this is reasonable?
In general, that’s not true.
doSomething() might contain an infinite loop (that does I/O), or it might call exit(), or abort(), etc. – so C compilers cannot remove the >= 1000 check in this code because it would affect the behavior of the code.
It would only be true if doSomething() does something "inconsequential" (e.g. if it's a pure function) and the compiler chooses to inline its code, including any functions it calls (recursively).
Although, now that I realize it, the same issue exists with the example I gave, but I guess my point would still be valid if you replace printf() with some other code that the compiler can fully observe and that doesn't affect the control flow.
Let's assume that doSomething() does something "inconsequential", i.e. that would not affect the control flow, etc., and that the C compiler chooses to inline it. Or replace doSomething() with such inconsequential code.
Then the answer is "yes", I think it's reasonable to remove the check because it would lead to a crash anyway, and therefore preserving the check would not affect the behavior of the code.
Infinite loops are UB, so the compiler is free to assume all functions complete (and return, prior to the introduction of _Noreturn in C11); this is why it can remove the (i == 0) branch in the % i example. Which is my point: if the code can be removed in the % 0 case, it is just as conforming to remove it in the >= 1000 example.
It's not a matter of "does the function have side effects" or "can it be inlined"; the definition of what is permissible with UB (according to compilers) is that the fact that % i is undefined if i is 0, or [i] is undefined if i >= 1000, means the compiler can claim it has proved the condition cannot occur. It is not a matter of "does the function do anything, or have side effects", because the existence of the UB "proves" i cannot be 0, or >= 1000 (depending on the example).
Again, we don't need to, because it cannot be called, because for it to be called requires that i has a value that leads to UB, which "proves" i cannot have that value.
That's literally the problem we're complaining about: the compiler implementations decided in the 2000s (the earliest example of this interpretation of UB is 2005) that rather than saying "we cannot guarantee what will happen if you trigger UB" they could switch to "if we discover UB we can claim that it proves a path does not happen".
No, they’re not. Infinite loops are well-defined behavior in C.
There is a clause introduced in C11 which states that a loop which does not do I/O, does not access volatile objects and does not perform synchronization or atomic operations in its body may be assumed to terminate, but this is still well-defined behavior.
No, there is no undefined behavior if doSomething() calls exit(), for example. The return buffer[n] statement would never be reached.
So the compiler is not allowed to remove that check.
To be honest, I think you are very confused about what UB is and how it works… while UB can sometimes have surprising consequences, it’s not as big of a demon as it’s been made out to be, lately.
You’re right, I don’t know anything about UB, just the fallout of the invention of proof-by-UB.
You could argue that the real issue is the gross abuse/overuse of UB in C/C++ which exposes more or less all non-trivial code to UB, but the reality is that none of the proof-by-UB driven optimizations apply to any of the cases of legitimate use of UB in C/C++.
But I also accept that at this point your opinion is that proof-by-UB is legitimate, and we’re never going to agree. My opinion is that the compiler should not be an adversary, and proof-by-UB, especially when UB is so trivial, is adversarial.
I’m sorry but I don’t even understand what you’re saying. What the heck is “legitimate use of UB in C/C++”?
If your code performs UB, then it doesn’t have defined semantics. It is a bug and as far as the C language is concerned, it is basically equivalent to a crash. You’re asking the computer to do something invalid.
So there is no such thing as “legitimate use of UB in C/C++”.
Sure, I agree.
The compiler is not doing anything other than optimize your code. If you think the compiler is being adversarial because it removes code that would have no significant effect on your program, then I don’t know what to tell you…
It is a stunning example of unbelievably terrible/confusing grammar/writing :D
What I was meaning was legitimate use of “undefined behavior” in the language specification, not “legitimate use of UB in code”, because as you say that would be nonsense :D
i.e. the C and C++ spec both use UB in a wide array of cases where the behavior should be either unspecified or implementation defined, and has no legitimate reason to use UB.
Then I have a solution for you: pretend that the C language specification says “unspecified” or “implementation defined” instead of UB.
As long as you’re not looking to rely on implementation-defined behavior (which would make your code non-portable), the effect is the same from the developer’s point of view.
Literally the whole point of this entire thread is that you can’t do that specifically because of the “compilers can optimize on the assumption the UB cannot happen”.
Take the canonical example (note that this is not a time-traveling UB scenario, this is just a UB vs. ID/unspecified example):
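The canonical snippet itself is missing from this copy of the thread; the usual form of it is a wrap-around check that the compiler may fold away because signed overflow is UB (this reconstruction is mine, not the commenter's exact code):

```c
int would_overflow(int x)
{
    /* Intent: detect whether x + 1 wraps around. Because signed overflow
     * is UB, the compiler may assume it never happens, conclude that
     * x + 1 < x is always false, and delete the check entirely. */
    if (x + 1 < x)
        return 1;
    return 0;
}
```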
The only reason a compiler can remove this check is explicitly because overflow is UB; you cannot "pretend it is unspecified/ID". If overflow is unspecified or implementation defined, the compiler cannot just pretend it does not happen.
The semantics of an operation being specified as UB vs. unspecified or ID are completely different.
I think that is true for C++, but not necessarily for C. In any case, it's worth noting that neither GCC nor Clang (14.1/18.1.0 respectively) remove the call to printf() in this example, for either language, at -O3 or at -Os.
I would turn on a warning about possible division by zero. I love compilers that help me avoid mistakes.
Do you realize that this would produce a warning for any code that contains a division by a variable?
Extend the same thing to the other arithmetic operators and every time you use a variable in signed arithmetic (addition, multiplication, etc), you’d have a warning due to possible overflow.
I think that seems a bit too much especially when you can’t prove to the compiler that no mistake would actually happen.
It should only produce a warning when I have not proved to the compiler that the divisor is nonzero with a test or an assertion.
It’s fine if you think it might be too annoying for you (so don’t turn it on) and it’s fine if you think it might be too annoying for other arithmetic operators (I probably agree) but nevertheless I think it would have helped me avoid bugs.
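For concreteness, a sketch of the behaviour being asked for; the warning itself is hypothetical and no particular compiler flag is implied:

```c
#include <assert.h>

int f(int a, int d)
{
    return a / d;       /* divisor not proved nonzero: the hypothetical warning fires */
}

int g(int a, int d)
{
    if (d == 0)
        return 0;
    return a / d;       /* divisor proved nonzero by the test: no warning */
}

int h(int a, int d)
{
    assert(d != 0);
    return a / d;       /* divisor established nonzero by the assertion: no warning */
}
```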
As I understand it, the logic in the compiler is more like the optimizations assume UB never happens. This is obviously false, therefore the optimizations are unsound, therefore they easily produce complete nonsense.
My uncharitable view is that UB is used as a shortcut so that optimizations are easier to write because they don’t have to consider the tricky cases. If the compiler writers are aiming to make optimization easier, they aren’t going to bother proving that UB might happen before exploiting the consequences. The nonsense compilations are capricious because the compiler deliberately fails to understand the program.
Your understanding is not quite correct. The compiler acts according to the principle (which is explicitly stated in the language standard, so it’s more than an assumption) that, if UB has happened, there are no requirements that any particular behaviour should be observed.
One consequence of this is that a compiler can eliminate from consideration, when deciding what target code to emit, any code paths that necessarily exhibit UB. People do often describe this as “assuming the UB never happens” but technically that is wrong. The compiler does not assume that UB never happens, it just applies the fact that there are no requirements for behaviour to cases where it can prove that UB does happen.
Compilers do not deliberately misunderstand programs. This claim implies that you think the compiler should have an understanding of the language that differs from what the very document describing the language’s semantics says it should have. Different compilers trying to guess at what a programmer might have intended is a path to madness, and is exactly why the standard exists.
I don’t think the compiler writers can claim that they are not deliberately misunderstanding the program when they wrote the specification that says they are allowed to misunderstand the program.
The standard doesn’t say that implementations are allowed to misunderstand the program.
For optimizations around things like integer ranges, it may be doable in many cases for the compiler to prove things rather than assume things. But this view really starts to break down when you consider UB from stray pointers.
For example, let’s say you’re writing a function that takes a couple of pointers as parameters. One of them points to an input array, and the other is to an output array. The optimizer would like to be able to perform common subexpression elimination, hoist expressions out of loops, re-order things across loop boundaries for vectorization, split or fuse loops, cache values from the input array in registers, etc.
These are all totally doable, highly profitable optimizations, as long as the input and output pointers do not alias. But if writes into the output array might change the input array, suddenly the optimizer’s hands are tied behind its back. For example, those writes might be in between two otherwise-equivalent expressions, which must both be computed just in case. In a loop that writes to the output array, expressions that would ordinarily be identical across every iteration must now be performed repeatedly, just in case. Many values will have to be reloaded from memory over and over, just in case.
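A minimal sketch of the kind of function being described (names made up); the second version uses restrict, which comes up a little further down, to promise the compiler the pointers don’t alias:

```c
/* Without extra information, the stores to out[i] might alias in[0], so the
   compiler must reload in[0] on every iteration, just in case. */
void scale(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[0] * in[i];
}

/* With restrict the pointers are promised not to alias, so the compiler may
   hoist the in[0] load out of the loop and vectorize more aggressively. */
void scale_restrict(const float * restrict in, float * restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[0] * in[i];
}
```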
Things get even worse when you go beyond bare arrays. For example, if you took two std::vector<int>&s, can the data pointer of one alias with the length or capacity of the other? Of itself?

C and C++ try to address this problem with type-based alias analysis, aka “strict aliasing,” where pointers of one type are assumed not to alias with pointers of incompatible type. This is annoying in the tiny fraction of code where you’re trying to mess with byte-level representations, but there are workarounds there, and it’s otherwise an easy rule to understand and follow. Expecting the compiler to prove this is unreasonable: it would require intractable analysis and whole-program compilation.
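For the byte-level case, the usual workaround is memcpy, which compilers typically lower to a plain move; a sketch:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float without violating strict aliasing.
   (Writing *(uint32_t *)&f would break the strict-aliasing rules.) */
uint32_t bits_of_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```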
And it’s still not enough: Fortran is famously better than C at this kind of array crunching because it instead has a rule that function parameters shouldn’t alias. This means even if the input and output arrays have compatible types, the optimizer can still perform these transformations.
(Of course C also has restrict to cover the cases that strict aliasing does not. But this is merely a question of default syntax: the compiler can’t verify that restrict is used correctly any more than it can verify that strict aliasing rules are followed. If you want the compiler to verify this kind of thing you need to add something like Rust’s lifetimes to the type system.)

I think it would be safer for compiler writers to provide tools to the programmer so the programmer can explicitly allow the compiler to perform optimizations that the compiler cannot otherwise prove to be sound.
Yes, I also prefer this approach in a vacuum- you can get some nice properties out of it, as seen in e.g. Rust. It is quite a deep change to the language, though, and compiler writers still need to do something useful with all the C and C++ code we’re still using.
C code that exhibits UB does not have well-defined behaviour on any target. The behaviour of C is defined quite specifically in terms of an abstract machine. The implementation (compiler) generates code on a real target to satisfy the semantics of the program on that abstract machine, and if UB is invoked there are no such semantics to satisfy.
Unsigned overflow is well-defined in C since at least C99. I’m not sure unsigned overflow was ever undefined, but I haven’t checked the oldest standards.

Yeah, I had to look this up; it turns out C++ just had what is arguably a spec error in not matching C with respect to unsigned overflow. That doesn’t negate the point, though: there’s no reason for signed overflow to be undefined except that it helps benchmarks.
Agreed, I myself am quite angry at this integer overflow madness.
But you can’t rely on it being constant time, while 10 years ago you mostly could. The standard didn’t consider it, and now compiler writers are introducing regressions left and right because the “law” says they can.
I do. I ship C libraries (okay, just the one), and as such can’t rely on my users to activate those flags.
Even if I tried and wrote in big bold blinking letters that the compilation flags used in my build system must absolutely be reproduced, this would still be a violation of the least astonishment principle. If I say I wrote my thing in C, users will likely expect that it is actually written in C, and treat any deviation from strict conformance as a bug, as well as shout on all their forums about how irresponsible I am for shipping code that doesn’t follow the standard. I mean, I already got some shit for relying on implementation behaviour, even though (i) no one could name a platform that behaved differently, and (ii) the exact same choice had already been made by DJB himself.
Then write standard C and treat any deviation from strict conformance as a bug.
Then, additionally, use -fwrapv in your build system (well, you can use -ftrapv, UBSAN, etc., for running tests) and recommend in your README/INSTALL documents that users also use that flag. The point in this case is not to rely on -fwrapv, but just to use it when it’s available, to diminish the consequences of bugs in your code.

Judging by the amount of debate that has been happening around this, there seem to be different assessments of the overall situation. I’m pretty sure that everyone involved is actually aware of -fwrapv, so apparently there is more to it than that.

Btw, previous discussion of range checks: https://lobste.rs/s/tp6nxh/gcc_undefined_behaviors_are_getting_wild#c_4hwdzz
When searching for comments in that thread that mention -fwrapv, the only remaining complaint I see is whether -fwrapv should be enabled by default or not. To be clear, there is one comment reporting that this option makes code 2x slower in one instance that they checked, so this option is not free: the cost vs. benefit of enabling this option by default is subjective, and depends on how much you value security over performance.

But again, you can create your own Linux distro or variant that enables this option. As I mentioned, some Linux distros already allow you to enable this option globally (at the cost of having to build software locally). And judging from one comment in that thread, certain software such as the Linux kernel already enables this option by default, and there are open bugs for Firefox and Chrome to enable this option as well (at least 1 year ago, apparently). If you’re concerned about this, it might be worthwhile to check those bugs to see whether they have enabled this option (or why they haven’t).
Yes! Or you could advocate for compiler writers to rethink their choices.
Good luck with that :) As I said, the decision is subjective, so not everybody agrees with you. Another suggestion for taking matters into your own hands instead of complaining about someone else’s projects: you can fork gcc and clang, change -fwrapv to be enabled by default, publish the forks with names such as gcc-secure/clang-secure (although, check for trademarks first) and otherwise keep them in sync with upstream.

If you really think your forks will be better than the original projects, people will adopt them and they will either supplant the original projects over time, or force the original projects to make the same change. That’s how free software works, right? It would be similar in spirit to certain aspects of what happened with the whole gcc, pgcc and EGCS saga, only it would be 100x easier to do since you’d only be making a trivial change.
I mean, technically of course I understand what you’re talking about, but I still believe it’s not a very realistic or useful approach to treat gcc and clang like somebody’s pet projects. They are organizations deeply integrated into the industry, deriving their importance from the industry (including you and me), and we, as industry participants, have every right to complain, demand, negotiate and convince members of those organizations to improve their practices and processes.
I agree, but at a certain point you have to accept that there will always be disagreements, especially around issues such as these, which are subjective.
This debate is not new by any metric, it’s certainly been going on for many years and I believe that both the decisions at the level of the C standard and at the level of the major compilers were taken quite consciously and deliberately, so I don’t think complaining about this indefinitely will do any good.
In fact, I think these issues are usually decided by measuring industry adoption and as far as I can see, the industry has decided to favor performance over security with regards to this specific issue. The only major outliers in this matter seem to be the Linux kernel and Rust, the latter of which being a different language altogether (with a different focus and priorities).
So I think it would be more productive, say, to convince browsers, Linux distros and other major C/C++ software projects to change their default compilation options than to convince compiler developers, who in general prefer to choose defaults that agree with the surrounding software ecosystem. Presumably, at some point the adoption would be enough to tip the scale into favoring -fwrapv by default, and C standard modifications usually follow compiler and industry practice.

Perhaps it is time to take things to a more personal level.
I’m not talking about organisations, I want the name of the actual person who made the call. Given their impact, they should be public figures.
I bet my hat this could easily be fixed at the source level. See the differences in generated code, and rework the corresponding source code so the compiler can more easily transform it.
So you are asking compiler vendors to not remove code they can prove to be useless?
The compiler removes bounds checks in cases where, before the bounds check happens, something already happened that would have been UB if the index were out of bounds… It’s a bug in the code, not in the compiler.
It would be great if the compilers helped to spot this bug instead of silently removing the code though.
Some CPU instructions are known to be variable-time, but most simple instructions are constant-time on CPUs in actual use. One example of the former being exploited is the recent KyberSlash attack [1], which used the variable timing of the x86 DIV instruction for key recovery.
[1] https://kyberslash.cr.yp.to/
My understanding is that in general, software developers tend to rely on such observed timings but mainstream CPU vendors make no guarantees that these instructions are actually constant-time (outside of certain exceptions like the document I linked to, which requires enabling a special CPU mode, or perhaps older or simpler CPUs).
One of my points is that it’s not safe to make this assumption. There are many reasons for why an instruction of a modern CPU can be faster or slower to execute than expected and this may depend on many non-trivial factors. The other point is related to guarantees in the C language:
This is a brilliant demonstration of my points. Two flawed assumptions were being made here: 1) that compilers would always translate integer division operations into the DIV instruction on x86 (which is not necessarily true), and 2) that the DIV instruction was constant-time (which doesn’t seem to be true either).
It’s important to realize that even if the DIV instruction was constant-time, the C language makes absolutely no guarantee that integer division is constant-time, so C compilers are free to use whatever CPU instructions they want to, including variable-time instructions. The actual CPU instructions that compilers use may even change depending on the surrounding (and even distant) code, compiler versions, compiler platform, available libraries, optimization levels, and even other unrelated compiler flags.
This is even mentioned in the FAQ of that issue:
And this is not limited to division, it applies to virtually any language construct.
What matters is whether that is data-dependent or not. Division speed is often data-dependent; multiplication has been occasionally data-dependent; variable rotation/shift will be on chips without a barrel shifter. I haven’t heard of a chip where boolean instructions or single-word add/sub are data-dependent, though it’s not impossible.
No, that was not the assumption. The assumption was that division by a known constant would be turned into constant-time multiplication plus shift, but when optimizing for size compilers will sometimes choose DIV instead. DIV has been known to be variable-time for a long time. The fix was to do the multiplication dance by hand [1] (though nothing stops the compiler from recognizing the division and still outputting DIV).
Of course the core issue here is the lack of predictability and/or control over what compiler optimizations are performed.
[1] https://github.com/pq-crystals/kyber/commit/dda29cc63af721981ee2c831cf00822e69be3220
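For readers who haven’t seen the trick: the “multiplication dance” means replacing division by a known constant with a multiply and a shift. A generic sketch (not the actual Kyber fix; the constant below is only exact for the stated range):

```c
#include <stdint.h>

/* Computes x / 10 for 0 <= x < 16384 using only a multiply and a shift,
   instructions that are constant-time on most CPUs in practice (though,
   as noted elsewhere in the thread, vendors rarely guarantee this). */
static uint32_t div10(uint32_t x)
{
    return (x * 6554u) >> 16;   /* 6554 = ceil(2^16 / 10) */
}
```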
This doesn’t matter. The timing of multiplication instructions is also known to be data-dependent on certain CPUs (like certain ARM CPUs) and platforms. This is even mentioned in the FAQ.
Relying on multiplication instructions being constant-time is making exactly the same mistake as relying on division instructions being constant-time. The timing of these instructions can vary depending on the exact values of the operands and many other non-trivial factors. There is a reason why CPU vendors don’t make promises about these instructions having data-independent timing.
Exactly. And furthermore, even despite compiler optimizations, the timing of multiplication instructions can also be data-dependent, as demonstrated in this security flaw. So that “fix” is not really a guaranteed fix for the issue, in principle it’s just papering over it.
Indeed, but not just that: it’s also the lack of timing guarantees in modern CPUs. This document by Intel indicates they seem to be trying to address this issue, at least.
But this also needs to be addressed at the compiler or C language level, i.e. some construct needs to be created that forces compilers to use such special CPU timing modes / instructions (and error out if they’re not available), otherwise the only way to use this reliably is by coding in assembly.
Intel’s DOITM only makes things more complicated, seeing that it can only be enabled by writing to an MSR which requires kernel-level access, leaving it to the operating system to choose whether to enable, disable, or expose it to the user. But even in the case where one CPU maker gave guarantees about anything, that still leaves you with every other maker that doesn’t.
But anyway, CPUs are a hell of a lot more stable than compilers when it comes to surprising behavior. You can find some CPUs where multiplications are not constant-time, such as the M3 (and the Intel 486, and the PowerPC 7450, and …), but that’s the exception rather than the rule. Whereas even the same compiler will surprise you across versions, optimization levels, etc.
Okay, great, CPU makes no timing guarantees at all, and we’d be irresponsible to rely on them on this one. Now I’m at the hotel with no other connectivity than their public WiFi, I need to make a purchase, have a private conversation, whatever, I need a secure channel.
What do you propose?
Good, the first step to fixing a problem is admitting you have one :)
I propose three things:
This feature would allow you to make cryptography safe in assembly.
This feature would allow you to make cryptography safe in implementation-defined C.
This feature would allow you to make cryptography safe in standard, portable C.
It would take many years to implement these steps, but I don’t see any other feasible way to achieve the same goals.
Edit: While 1) is not available, you can still implement 2) using normal CPU instructions that are usually constant-time. While not necessarily perfectly safe, it would still be a big improvement compared to the status quo.
And when 2) is not available, please please do not introduce any change in the compiler toolchain that causes it to insert any branches in code that previously didn’t have any. Also, scour for patches that made those kind of transformation, and revert them.
In fact, I would urge compiler writers to keep those changes reverted whenever 2) does not apply, so that code that is compiled against old standards doesn’t suddenly start breaking.
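As a concrete example of what relying on “normal CPU instructions that are usually constant-time” looks like, here is the classic branch-free selection idiom, sketched in C; note that it still depends on the compiler not reintroducing a branch, which is exactly the concern above:

```c
#include <stdint.h>

/* Returns a if bit == 1, b if bit == 0, without branching on the (secret) bit. */
static uint32_t ct_select_u32(uint32_t a, uint32_t b, uint32_t bit)
{
    uint32_t mask = (uint32_t)0 - bit;   /* all ones if bit == 1, all zeros if bit == 0 */
    return (a & mask) | (b & ~mask);
}
```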
This reminded me of “Desugaring Haskell’s do-Notation into Applicative Operations” (2016) by Simon Marlow et al.: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/desugaring-haskell-haskell16.pdf
To me the answer is clear: our DBs desperately need sum types/tagged unions/rust-style enums.
(And they need to be structurally typed, like in Roc, or Ocaml’s polymorphic unions, just as structs/rows are structurally typed in SQL)
Yes. My theory is that the relational model was being implemented (as SQL) at a uniquely unlucky time: before sum types and newtypes got traction in programming languages.

The lack of sum types has led to the entire NULL confusion. The lack of newtypes causes mistakes related to comparing IDs of different entities. There is the idea of a domain in the SQL standard, but somehow it never really caught on in practice.
Postgres does have domains, and custom types.
But sadly they suffer from limitations which often make them unsuitable (e.g. no foreign keys). And no sum types, though it does have basic enumerations (which also have issues; IIRC they’re always 4 bytes).

Also, using them in practice is a pain. Even basic enums are not well supported by many clients. I have no hope for custom types. Unfortunately many libraries stick to the bare minimum implementation.
The null problem in C/C++ pointers and Java and the JRE is different from NULL in databases. In the former, null is always possible, and checking for null is usually not comprehensive because it would clutter the code too much with unnecessary checks. In SQL, columns can be NOT NULL. So NULL in SQL is like the simplest sum type: Option<T>. It’s a design decision in the model to make something nullable.

It does feel like SQL just missed the mark on this, though, because actually it’s a design decision to make something not nullable. The fact that non-nullable is an option is great, but the default feels backwards.
If I could improve only one thing in SQL, this would be it.
I’m not sure I agree with this. The requirement is “stable numbering for comments…If the comment is removed, we want the numbering for the remaining comments to stay intact.” So sequential numbers on comments aren’t guaranteed to be valid anyway, so why not accept the comment id as your numbering? Later comments are going to have higher comment ids due to how they’re being allocated.
Actually, a better example is very easy, thank you for nudging me to think about that a bit more.
Imagine Github. It has repositories, and in each repository the numbering of issues is expected to be sequential per-repo. You certainly want a stable issue number. You certainly can expect the issues to be deleted.
The idea is that you have the evidence that the comment was there. And by “sequential” I did not mean “without gaps”, I just don’t know the right words. Like if you have page numbers in the book, and if a page is torn out you know that there was a page. Like a proto-blockchain.
Anyway, the exact business requirement is not that important; it’s an example of path dependence, and most examples are probably debatable per se. Maybe there is a better example for that pattern, I just haven’t thought it up yet.
If you mean that, for any two comments A and B, if B’s time of posting is higher (more recent) than A’s, then B’s number must be higher than A’s, I would say that the comment number is increasing (more specifically, non-decreasing) with respect to the time of posting.
It’s great to point out usability issues with open source software, especially issues that arise in the interactions between multiple projects. However, this article contains a number of somewhat odd observations.
This is simply false. I currently use Nheko on desktop and FluffyChat on Android, and both are updated regularly. A variety of other options exist.
This is very strange. While it’s obviously bad that this is a possible failure mode, I’ve never heard anyone else report this; I’m curious if others here had the same experience.
What does this mean? I have never used an e-mail to look up a Matrix account; everyone I know who uses Matrix just lists their Matrix ID on their social media profiles or website, or communicates it directly. I often have success simply searching for people’s habitual usernames, if we happen to be in the same large rooms.
Statements like this are unnecessarily negative. I have run a Matrix server for a long time - about four years now - and have “non-technical” friends and family members messaging me via Matrix regularly. Several communities I’m in use it heavily as well. It is useful, however much improvement it requires.
I’m glad this post is urging the Matrix team to tighten their ship, but I feel we, as a community, can make that kind of change without being intentionally and excessively negative.
Matrix does have support for discovery based on third party IDs. Basically you upload some hashed IDs to some service and it tells you related Matrix IDs. I have found like 2 contacts this way.
But this complaint seems a little asinine because:
I have been operating a couple of Matrix home servers for about five years as well, for personal use and for our work chat at Oxide. There have definitely been issues, it’s not perfect! But while I don’t know this person (thank goodness) I am familiar with the sort of tone in their writing: you can tell when somebody is trying not to enjoy something. I did enjoy the mixture of effusive and even obsequious responses from Matrix folks – anybody who has had customers or a community to manage can immediately sense the exhaustion behind those carefully cheerful words.
That may be true, but I think it reflects a very common attitude people are likely to have when signing up. Early on all your users will be there because they’re enthusiastic about distributed chat networks or whatever, but if your service is the slightest bit successful, most signups will be people thinking “oh great, I have to sign up for YET ANOTHER account just to talk to X”. They aren’t going to be in the mood to cut you any slack the way the first group did.
Okay, and if someone’s reaction, while in that mood, is to go on the internet and loudly proclaim “trashfire”, they should be punched in the mouth once for every word they write.
It’s possible that people just give up at that point and never bother reporting it. I’d certainly give up.
Just read the book!
Kind of kidding, but kind of not. The definitions are all in there, though it is rather long.
If you open Google and search for “how does left join work”, I wonder how far you need to paginate before you reach this PDF or anything similar.
I guess I’ve never experienced this confusion, but I do find that explaining SQL in imperative terms can cause confusion. And some of those quoted tutorials are just sad.
I would explain JOIN in a non-theoretical way as: You get all the rows from the left table, each combined with the matching rows from the right table. If you say LEFT, the nonmatching left table rows will appear once, combined with NULLs. If you don’t, the nonmatching rows won’t appear at all.
Explaining this as an imperative algorithm just invites bad assumptions. The process may or may not be done by running a nested loop — in fact, if that’s what the query plan says, you may have a performance issue to fix — and those rows don’t have to come out in any particular order unless you ask for it.
For me, “all the rows” implies that there are roughly the same number of rows in the output as there were in the first table. In a case where the cross product is triggered (such as ON 1 = 1), there can be 3 rows in the left table and 15 rows in the output table.

I wouldn’t say “all the rows” in this case, because that’s five times as many rows as I had.
Ah, I see the problem now! I guess I’m thinking of it in a more theoretical way than I realized. When I say “rows” I really mean “relations”, which don’t have unique identity; they’re just combinations of values. So it doesn’t imply a 1:1 relationship with the output relation. But that “multiplication” behavior is what I meant by “each combined with the matching rows”.
Nit: the rows are called tuples, with a relation being the equivalent of an entire table (a set of tuples (the body) along with the heading (that all the tuples in the body must match)).
I think that what happens here is that you actually have a correct mental model that you picked up from a text that carefully explained it. So you’re fine.
But maybe when you read another explanation, you don’t really notice that it’s not saying what you know but something completely different. Unless you’re in proofreader mode, it may be hard to see such things. Here you need to deduce the mental model that the text is trying to establish, but you overcorrect it to your own.

But when somebody who is actually learning this stuff reads the “wrong” text, they read exactly what was said, and that’s the mental model they end up with. Maybe at some point they would realize that it’s wrong, but I can see why you can spend MANY years and not really get the fully correct picture. Because most of the queries they encounter would actually work! And most of their own queries would actually work! It would only break sometimes, and we’re so used to things sometimes mysteriously breaking and then mysteriously getting fixed. Nobody has the energy to dig down every time.
And it’s already hard for the novice: they learned from ChatGPT, then they wrote a query, it breaks, they ask for your advice, you change something in the query, and it works. What’s the chance the novice would start a “debugging session” trying to understand why their query was wrong, why your query is correct, what mental model you both had, etc.
Exactly, it’s really hard for an experienced person to recognize which wrong turn a novice has taken, based on external evidence. One reason teaching is a very different skill from doing! And a reason having a teacher (even occasionally) is better than going it entirely alone if you can manage it. (I just started learning the cello and am in that state right now!)
I’m in a database class right now and kinda know something about this.
A JOIN is a cross product followed by a select. Importantly, a select is not about adding rows to the output but about removing ones that shouldn’t be there. It sounds like unimportant trivia, but it matters if you want a truly generalized explanation that doesn’t fail on edge cases.

For example: if you have a natural join where no column names match, the result is a cross product (because nothing was removed, since the select had no condition).
It’s extremely unintuitive and seems like a collection of trivia, but the root of a lot of the confusing behavior comes from join being a cross product followed by a select.
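In standard relational-algebra notation (not from the course materials), that is: R ⋈_θ S = σ_θ(R × S), i.e. the θ-join is a selection σ with predicate θ applied to the cross product R × S; with no predicate, σ filters nothing and you are left with the full cross product.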
So this statement:
Is backwards: it wouldn’t add the data, as it would already be in the output (due to the cross product); it would instead not remove it, since it matched the condition. Anything that doesn’t match the condition is removed.
It seems like the same thing, but it violates intuition when you start looking at a JOIN that is missing conditions.
Unfortunately, this intuition breaks for left join. Imagine a single-row first table and an empty second table. Their cross product is empty, but the result of any left join will have one row.
Great callout. I’ll poke my instructors about it. One thing they mention is that most RDBMSs have some functionality that diverges from the theoretically correct outcome; this might be one excuse they throw out.
I hate how I can’t get to a single unified “why things work the way they do with no edge cases”. I’m quite good at building and retaining frameworks but quite bad at memorizing a series of random trivia.
Unfortunately, the Georgia Tech database course and the textbook The Fundamentals of Database Systems are all over the place.
It’s a cross product and select as you say. The outer table (left, right, both) is augmented with a null row in an outer join.
Doesn’t the null row only show up when there are no matches? That means outer join is not monotonic: adding a row to the input can remove a row from the output. And that means it can’t be made of cross product and select, because those are both monotonic.
The definition in @isra17’s comment uses set-subtraction, which is non-monotonic. Adding rows to S can add rows to (R join S), which can remove rows from R - (R join S).

Ahhh, that’s the word I’ve been looking for. I was trying to point my finger at what’s weird about LEFT JOIN.
Thank you!
That seems to fit. Did you get that from somewhere you can link to?
https://en.wikipedia.org/wiki/Relational_algebra, “Left outer join” section.
No I’m just extending your model to fit the standard behavior of an outer join.
Although perhaps a better definition would be that it is the inner join (the cross product plus select) unioned with the set of tuples from the outer table that don’t survive the select.
The algebraic definition is something like this (from Wikipedia):
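Roughly, and transcribing from memory: R ⟕ S = (R ⋈ S) ∪ ((R − π_{r1, …, rn}(R ⋈ S)) × {(ω, …, ω)}), where r1, …, rn are the attributes of R and ω stands for NULL. In words: the inner join, unioned with the R-tuples that matched nothing, padded out with NULLs on S’s attributes.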
Oh, cool! Can we use you as an informant to try and debug how such things happen in an actual university? :)
I wonder what this is about exactly. One thing that comes to mind is maybe NULL semantics? Like, depending on how the request environment is configured. That sounds like a pretty weaselly way out, though.
I think the most commonly cited example is that the relation model specifies that relations should have set semantics, ie each tuple (row) can occur at most once in the relation, whereas SQL allows duplicate rows.
Codd himself wrote about this, and you can look up practically any text or video by Chris Date to get (good) rants about what SQL got wrong and why it matters.
I think it’s because LEFT JOIN is shorthand for LEFT OUTER JOIN, whereas “a join is a cross product followed by a select” describes a plain JOIN, i.e. an inner join. Which is confusing in and of itself (that you can get an outer or an inner join without specifying the type exactly).
From the textbook (p264):
So yes, it’s an edge case, but intentionally so (in that they created a new concept for it).
Frustratingly, it doesn’t reference this cross-product conundrum. I.e., beyond the property that it “keeps every tuple in the first, or left,” how does it actually do that behind the scenes?
Well, wouldn’t you say now that the algorithm description in the post makes it all clear?
No it’s missing details that are important.
I’m not looking for a “this is how it behaves” I’m looking for a “this is why it behaves this way. From the deepest levels.”
A full outer join then?
FULL OUTER JOIN is equivalent to two LEFT JOINs btw: https://www.evernote.com/shard/s2/sh/aaf55786-281e-6b11-c3f0-232d72a5596b/fcf3bf6eb8ad50d25471cfc2c5f91a44