Threads for 5d22b

    1.  

      Control flow integrity for Rust definitely interests me, but is there a version that doesn’t require JavaScript to view?

    2. 13

      Off topic, I’d like to compliment the elegant, understated visual design of this blog (including for making me aware that CSS can do hanging punctuation).

      1. 6

        Thanks! Re: hanging punctuation—it involves some serious shenanigans (and indeed is implemented as such on the blog), but hanging-punctuation is landing at last (Safari has it now, and hopefully FF and Chrome will get it), and that will make it much easier for everyone to use, which will be great!

        1.  

          I have to second @5d22b’s compliments! The design - and typography :) - is lovely. I went to read this one article and ended up spending a couple of hours on your site.

    3. 2

      I assume this will have some of the same limitations on recursive functions as async does? Namely that you can’t recursively call gen functions without boxing the return value.

      1. 2

        Since it is basically like async (AFAIU) in the 2024 edition, yes. Indeed I tried it on the playground:

        error[E0733]: recursion in an `async fn` requires boxing
         --> src/main.rs:3:1
          |
        3 | gen fn count_to_three() -> i32 {
          | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ recursive `async fn`
          |
          = note: a recursive `async fn` must be rewritten to return a boxed `dyn Future`
          = note: consider using the `async_recursion` crate: https://crates.io/crates/async_recursion
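
        For reference, the usual workaround in the async case looks something like this (a minimal sketch with a made-up countdown function; presumably gen will want the same shape, just with a boxed Iterator instead of a Future):

        use std::future::Future;
        use std::pin::Pin;

        // Recursive async fn rewritten to return a boxed `dyn Future`,
        // as the error note suggests:
        fn countdown(n: u32) -> Pin<Box<dyn Future<Output = u32>>> {
            Box::pin(async move {
                if n == 0 { 0 } else { countdown(n - 1).await }
            })
        }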
        
        1. 3

          I’m somewhat disturbed to see rustc recommend a third-party library like that. That library has been audited as non-malicious, but it might still be compromised later. (On the other hand, I suppose it’s consistent with Rust (and people in general) historically not paying much attention to supply-chain security.)

          1. 3

            As you say, this is a long-standing problem. I hope there are talks to get the repo under the rust-lang umbrella; otherwise, recommending something third-party is bad optics, in my view.

    4. 12

      My work project, Couchbase Lite, has been doing something like this since 2018. I came up with a binary JSON-equivalent data format called Fleece that doesn’t require parsing, and all the JSON data is stored in SQLite as Fleece blobs. (This is invisible to the API, just an internal representation.)

      1. 2

        That’s pretty cool work! :)

        By the way, JSONB says it’s from Postgres, and from a quick check the earliest version that has it available is 9.4, which was released on December 18, 2014. Looks like they never bothered to promote or standardise it much?

      2.  

        Fleece

        For future reference, someone else submitted this recently: https://lobste.rs/s/iq0uwa/fleece_super_fast_compact_json

    5. 1

      Note: Adding the lang attribute is particularly important for assistive technology users. For instance, screen readers alter their voice and pronunciation based on the language attribute.

      I’m sorry, it is the year 2023. It is trivial to identify the language of a paragraph of text, and, if you fail and just use the default voice, any screen reader user will be either a) as confused as I would be, reading a language I clearly don’t understand, or b) able to determine that they are getting German with a bad Spanish accent, assuming they speak both languages. Please, please, please, accessibility “experts”, stop asking literally millions of people to do work on every one of their pieces of content, when the work can be done trivially, automatically.

      1. 6

        These are heuristics, and not always correct. Especially for shorter phrases, it is very possible that the text is valid in multiple languages. It is of course good that these heuristics exist, but it seems best to also provide more concrete info.

        The ideal situation is probably both. Treat the HTML tags as a strong signal, but if there is lots of text and your heuristics are fairly certain that the tag is wrong, consider overriding it; if it is short text or you aren’t sure, go with what the tag says.

        Makes me wonder if there is a way to indicate “I don’t know” for part of the text. For example if I am embedding a user-submitted movie title that may be another language. I could say that most of this site is in English, but I don’t know what language that title is, take your best guess.

        1. 5

          Makes me wonder if there is a way to indicate “I don’t know” for part of the text.

          From https://www.loc.gov/standards/iso639-2/faq.html#25:

          1. How does one indicate undetermined languages using the ISO 639 language codes?

            In some situations, it may be necessary to indicate that the identity of the language used in an information object has not been determined. If the situation is that it is undetermined because there is no language content, the following identifier is provided by ISO 639-2:

            zxx (No linguistic content; Not applicable)

            If there is language content, but the specific language cannot be determined a special identifier is provided by ISO 639-2:

            und (Undetermined)

          Also in fun ISO language codes: You can add -fonipa to a language code to indicate IPA transcription:

          From my resume:

          <h1 lang="tr">
          	<span class="p-name">Deniz Akşimşek</span>
          	<i lang="tr-fonipa">/deniz akʃimʃec/</i>
          </h1>
          
      2. 3

        It is trivial to identify the language of a paragraph of text

        It’s an AGI-hard problem…

        Consider my cousin Ada. The only way a screen reader (or person) can read that sentence correctly without a <span lang=tr> is by knowing who she is.

        What is possible, though far from trivial, is to apply a massive list of heuristics, which is sometimes the best option available, e.g. for user-generated content. However, when people who do have the technical knowledge to take care of these things don’t, responsible authors who mark their languages will then have to work around them.

        1. 1

          But never, in all of human history, has a letter, or book, or magazine article ever noted your cousin’s name language in obscure markup. That’s not how humans communicate, and we shouldn’t start now.

      3. 2

        I write lang="xy" attributes. I, for one, certainly would prefer that the relatively small number of HTML authors take the small amount of care to write lang="xy" attributes, so that user agents can simply read those nine bytes, than that the much larger number of users spend the processing power to run the heuristics to identify the language (and maybe fail to guess correctly). Consider users over authors. Maybe, if one considers only screen readers, the effect shrinks away, but there are other user agents that care in what language is the text on the Web, as common as Google Chrome, which identifies the language so that it can offer to Google-Translate it.

        1. 2

          I, for one, certainly would prefer that the relatively small number of HTML authors take the small amount of care to write lang=“xy” attributes, so that user agents can simply read those nine bytes, than that the much larger number of users spend the processing power to run the heuristics to identify the language (and maybe fail to guess correctly).

          This is the fundamental disconnect. You are not making this ask of the “relatively small number of HTML authors”. You are making this ask of literally every single person who tweets, posts to facebook or reddit, or sends an email. This is an ask of, essentially, every person who has ever used a computer. The content creator is the only person who knows the language they are using.

    6. 6

      It’s actually pronounced more like “yo” in Chinese (尤 Yóu). I’m not sure why he doesn’t correct people who say “yu”. Maybe he’s just given up and decided that that’s how you say his name in English.

      1. 3

        Having a name that is misspelled sucks. Either you don’t give a shit and stop correcting people (and the people around you who find out they’ve been mispronouncing your name will be hurt that they weren’t important enough to be corrected), or you give a shit and you dedicate 3% of your life to correcting people, some of whom will insist you don’t know how to pronounce your own name. It’s a no-win situation.

        Don’t ask me how I know. =\

        1. 1

          People tend to get my first name right, but I’m astonished at how badly people manage to get my surname wrong. And I’m not talking about people who are native speakers of a language with a totally different set of base phonemes. It’s an Irish name, but it is spelled in English exactly as it’s pronounced, yet a lot of people manage to get it wrong. If I had a non-British name, I’d have very low expectations of native English speakers being able to get it right (French people soften the ch because they don’t have a hard ch sound, but aside from that typically get it closer than a lot of English people, who throw random rs in the middle or change the vowel sounds).

          1. 1

            French people soften the ch because they don’t have a hard ch sound

            They might not think of it in the same way as Anglophones think of English “ch”, but French tch represents the same thing, as in French Tchad (English “Chad”, the country).

      2. 2

        Good to know! I listened to him pronounce his name and tried to mimic him, but I guess I got it wrong.

        1. 3

          Chinese and Vietnamese (and some other languages from the area, though not Japanese) both hinge on differences in intonation between vowels that sound the same to somebody who grew up hearing only a Western language. You’re almost certainly going to get it wrong no matter how much you try, so getting it wrong in the same way most Westerners do is the best way to actually be understood.

        2. 3

          Yeah, I sort of imagined you had a pre-show pronunciation check given the level of professionalism of your show. (Good episode, BTW. Can’t believe he didn’t tell his wife.) I think he must have just given up and gone to Yu to make things easier. I know my friends named “Zhang” usually just say “zang” instead of something more like “jong”.

          1. 4

            Can’t believe he didn’t tell his wife

            I know right! That was the craziest thing to me.

            I sort of imagined you had a pre-show pronunciation check given the level of professionalism of your show.

            What I usually do is just ask people to say their name, and go off that. But yeah, that is still subject to my abilities.

          2. 3

            gone to Yu to make things easier

            I note that (assuming Pīnyīn) the pronunciation of “Yu” is not like the English word “you” /ju/ but is canonically /y/, which is unknown in standard varieties of English.

      3. 1

        I was curious if there was a video of him saying his name, and found this, around 5 seconds in.

    7. 5

      About “Avoid bike shedding about syntax”. I think the key part here, which the author seems to gloss over, is “bike shedding”. I don’t think there are many people who hold the belief that “syntax does not matter”, except for the occasional Lisp-er (and even there, people are reasonable; that’s why Clojure is so successful with its []s and {}s). But it’s just too easy to get into very unproductive territory if you let everyone voice their opinion about syntax.

      Syntax, in my opinion, is a part of language design that benefits the most from having a single vision and the BDFL style of development. The fact Rust is brought up a lot in the post makes me inevitably contrast the BDFL style of, say, Clojure, with Rust’s development model, which way too often gets stuck on syntax-related bike-shedding. Sometimes, it even gets into what one could only describe as the Streisand’s Bike Shed Effect: https://doc.rust-lang.org/std/ops/struct.Yeet.html, where not wanting to bike shed about syntax led to the world’s largest bike shed planning committee being formed!

      1. 2

        Bike shedding is totally OK once the broad strokes are there. People appear to enjoy this stuff, so it’s a good way to attract contributors. And by moving the bike shedding into a committee, it’s kept from impinging on the project.

        Rust is a good counterexample for the “BDFL effect” you postulate: A BDFL would never have used ? the way Rust does, and would likely have sneered at postfix .await, yet in Rust both decisions have shown great merit.

        I recently got into a bikeshedding discussion myself, regarding the offset_of! macro: One person wrote an extension to allow multiple subfield offsets (such as offset_of!(Foo, bar.baz) in their syntax). An unfortunate outcome of that syntax is that the code has to do weird things around nested tuple structs (think of offset_of!(Foo, 0.0.0)), so I proposed simplifying to a comma-separated list of subfields (and Variant::subfield paths). The discussion is unsurprisingly ongoing, but it doesn’t hinder the development in the slightest: The current syntax has been merged behind a feature flag, so we have enough time for the discussion to peter out before stabilization.
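
        For context, the nested syntax under discussion looks roughly like this (a sketch with made-up Foo/Bar types; at the time, the nested form was still behind the feature flag mentioned above):

        use std::mem::offset_of;

        #[repr(C)]
        struct Bar { baz: u32 }

        #[repr(C)]
        struct Foo { a: u8, bar: Bar }

        fn main() {
            // Plain single-field offset:
            println!("{}", offset_of!(Foo, bar)); // 4, given repr(C) alignment
            // The contested extension: a nested subfield path.
            println!("{}", offset_of!(Foo, bar.baz)); // also 4 here
        }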

        1. 1

          A BDFL would never have used ? the way Rust does, and would likely have sneered at postfix .await, yet in Rust both decisions have shown great merit.

          What are your reasons for this view? I would think that, for Rust to have made these decisions, multiple people must have supported these decisions, and I would think that at least one of those people also would have done so if that person had been BDFL.

          1. 1

            Because both decisions were made after a lot of at times heated discussion on the topic, where a BDFL would have called the attendees to order, which would have stymied the process. Also have a look at a blog by our might-have-been-BDFL for further examples.

    8. 32

      Not entirely true. There is a mathematical structure known as a wheel (https://en.wikipedia.org/wiki/Wheel_theory) in which division by zero is well-defined: it maps to a special “bottom” element. I have never seen anyone use this, or seriously develop “wheel theory”, but it is fun to know that it exists!

      1. 30

        I actually wrote my Bachelor thesis about this. See chapter 3.1, where I actually proved that division by zero is well-defined in terms of infinite limits in this projectively-extended form of real numbers (a “wheel”). See Definition 3.1, (3.1e) and Theorem 3.6. It all works, even dividing by infinity! :D

        What I noticed is that you actually do not lose much when you define only one “infinity”, because you can express infinite limits to +infinity or -infinity as limits approaching the single infinity from below or above (see chapter 3.1.1).

        Actually this number wheel is quite popular with next generation computer arithmetic concepts like unums and posits. In the current Posit standard, the single infinity was replaced with a single symbol representing “not a real” (NaR). It’s all very exciting and I won’t go into it here, because I couldn’t do justice to just how much better posits are compared to floats!

        One thing I’m pretty proud of is the overview in Table 2.1, which shows the problem: You really don’t need more than one bit-representation for NaN/NaR, but the IEEE floating-point numbers are very wasteful in this regard (and others).

        While NaN representations make up only 0.05% of 64 bit (double) floats, they make up 0.39% of 32 bit (single) floats and 3.12% of 16 bit (half) floats! The formula to calculate the ratio is simple: If n_e is the number of bits in the exponent and n_m is the number of bits in the mantissa, then we have 2^(1+n_e+n_m) floating point numbers and 2^(n_m+1)-2 NaN representations. To get the NaN percentage, you obtain the function p(n_e, n_m) = 100 / 2^(n_e) * (1 - 2^(-n_m)).
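
        A quick sanity check of that formula (a minimal sketch):

        // Share of bit patterns an IEEE-style float spends on NaN,
        // for n_e exponent bits and n_m mantissa bits:
        // p(n_e, n_m) = 100 / 2^n_e * (1 - 2^-n_m)
        fn nan_percentage(ne: i32, nm: i32) -> f64 {
            100.0 / 2f64.powi(ne) * (1.0 - 2f64.powi(-nm))
        }

        fn main() {
            println!("double:    {:.2}%", nan_percentage(11, 52)); // ~0.05%
            println!("single:    {:.2}%", nan_percentage(8, 23));  // ~0.39%
            println!("half:      {:.2}%", nan_percentage(5, 10));  // ~3.12%
            println!("minifloat: {:.2}%", nan_percentage(4, 3));   // ~5.47%
        }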

        Mixed precision is currently a hot topic in HPC, as people move away from using doubles for everything, given they are often overkill, especially in AI. However, IEEE floats are bad enough for >=32 bits, let alone for small bit regimes. In some cases you want to use 8 bits, which is where IEEE floats just die. An 8 bit minifloat (4 exponent bits, 3 mantissa bits) wastes 5.5% of its bit representations for NaN. This is all wasted precision.

        The idea behind posits is to use tapered precision, i.e. use a mixed-bit-length exponent. This works really well, as you gain a lot of precision with small exponents (i.e. values near 1 and -1, which are important) but have a crazy dynamic range as you can actually use all bits for the exponent (and have implicit 0-bits for the mantissa). In the face of tapered precision, the custom float formats by the major GPU manufacturers just look comically primitive. You might think that posits would be more difficult to implement in hardware, but actually the opposite is true. IEEE floats have crazy edge-cases (subnormals, signed 0, etc.) which all take up precious die space. There are many low-hanging fruits to propose a better number system.

        Sorry for the huge rant, but I just wanted to let you know that “wheel theory” is far from obscure or unused, and actually at the forefront of the next generation of computer arithmetic concepts.

        1. 2

          Oh this is absolutely fascinating, thank you!

          Though if I understand the history correctly, the original intent of leaving tons of unused bits in NaN representations was to stash away error codes or other information that might be generated at different steps in a numerical pipeline, right? They never ended up being actually used for that, but what did happen eventually is that people started stashing other info into them instead, like pointers, and we got the NaN-boxing now ubiquitous in dynamic language runtimes like JS and LuaJIT. So it’s less a mistake and more a misdesign that turned out ok anyway, at least for the people doing things other than hardcore numeric code. That said, you can’t really NaN-box much interesting info inside an 8-bit float, so the IEEE repr is indeed wasteful there, especially when the entire goal of an 8-bit float is to squeeze as much data into as little space as possible.

          Out of curiosity, does the posit spec at all suffer for having a single NaR value instead of separate NaN and infinity values? I vaguely understand how you can coalesce +inf and -inf into a single infinity and it works out fine, but when I think about it in terms of what error cases produce what results, to me infinity and NaN express different things, with NaN being the more severe one. Is it just a matter of re-learning how to think about a different model, or are there useful distinctions between the two that posits lose?

          1. 3

            Thanks for your remarks!

            As far as I know, there was no original intent to allow metadata in NaN-representations, and JS/LuaJIT just made smart use of it. It’s always the question what you want your number format to be: Should it be able to contain metadata, or only information on a represented number? If you outright design the format to be able to contain metadata, you force everybody’s hand, because you sacrifice precision and dynamic range in the process. If you want to store metadata on the computation, I find it much more sensible to have a record type of a float plus a bitstring for flags or something. I see no context, outside of fully controlled arithmetic environments (where you could go with the record type anyway), in which one could make use of the additional information.

            Regarding your other point: Posits are not yet standardized and there’s some back and forth regarding infinity and NaR and what to use in posits, because you can’t really divide by zero, even though it’s well defined. I personally don’t see too much of an issue with having no infinity-representation, because from my experience as a numerical mathematician, an unexpected infinity is usually the same as a NaN condition and requires the same actions at the end of the day, especially because Infs very quickly decay into NaNs anyway. This is why I prefer NaR and this is what ended up in the standard.

            The only thing I personally really need in a number system is a 100% contagious NaR which indicates to me that something is afoot. An investigation of the numerical code would then reveal the origin of the problem. I never had the case where an infinity instead of a NaN would have told me anything more.

            To be completely clear: Posit rounding is defined such that any number larger than the largest posit is rounded to the largest posit (and the smallest accordingly). So you never have the case, in contrast to IEEE floats, where an input is rounded to NaR/+-infinity. Given +-infinity is by construction “not a real”, I find it to be somewhat of a violation to allow this transition with arithmetic operations that are well-defined and defined to yield only reals.

            Dropping infinity also infinitely reduces the necessary complexity for hardware implementations. IEEE floats are surreal with all their edge cases! :D

            1. 1

              The original intended use case for NaNs was that they should store a pointer to the location that created them, to aid in debugging.

          2. 2

            I vaguely understand how you can coalesce +inf and -inf into a single infinity and it works out fine

            But you lose some features. My vague understanding is that +/-0 and +/-inf exist to support better handling of branch cuts. Kahan says:

            Except at logarithmic branch points, those functions can all be continuous up to and onto their boundary slits when zero has a sign that behaves as specified by IEEE standards for floating-point arithmetic; but those functions must be discontinuous on one side of each slit when zero is unsigned. Thus does the sign of zero lay down a trail from computer hardware through programming language compilers, run-time support libraries and applications programmers to, finally, mathematical analysts.

            1. 1

              Yes, this was the post-justification for signed zero, but it creates many more problems than it solves, introducing many, many special rules and gotchas. If you do proper numerical analysis, you don’t need such things to hold your hand. Instead, given that it’s totally unexpected for the mathematician, it leads to more errors.

              It’s a little known fact that Kahan actually disliked what the industry/IEEE did to his original floating point concepts (I don’t know how he sees it today), and this is not the only case where he apparently did some mental gymnastics to justify bad design afterwards to save face in a way.

      2. 3

        I had never heard of that concept. I want to share how I understand dividing by zero from calc2 and then relate that back to what you just shared.

        In calc 2 you explore “limits” of an equation. This is going to take some context, though:

        Understanding limits

        To figure out the limit of 1/x as x approaches 1, you would imagine starting at some number slightly greater than 1, say 1.1 and gradually getting smaller and checking the result:

        • 1/1.1
        • 1/1.01
        • 1/1.001
        • etc.

        But that’s not all. You also do it from the other direction so 0.9 would be:

        • 1/0.9
        • 1/0.99
        • 1/0.999
        • etc.

        The answer for the “limit of 1/x as x approaches 1” is 1. This is true because the approaches from both directions converge to the same number (even if they never actually quite reach it). Wolfram Alpha agrees: https://www.wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+1

        But, limits don’t have to converge to a number, they can also converge to negative or positive infinity.

        Limit of dividing by zero

        Now instead of converging on 1, let’s converge on zero. What is the “limit of 1/x as X approaches zero”?

        We would check 0.1 and go down:

        • 0.1
        • 0.01
        • 0.001
        • etc.

        And as it goes up:

        • -0.1
        • -0.01
        • -0.001
        • etc.

        The problem here is that coming from the top and going down, you approach (positive) zero, and starting at the bottom and going up, you approach negative zero, which is a thing, I promise: https://en.wikipedia.org/wiki/Signed_zero. Since the results from the two directions don’t converge to the same value, and the answer cannot be both, division by zero (under this model) is unknowable, and Wolfram Alpha agrees: https://wolframalpha.com/input?i=limit+of+1%2Fx+as+x+-%3E+0.

        As a “well actually” technical correctness note: I’m explaining this as I intuit it, whereas in reality 1/x as x approaches 0 goes to positive infinity and negative infinity. I know it has something to do with Taylor expansion, but I’ve been out of school too long to explain or remember why. Even so, my explanation is “correct enough” to convey the underlying concept.
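
        In one-sided-limit notation, that is:

        \lim_{x \to 0^{+}} \frac{1}{x} = +\infty, \qquad \lim_{x \to 0^{-}} \frac{1}{x} = -\infty

        and since the two one-sided limits disagree, the two-sided limit \lim_{x \to 0} \frac{1}{x} does not exist in the reals.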

        Wheel

        As you get into higher mathematics I find that it feels more philosophical than concrete. There are different ways to look at the world and if you define the base rules differently than you could have different mathematical frameworks (somewhat like, but not exactly the same, how there is “regular” and “quantum” physics).

        It looks like the wheel says “negative and positive infinity aren’t different”, which is quite convenient for a lot of calculations, and then suddenly 1/x does converge to something when it goes to zero.

        1. 7

          limits don’t have to converge to a number, they can also converge to negative or positive infinity.

          “As a “well actually” technical correctness note”: if one is working in the real numbers, which don’t include infinities, a limit returning infinity is more strictly called “diverging” to positive or negative infinity. A limit can also “diverge by oscillation”, if the value of the expression of which the limit is taken keeps changing for ever, like sin(x) as x tends to infinity.

        2. 4

          What this says is that the limit of 1/x at 0 is not defined, not that 1/0 itself is undefined. Consider sin(x)/x. If you do the math, the limit as x → 0 is 1, but sin(0)/0 = 0/0.

          Another interesting example is the Heaviside function, x < 0 ? 0 : 1. Limit → 0 from the left is 0, limit from the right is 1, so the limit doesn’t exist. But the function is well defined at 0!

          1. 1

            You can express limits as approaching from a direction though, can’t you? So you can say that lim -> +0 is 1 and lim -> -0 is 0. It’s not that the limit doesn’t exist, but a single limit doesn’t exist, right?

            Why is this all way more fun to think about now than when I was taking calc 1 and actually needed to know it?

            1. 1

              Yeah, that’s right!

        3. 3

          Another weird limit argument is that the limit of x/x as x goes to zero is 1.

          1. 2

            To me, a more noteworthy one is that the limit of pow(x, x) as x approaches zero is 1. x/x = 1 for almost all values of x, but pow(x, x) = 1 is much rarer.

      3. 2

        At least from a cursory reading of the Wikipedia page, you have basically traded x/0 being defined for losing 0*x = 0 and x/x = 1 as general identities.

      4. 1

        Yeah, you can also make a mathematical structure where division by zero is five. It’s not very useful though.

      1. 4

        You can in pretty much any programming language by using IEEE 754 floating point. It is clearly defined and will compute to infinity or NaN.
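
        For example, in Rust (a minimal illustration; most languages with IEEE 754 floats behave the same):

        fn main() {
            println!("{}", 1.0_f64 / 0.0);  // inf
            println!("{}", -1.0_f64 / 0.0); // -inf
            println!("{}", 0.0_f64 / 0.0);  // NaN
        }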

        1. 1

          It has to be special-cased; otherwise, in programming, it would just be an endless loop:

          Subtracting 0 from $VARIABLE until it reaches 0, which will never happen.

          1. 3

            It probably is special cased, but also, people generally use much smarter algorithms for division than “subtract B from A until you get 0”.

          2. 2

            That algorithm doesn’t work for floating point numbers anyway. For example, 2^64/1 would never complete, because the closest double precision number to 2^64 is also the closest number to itself minus 1. So you can subtract 1 all day and it will keep rounding back to itself.
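
            A quick check of that rounding claim, in Rust:

            fn main() {
                let x = 2f64.powi(64);
                // Adjacent doubles near 2^64 are 4096 apart, so x - 1.0
                // rounds straight back to x.
                assert_eq!(x - 1.0, x);
            }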

      2. 3

        J also allows it and so does JS, and in both cases the answer is more reasonably infinity.

      3. 2

        Aside: @hwayne, since you’re here, I’d like to feed back that I thought this was pretty confusing:

        […] I don’t know Pony, and I don’t have any interest in learning Pony.¹ But this tweet raised my hackles for two reasons:

        1. It’s pretty smug. I have very strong opinions about programming, but one rule I try to follow is do not mock other programmers.² […]
        2. It’s saying that Pony is mathematically wrong. […]

        Reading this linearly until hitting a footnote reference (“¹”) and then jumping down to where there was what looked like an inline footnote body, I thought you were saying you had no interest in learning Pony because Pony was too smug and mocked other programmers; only after reading the second (not-a-)“footnote” (Pony is saying Pony is wrong? huh?) did I catch my mistake.

        1. 2

          Well that’s a weird UX failure mode of footnotes + numbered lists! That footnote should go to this:

          In the year since I wrote this post, Pony added partial division. I still don’t know anything about the language but they’ve been getting grief over this post so I wanted to clear that up. [return]

      4. 1

        Previously discussed on Lobsters as: https://lobste.rs/s/ilwn5n/1_0_0

    9. 8

      A good read.

      Regardless of whatever funkiness was going on generally, one thing that was fairly consistent back then was the presence of hyperlinks and how they were used (rendered). What I find interesting about how this post is displayed (same style in other posts elsewhere on the blog too, to which I’ve subscribed, BTW) is how the hyperlinks … well, aren’t. At least, not in the way they used to be.

      As a reader, I actually enjoy text that isn’t littered with actual hyperlinks, it prevents me from following rabbit holes and lets me enjoy the prose in front of me.

      That said, given this bit: “There were no best practices …” I do remember way back then that there were best practices, some notably enshrined on a very informative and well respected website from Jakob Nielsen. One of these, IIRC, was to never use “this” (or similar) as the text of a hyperlink.

      So seeing “Whenever I see technology mughals like Peter Norvig or Rob Pike showcase their thoughts through ancient relics like this[2] or this[3]” I have to smile, (a) because these aren’t proper hyperlinks, and so (b) using “this” in these instances is OK :-)

      1. 8

        I have to disagree—I found it really distracting to get to the end of a word, follow the footnote reference to the bottom of the page, and then realize that I was just looking at the hyperlink that should have been in the body text to begin with. And then try to find where I left off reading.

        1. 2

          And then try to find where I left off reading.

          The “↩︎” links don’t work for you?

          1. 2

            I was reading on a large screen, so at some point both the article and the footnotes were visible.

        2. 1

          Yeah, I would’ve loved if the “footnotes” were just a link (not the words), but clicking once to follow the footnotes and once more to go to the link seems willfully archaic (or academic-paper-like? (or is that the same?))

      2. 3

        I might use IEEE citation style here: “showcase their thoughts through ancient relics like [2] or [3]” (with the numbers not as superscripts). To me it feels wrong to rely on footnotes for the meaning of the sentence (“like this or this”, without the “[2]” and “[3]”, would make no sense).

    10. 9

      I know NixOS has its problems, but it’s a real treat to see an OS release announcement and know there’s no risk in upgrading. OS rollbacks really change the upgrade risk calculus. Thanks, NixOS contributors!

      1. 8

        Realistically there’s no risk in upgrading until there is risk in upgrading and your system gets bricked one way or another.

        1. 4

          I was going to reply that there is no risk at all, for all practical purposes, because of how easy it is to boot into previous generations. Then I upgraded and found my UEFI boot config was borked! At least I got a good story out of it, and will never be tempted to post any foolish replies for the rest of my life.

        2. 4

          Using both NixOS and ZFS (for which NixOS has good support) multiplies one’s defense-in-depth, but ultimately I agree there’s always some risk.

          1. 3

            There is currently a bug fix patch for ZFS you should definitely use. I don’t think it’ll murder old data; it just messes with new data in some edge cases (there is a very good writeup on the ZFS release page on GitHub).

          2. 1

            I was around some pro nix heads and I remember their systems being borked all the time.

      2. 1

        Out of interest, can you mention or link to some of these problems you’re referring to? I’m getting into NixOS myself and I wonder if there are obvious issues with it I might be missing.

        1. 2

          I meant in the general sense that nothing is perfect. If you’re interested in NixOS, I strongly recommend just taking the plunge, even if your first foray is in a VM.

    11. 1

      Doing [runtime detection] correctly is very hard and something I don’t particularly want to dig deeply into here.

      It’s kinda not that hard, especially not in Rust, which gives you #[target_feature(..)] and the std::arch macros (sadly it doesn’t make it easy to use ifuncs instead of if statements, but oh well, the performance benefit from rtld-time resolution shouldn’t be that big).
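
      The usual shape of the pattern, for reference (a minimal sketch with a made-up sum function):

      fn sum(xs: &[f32]) -> f32 {
          #[cfg(target_arch = "x86_64")]
          {
              if is_x86_feature_detected!("avx2") {
                  // SAFETY: the runtime AVX2 check above just passed.
                  return unsafe { sum_avx2(xs) };
              }
          }
          xs.iter().sum()
      }

      #[cfg(target_arch = "x86_64")]
      #[target_feature(enable = "avx2")]
      unsafe fn sum_avx2(xs: &[f32]) -> f32 {
          // With AVX2 enabled, the compiler is free to vectorize this body.
          xs.iter().sum()
      }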

      1. 1

        sadly doesn’t make it easy to use ifuncs instead of if statements

        There is ifunky. The much more widely used memchr does something similar, though I suppose that doesn’t count as easy.

        1. 1

          “without the need for loader support” yeah those aren’t really it either.

      2. 1

        Don’t you use cfg_feature_enabled for runtime feature detection, and target_feature for compile-time feature detection?

      1. 15
        session = run_cmds("cat /proc/self/sessionid".to_owned()).unwrap_or_default();
        

        Oh my god, this is almost daily wtf worthy. It’s “useless use of cat” to a level I wouldn’t have believed existed. What’s wrong with:

        session = fs::read_to_string("/proc/self/sessionid").unwrap_or_default();
        

        EDIT: The other part that’s really shocking is the needless amount of additional allocation, i.e. using String, Vec, .to_owned() and .to_vec() everywhere rather than borrowing as &str or as a slice.

        1. 11

          This reminds me a lot of the Rust project I was on at my previous gig, especially all the unwrap_or_default(), checking if the string it just generated is_empty() instead of using an Option, useless use of cat and other stuff that is easily done from within Rust… I had to fix a lot of that.

          I think it’s a strong signal of a team of developers who are not very familiar with Rust, working under time pressure to deliver a product, being given features to implement with instructions on how to do it in shell and not having much time or incentives to learn how to do it properly.

          1. 8

            This reminds me a lot of the Rust project I was on at my previous gig, especially all the unwrap_or_default(), checking if the string it just generated is_empty() instead of using an Option, useless use of cat and other stuff that is easily done from within Rust… I had to fix a lot of that.

            Excessive use of mut is another sin the codebase commits.

            I think it’s a strong signal of a team of developers who are not very familiar with Rust, working under time pressure to deliver a product, being given features to implement with instructions on how to do it in shell and not having much time or incentives to learn how to do it properly.

            It probably betrays a lack of understanding of the shell commands being executed as well (at least the awk/cat/ps wrappers—I can almost forgive wrapping loginctl in a rush to ship, as D-Bus APIs are often poorly documented and annoying to work with).

            I have to imagine if the developers had actually understood that the awk/cat wrappers were simply reading files and doing rudimentary parsing, then they would have at least tried to implement them in Rust.

        2. 4

          EDIT: The other part that’s really shocking is the needless amount of additional allocation, i.e. using String, Vec, .to_owned() and .to_vec() everywhere rather than borrowing as &str or as a slice.

          I’m reminded of a piece of advice given in earnest to beginner Rust programmers: if you’re having trouble dealing with lifetimes when getting a piece of code you’re writing to compile, just call clone() or otherwise go ahead and do the extra allocations, in order to get the essential logic of your program working. After all, these extra copies and boxings of data structures are not really very different from what the runtime of an interpreted, dynamically-typed language like Python would be doing under the hood as a matter of course; in your particular code the extra overhead might not be meaningful anyway, and if it is, you at least have the option of rewriting it to use more references and fewer clones once you’ve gotten the initial code working.
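
          A minimal sketch of that advice in action (hypothetical Config type):

          #[derive(Clone)]
          struct Config {
              name: String,
          }

          fn spawn_worker(cfg: &Config) -> std::thread::JoinHandle<()> {
              // Cloning sidesteps the 'static-lifetime fight entirely;
              // a borrow of `cfg` could not be moved into the thread.
              let cfg = cfg.clone();
              std::thread::spawn(move || {
                  println!("worker configured for {}", cfg.name);
              })
          }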

          1. 2

            That’s good advice that I need to keep reminding myself of, even after years of using Rust. Unless used in the most critical code paths, clones and allocations don’t really matter, while it’s easy to waste a lot of time trying to avoid them.

        3. 1

          That’s the kind of thing you typically only see in data science Python code!!!

          The number of

          os.system("rm -rf ../../files/../idontknowwhy/../*")

          type commands I’ve seen…

      2. 6

        Unsynchronized access to static muts in multithreaded async code

        I find it worth pointing out that, whereas the shell-ing out is ‘merely’ a (strong) code smell, this is prima facie Undefined Behavior.
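
        For comparison, the sound version of a shared global is tiny (a sketch, not the project’s actual code):

        use std::sync::atomic::{AtomicU64, Ordering};

        // Unlike `static mut X: u64` plus unsafe reads and writes
        // (a data race, hence UB), an atomic is safe from any thread:
        static COUNTER: AtomicU64 = AtomicU64::new(0);

        fn bump() -> u64 {
            COUNTER.fetch_add(1, Ordering::Relaxed) + 1
        }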

      3. 3

        The shell stuff is especially odd, given how easy it is to do those things directly from Rust.

      4. 2

        While the shelling-out looks like just bad practice, the unsafe use does look severe. Reported upstream: https://github.com/rustdesk/rustdesk-server/issues/324

    12. 19

      If you remove information from a document, it compresses better. Who’da thunk?

      1. 6

        If you remove information from a document,

        IMO, title case removes information compared to sentence case, because it erases the distinction between proper nouns and common nouns. For example, I look to a Google news aggregator feed, which shows a list of articles with titles variously in title case and sentence case. Most seem to be in sentence case, but, for example, here is one Bloomberg article’s title:

        Eat Less Meat Is Message for Rich World in Food’s First Net Zero Plan

        From that title, I can’t know whether it means

        • “Eat less meat is message for Rich World in Food’s first net zero plan” (a group named Rich World in Food releases its first “net zero plan”, says to eat less meat), or
        • (with a typo) “Eat less meat is message for rich: World in Food’s first net zero plan” (a group named World in Food releases its first net zero plan, tells the rich to eat less meat), or
        • “Eat less meat is message for rich world in Food’s first net zero plan” (a publication named Food releases its first net zero plan, tells the rich world to eat less meat), or
        • “Eat less meat is message for rich world in food’s first net zero plan” (someone releases what is claimed to be the “first net zero plan” for the food industry, tells the rich world to eat less meat).

        I suppose the last is most likely, but I can’t know without opening the article. Of course, if it does get me to open the article, then it was useful to the publisher, but still, the title-cased title conveyed less information than a sentence-cased title would have.

        1. 5

          It’s worth noting that title case is primarily a US thing. Neither the Cambridge nor the Oxford style recommends it, for precisely the reasons that you describe. I can’t remember if both New York and Chicago styles include it. The BBC style guide is often incorrectly applied by their own journalists in a similar way. It recommends using initial capitals for acronyms, but journalists also apply the rule to initialisms (which the style says should be all caps), leading to weird things like Pc (or, horribly inconsistently, IBM Pc).

      2. 5

        Switching from title case to sentence case doesn’t really remove information. This post shows how it can still have an impact.

        1. 9

          I Strongly Disagree.

        2. 4

          A good question, then, is whether you can make some JS that, in less than 31 bytes (compressed), will add title case back.

            1. 9

              Capitalising the first letter of every word isn’t the same thing as title-casing, as that page says:

              Note: Authors should not expect capitalize to follow language-specific title casing conventions (such as skipping articles in English).

              1. 1

                If you separate each sentence into tags you can use ::first-letter.

                On the note of language: noted; one can use data attrs to set the language and change the capitalization strategy accordingly.

      3. 1

        Depends on what you remove:

        • d1: abc abc abc abc
        • d2: abc abc abc ac
    13. 3
      • ZFS, which I can trust to store my data, unlike most filesystems
      • GNU Emacs, the Extensible, Self-Documenting Editor, which seems to manage to adopt features from more modern IDEs while keeping in touch with its traditions
      • The Rust implementation (even if somewhat grudgingly), for being good at catching my mistakes without making me face the theoretical complexity (or unfamiliarity) of e.g. Haskell (though I respect Haskell and the proof languages)
      • Some proprietary software that’s widely used but pretty unpopular among the kinds of people who post here :-)
      1. 1

        Some proprietary software that’s widely used but pretty unpopular among the kinds of people who post here :-)

        Is it Windows!? ChatGPT? Now I’m curious :-)

    14. 14

      I too would like this problem to be addressed, but I cannot agree with this article.

      The title says “C Doesn’t Want to Fix It” and the introduction alleges that “many people who could influence the standard or maintain the C libraries don’t see this as a problem”, but these accusations are not supported with evidence.

      The introduction goes on to say (emphasis added)

      the specification clearly documents that setenv() cannot be used with threads. Therefore, if someone does this, the crashes are their fault. We should apparently read every function’s specification carefully, not use software written by others, and not use threads.

      …which, at least given its lack of citations, seems to me like shameless strawmanning of the C people.

      This article does cite sources that show that C people declined to implement one proposal for addressing this problem, “Annex K”, but to go from “They decided not to implement this one proposal” to “They don’t want to fix it” or to “They don’t see this as a problem” seems like… something like the Politician’s Fallacy. The Politician’s Fallacy is “We should do something about the problem; this is something; therefore, we should do this”; this article seems to be saying “We should do something about the problem; this is something; they refused to do this; therefore, they refuse to do something about the problem”.

      Furthermore, although this article acknowledges in passing that the C people have specific criticisms of Annex K, this article — while presenting implementing Annex K as the right thing to do and thus, implicitly, presenting the C people as wrong to decline to do it — does not appear to attempt to argue against or otherwise engage with their reasons for declining to implement Annex K.

      1. 4

        At the risk of sounding like “those C people”, I don’t understand why any solution for environment thread safety at the language or language runtime layer would be a desirable choice in the first place.

        I’ve been bit by setenv not being thread safe a handful of times but every time it’s happened, there was a very good case to be made that achieving whatever I wanted to achieve by modifying the environment was a poor choice in the first place, thread-safe or not. Need to configure library behaviour in a specific context? Great – pass that configuration through its libfoo_context configuration context or whatever. Need to send arbitrary data, like DNS server information between processes or whatever? There are proper IPC mechanisms for that.

        On the one hand, thread unsafety is a creeping bug. It gets into your code via third-party dependencies – like third-party libraries that can only be configured through some environment variable – so you’re not aware until it’s too late, and you can’t do much to fix it. So I understand the drive to fix it.

        But on the other hand it feels like a whole lot of work (remember Annex K?) that both works around poor program and library design choices and continues to enable the incorrect behaviour. It kind of feels like barging in on the International Federation for Stupid Games with a big box of goodies so that the athletes who take part in the Stupid Games World Cup don’t have to win stupid prizes. I kind of get why the folks who write compilers aren’t particularly eager to fix it that way.

        Frankly, I don’t think the environment should be locally mutable in the first place – if you need to pass a modified version of your environment to a process that you exec, you should just be able to clone your read-only environment into a mutable copy and hand it over. It’s a scheme that also works if you need to use separate envs in threads of the same program – not sure why but… okay? – except with thread-local storage. I secretly suspect, based on the wording in the older standards, that it wasn’t even meant to be mutable in the first place. I guess that ship has definitely sailed as far as Unix is concerned but who knows, maybe Plan 10 will fix that :-).

        1. 1

          It’s a scheme that also works if you need to use separate envs in threads of the same program – not sure why but… okay? – except with thread-local storage.

          We do exactly this in build2: a per-thread mutable environment that sits on top of the immutable per-process environment. This is used to implement hermetic build configurations, where a thread needs to switch to the environment of the project it’s working on. We could have dragged the explicit environment through the multiple layers of the API, but that would be quite painful, especially seeing that no code except that which starts processes needs to touch it.
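
          A minimal Rust sketch of that overlay idea (not build2’s actual C++ code):

          use std::cell::RefCell;
          use std::collections::HashMap;

          thread_local! {
              // Per-thread overrides shadowing the (never-mutated) process env.
              static OVERLAY: RefCell<HashMap<String, String>> =
                  RefCell::new(HashMap::new());
          }

          fn env_get(name: &str) -> Option<String> {
              OVERLAY
                  .with(|o| o.borrow().get(name).cloned())
                  .or_else(|| std::env::var(name).ok())
          }

          fn env_set(name: &str, value: &str) {
              OVERLAY.with(|o| {
                  o.borrow_mut().insert(name.to_owned(), value.to_owned());
              });
          }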

      2. 3

        Annex K was for C11. Since then we’ve had C17 and almost C23. If the issue was not fixed in the follow-up standards, is it really wrong to say they don’t care? The opportunities were there and the issue remains. It’s also a smaller-scope problem than “fix the API for bounds checking in the whole world”.

        1. 6

          Annex K was stillborn, because even its sponsor (Microsoft) doesn’t implement it according to the standard. It’s a really bad design that doesn’t guide the programmer away from error-prone patterns. I am doubtful about the wisdom of a global trap handler that allows you to suppress buffer overflow traps.

        2. 2

          It probably can’t be fixed. You need to find a good solution, and you need to make sure it is easy enough for the 100-odd C implementations out there to adopt it. Otherwise it won’t go in the standard. And even then it won’t really be fixed, since code compiled under the old standard would need to continue to work with newly compiled code. Hell, the fix might be a new API, so the old behavior might need to continue to be accessible and terrible forever.

      3. 3

        I think you’re being a bit unfair.

        You write

        while presenting implementing Annex K as the right thing to do and thus, implicitly, presenting the C people as wrong to decline to do it — does not appear to attempt to argue against or otherwise engage with their reasons for declining to implement Annex K.

        From the article:

        My understanding is the people responsible for the Unix POSIX standards did not like the design of these functions, so they refused to implement them. For more details, see…

        I haven’t looked at the rest of the functions, but having spent way too long looking at getenv(), the general idea of getenv_s() seems like a good idea to me. Standardizing this would help avoid this problem.

        To me, it’s pretty clear the author is not arguing for Annex K, but is arguing that the getenv_s function and setenv_s function (or something similar) would be a significant improvement over the status quo.

        Standard writers could pick and choose from Annex K without implementing the whole thing. It looks to me like the three criticisms linked do not specifically mention getenv_s or setenv_s, just Annex K, considered as a whole.

        Unless there are other efforts from contributors to the standard to fix this in the past dozen years, I think it’s reasonable to say they don’t care much.

        1. 6

          Standard writers could pick and choose from Annex K without implementing the whole thing. It looks to me like the three criticisms linked do not specifically mention getenv_s or setenv_s, just Annex K, considered as a whole.

          This was like 10+ years ago already so I may be misremembering some parts but I don’t think it’s that easy, unfortunately.

          This blog post makes it sound like it’s super easy, just implement getenv_s and setenv_s and that’s it. IIRC it’s not that straightforward.

          First off, ignoring all other aspects of whether that would be a good design in the first place, getenv_s and setenv_s are not thread-safe in a typical default environment, either. The runtime constraint handler, which handles constraint violations for _s-family functions, is process-global. IIRC programs can define their own constraint handlers, and you can probably work thread safety in that, but, in addition to it not being at all clear whether that’s a good idea in general, you’re still stuck assembling bounds checking error handling code from Lego bricks and still have to make sure it’s thread safe. It’s a much harder problem. If you have the kind of time you need to do that kind of crap well, you have more than enough time to work around whatever env setting problems you have, which is also a far easier problem.

          This is actually the part of the design that “the people responsible for the Unix POSIX standards” hated the most. The function interface itself could probably be salvaged to some degree. However, you can’t carve out just these two functions and implement them – their spec is not implementable as-is, in isolation from everything else in Annex K, because they rely on the runtime constraint handler for some of the error signaling.

          Furthermore, these two functions aren’t “their own thing” in Annex K, they’re part of a whole series of functions with a unified (eh) bounds-checking and error handling interface which were devised to work together. They don’t make much sense unless you pair them with the rest of the API in Annex K. Implementations that pick just these two functions are pretty likely to just saddle everyone out there with coming up with an ad-hoc implementation of various other bits and pieces of Annex K.

          The idea behind them per se is okay (getenv_s basically gives you a copy of the environment variable you’re requesting) but the interface of both functions is actually waaaay more complicated than this post makes it sound, too. E.g. getenv_s gives you a copy in a buffer you provide, which means you get to figure out how big it’s got to be. You do that by calling it with a NULL destination buffer. It’s a pretty long-winded API that results in a lot of nasty copy-paste bugs. Even if you take the RCH out of the equation and bless these functions with proper (and direct) error signaling, it’s still an API that’s very difficult to use correctly and has a lot of pitfalls both in the interface and in the underlying implementation.

          It’s a super fragile mechanism that turned out to usually result in about as many – but far more subtle – bugs than non-threadsafe {get|set}env calls. It’s just not much better than just not doing {get|set}env in a threaded environment in the first place.

          1. 1

            Thanks, this is very helpful.

            I wonder to what extent these are separable concerns? By that, I mean, can you have an api like getenv_s, setenv_s without opting into all of the complexities of Annex K? I also don’t know whether OP perceived it that way or not. The article makes it sound like he thinks you can pick and choose what to bring in.

            1. 2

              I wonder to what extent these are separable concerns? By that, I mean, can you have an api like getenv_s, setenv_s without opting into all of the complexities of Annex K?

              Well, first off, I definitely misremembered at least part of it, setenv_s is not a thing :-). I could have sworn Annex K included a counterpart to getenv_s but I’m probably thinking about a completely different initiative. In software years this was like 70 years ago, sorry :-D.

              A mechanism similar to getenv_s‘s core mechanism – giving you a copy of the requested environment variable’s value – is probably possible. @borisk mentioned an implementation based on currently-available primitives that does basically that here. While I’ve never had to implement per-thread envs myself I think that’s how I would’ve done it, too.

              If you want to implement getenv_s‘s API exactly, then you can’t decouple it from the rest of Annex K – the standard lists a number of runtime constraints that have to be handled globally, through the RCH.

              Deferring getenv_s‘s runtime constraint checks to its implementation, and signaling them through direct error codes, is probably possible, but then you’re still stuck with an interface that has a really bad design. Unless you make a process’ env list immutable (at which point you’ve already kind of solved the problem and don’t need another API in the first place), you have an interface that’s difficult to implement without leaking the handling of potential TOCTOUs to the user (and we all know how that ends), and a serious vector of bugs due to its two-step call sequence (call once with a NULL buffer to get the length of the value, call again with a proper buffer to get the value).

              IMHO implementing a modified getenv_s, with direct error handling and without the RCH, would just replace one class of bugs with another. It’s not the right way to go. The author of this blog post is absolutely right to think access to environment variables should be thread-safe but a getenv_s-like API is not a good way to do that.

              And more generally, I think this is not the right layer to solve this problem at – IMHO the “correct”, but undesirable fix is making the environment immutable in the first place. I don’t think this is a problem that can be fixed without leaking the very hairy problem of handling environment mutability to the caller, at which point you’re just ticking the “thread-safe” box – sure, you have a thread-safe API, but using it correctly in a threaded environment is about as hard, if not harder, than working around the original thread-unsafeness of the old API, and asking people to devise new workaround for a new bad API when there’s already an established set of workarounds for the old bad API isn’t very productive.

        2. 2

          I wrote:

          Furthermore, although this article acknowledges in passing that the C people have specific criticisms of Annex K, this article — while presenting implementing Annex K as the right thing to do and thus, implicitly, presenting the C people as wrong to decline to do it — does not appear to attempt to argue against or otherwise engage with their reasons for declining to implement Annex K.

          hyperpape wrote:

          I think you’re being a bit unfair. […] To me, it’s pretty clear the author is not arguing for Annex K, but is arguing that the getenv_s function and setenv_s function (or something similar) would be a significant improvement over the status quo.

          I think a crucial sentence that led to my understanding of the article as advocating for Annex K may have been this, in the section on Annex K:

          Standardizing this would help avoid this problem.

          Rereading, I agree that I likely was wrong to interpret that “this” as referring to Annex K and thus misrepresented the article, for which I apologize.

          However, rejections of Annex K are the only rejections of a solution to which this article links. Therefore, if I reread with the understanding that the article is advocating for a specific change other than implementing Annex K, it now seems to me that the article cites no evidence for the rejection of any solution that the article would want, despite its claims that “C Doesn’t Want to Fix It” and “many people who could influence the standard or maintain the C libraries don’t see this as a problem”. There is the argument that, if they don’t fix it (and they haven’t), then that means they don’t see it as a problem, but, whether or not I agree with that argument (I don’t), I don’t think it is an argument that the article actually makes.

    15. 4

      Who and what in the world wants to stick to RFC 3986 though, when it’s an imperfect, old and unmaintained standard?

      Disclaimer: I have a browser bias and contribute to whatwg things.

      1. 26

        I also maintain a standards compliant URL parser, and I too stick with RFC 3986. A few years ago I looked into the WHATWG spec and came back shaking my head. The RFC might be old and “imperfect”, but from an implementor’s point of view, it’s a damn sight better than the WHATWG spec.

        That blog post focuses on the grammar, the lost rigor, and the difficulty of implementing the spec, but I have another gripe with the WHATWG spec: there is no explicit versioning and no availability of older documents. Since it is a “living standard”, claiming compliance for a URL parser has no real meaning, because tomorrow it might not comply anymore. It’s a moving target. Like trying to build on quicksand.

        IMO, standards should be clearly versioned (beyond “just use the git repo”), so that if a bug is claimed, one can go back to the document that was used as a guide, and verify whether it really is a bug, misunderstanding, etc. Then, if it’s only a noncompliance with a newer version of the standard, one can decide whether to “upgrade”.

        1. 4

          Thanks for the detailed reply. Your blog post was helpful in understanding your position. The mentioned presentation “The Science of Insecurity” is one of my all-time favorites; I have watched it many times :)

          I agree that these inconsistencies are truly dangerous and I would like them not to be. Maybe it’s the arrogance of a browser person speaking, but the web is evolving. I understand that it feels like quicksand, but there’s no easy way to participate in an ecosystem that is moving fast. So, even if there were a formal grammar (and I agree, there should be!), URL parsers need to align with browsers.

          The WHATWG URL Standard is maintained and RFC 3986 not. Sticking your head into the sand and pretending that the latter is preferred for some sort of algorithmic purity is only going to lead to more software breaking due to parser differentials. At this point, I suggest engaging with the WHATWG standard: Implement it, improve it, follow it.

          There was a nice paper with the title “yoU aRe a Liar://A Unified Framework for Cross-Testing URL Parsers” that was accepted at the 2022 IEEE Security and Privacy Workshops (SPW), which I can recommend reading. I don’t agree with everything, but it shows how terribly widespread the issue is.

          1. 21

            The WHATWG URL Standard is maintained and RFC 3986 not. Sticking your head into the sand and pretending that the latter is preferred for some sort of algorithmic purity is only going to lead to more software breaking due to parser differentials.

            Or maybe not, since even high-profile libraries like curl are sticking with the RFC, as it is a stable standard that doesn’t change out from under you. I think you’ll find most URL parsing libraries in programming languages refer to the RFC and not the WHATWG spec. And it’s not just niche languages like CHICKEN and Haskell (of which our lib started as a port). For example, the docs for Python’s urllib.parse, PHP’s parse_url, Java’s URL class and Go’s net/url all only refer to the RFC. No mention of WHATWG in sight. Of all the more mainstream languages, it seems only the NodeJS URL library mentions WHATWG.

            The fact that the spec is evolving and “moving fast” is making the problem of parser differences worse. If you’re on a slower release cycle, you can’t really target such a moving standard all that well anyway. All in all, if more parsers stick with the stable RFC, that would actually mean there will be less breakage. And in fact, so far I haven’t seen any problems in practice with having a server accept only RFC-correct URLs. The WHATWG spec’s changes seem mostly concerned with “fixing” garbage user input (like removing spaces etc), which is only really relevant for user agents anyway. Servers should be strict and reject such garbage.

            Finally, given how many URL parsers that claim to implement the RFC actually break on edge cases, it’s very likely that this is also true for parsers that claim to implement the WHATWG spec. Maybe even more true because it’s harder to verify compliance due to being less formal.

            As for me, I now have a family, so don’t have the time anymore to work on the URL library that much. Life is too short to keep up with a living standard that doesn’t even publish versions. From my perspective, the library works (quite well, actually) and if bugs are found I’ll fix them. I would even stick my hand in a fire and claim that it’s less buggy than 90% of URL parsers out there (as in “not conforming to the spec”).

            At this point, I suggest engaging with the WHATWG standard: Implement it, improve it, follow it.

            I tried to get the BNF reinstated, as have others. The WHATWG doesn’t care. Worse, it is actively arguing that not having a BNF is somehow “better”. My blog post was an attempt at trying to get through to people, after the GitHub issue(s) was/were just ignored.

            1. 4

              the docs for Python’s urllib.parse, […] only refer to the RFC. No mention of WHATWG in sight

              The page you link to mentions WHATWG 4 times, which is only 2 fewer than RFC 3986.

              From experience it also follows neither. Notably, the main “urlparse” function is a complete trap, as it follows RFC 1738 but nothing really tells you that outright.

            2. 4

              Of all the more mainstream languages, it seems only the NodeJS URL library mentions WHATWG.

              I don’t know how mainstream you’d consider it, but Rust’s de-facto standard URL parsing library follows WHATWG, because that library was written for use in a Web browser context (Servo).

              I have had to wrap it to stop it from accepting stuff like “https://example.com/ is a URL but this isn’t” as a URL when other code was relying on it to reject that, but I don’t fault the library for that; it serves its original purpose.

              1. 11

                Interesting, I didn’t know Rust’s de facto URL parser was based on that spec. It looks like they have their work cut out for them, though: https://github.com/servo/rust-url/issues/850

                Especially this sounds like a problem you’d only have when following the WHATWG spec:

                we can figure out what exact new tests we are failing on. We should then determine which test failures relate to which spec changes, and open individual issues for implementing those changes

                I don’t envy them!

          2. 14

            RFC 3986 […]’s an […] old and unmaintained standard

            Maybe it’s the arrogance of a browser person speaking, but […] there’s no easy way to participate in an ecosystem that is moving fast. […] The WHATWG URL Standard is maintained and RFC 3986 not. Sticking your head into the sand

            I speculate that this may be a key point of difference of values: I, for one, don’t think it’s right for something as fundamental as the way of identifying/naming/locating resources on the Internet, as used by pieces of software to communicate with each other (as opposed to users), to be changed at all (except maybe once or twice per century), let alone to move fast.

            1. 1

              I think here is where we disagree. I strongly believe that you can’t have software that talks to the internet be both stable and secure. If you want it secure, keep updating.

              1. 20

                There’s a huge difference in updating to fix security bugs and changing the protocol itself. The latter should only really be necessary if there’s a fundamental flaw in the protocol.

              2. 12

                Can you explain how RFC3986 is insecure? That’s the first I heard of this.

                I’ve heard about parsing differences between implementations leading to security problems, but that’s not an issue that another spec helps to solve, let alone a rapidly evolving spec.

                1. 1

                  I’ve never said that RFC3986 is insecure. I’m saying that software needs to take into account that the world isn’t static.

                  Edit: The website is telling me that I should not comment on the same thread every minute and it’s right. I don’t think we’re going to end the discussion in a way where everyone happily agrees with one another. I’ll see if I can find some time to join the IRC channel later today.

              3. 7

                Arguably, the WHATWG standard is exacerbating this by opting not to include a normative formal grammar. Instead, it indirectly specifies the URL language in terms of a rather concrete stateful parser implementation in pseudocode. This makes it much more susceptible to security issues than an unambiguous formal grammar and thus requires updating it more frequently. With a formal grammar, most of these issues are relegated to the parser implementations. And since a grammar allows re-use of battle-tested parser libraries/generators, this should result in more robust implementations overall.

          3. 8

            There was a nice paper with the title “yoU aRe a Liar://A Unified Framework for Cross-Testing URL Parsers” that was accepted at the 2022 IEEE Security and Privacy Workshops (SPW), which I can recommend reading. I don’t agree with everything, but it shows how terribly widespread the issue is.

            That paper is available from one of the authors’ website: https://kapravelos.com/publications/youarealiar-secweb22.pdf

            An example from page 6:

            Most real-world applications have a huge code-base, modularized into multiple files, folders, and even applications in the case of a distributed application. Along with that, application developers keep changing and rotating over time.

            In such organizations, it’s often difficult to maintain consistent usage of third-party libraries across modules. In this PoC attack, we assume two separate modules - one is a utility module (Listing 1) which checks if a given URL is valid using urllib3, and the other module implements a file download feature using requests which depends on urllib (Listing 2).

            By passing a specially crafted URL, e.g., http://example.com:80@localhost:8080/secret.txt, an attacker can pass the checks and read secret.txt from localhost.

            The link one post up by sjamaan actually talked about that same problem in 2018:

            Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones for security reasons. Let’s say you have a pipeline of multiple processors which use different URL parsers. For example, you might have a HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file system-local URLs before invoking an HTML viewer.

            If these are all ad hoc “informal” parsers that try to “fix” syntactically invalid URLs, it is nigh-impossible to verify that filtering them for “safe” URLs is correct. That’s because it’s impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret an URL (radically) different from one up the stack and you have a nice little exploit in the making.
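
            To make the differential concrete, here is a toy sketch (in C rather than the paper’s Python, with made-up helpers) of two ad hoc “parsers” disagreeing about the host of that crafted URL:

              #include <stdio.h>
              #include <string.h>

              /* Ad hoc "validator": the host is whatever follows "://" up to the
                 first ':' or '/'. On the URL below it sees "example.com". */
              static void host_naive(const char *url, char *out, size_t n) {
                  const char *p = strstr(url, "://");
                  p = p ? p + 3 : url;
                  size_t i = 0;
                  while (*p && *p != ':' && *p != '/' && i + 1 < n)
                      out[i++] = *p++;
                  out[i] = '\0';
              }

              /* RFC 3986-style reading: userinfo ends at the last '@' in the
                 authority, so the real host is "localhost". */
              static void host_strict(const char *url, char *out, size_t n) {
                  const char *p = strstr(url, "://");
                  p = p ? p + 3 : url;
                  const char *end = strchr(p, '/');
                  if (end == NULL)
                      end = p + strlen(p);
                  for (const char *q = p; q < end; q++)
                      if (*q == '@')
                          p = q + 1;  /* skip userinfo */
                  size_t i = 0;
                  while (p < end && *p != ':' && i + 1 < n)
                      out[i++] = *p++;
                  out[i] = '\0';
              }

              int main(void) {
                  const char *url = "http://example.com:80@localhost:8080/secret.txt";
                  char a[64], b[64];
                  host_naive(url, a, sizeof a);   /* what the validator checks    */
                  host_strict(url, b, sizeof b);  /* what the fetcher connects to */
                  printf("validator sees: %s\n", a);  /* example.com */
                  printf("fetcher sees:   %s\n", b);  /* localhost   */
                  return 0;
              }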

          4. 7

            gopher: is a perfectly valid scheme for a URL, yet the WHATWG URL standard omits it. Just going by the WHATWG [1] standard, it seems that the gopher: URL has a non-informative host (false) or no IPv4 address (false). There are other URLs that follow the standard format of scheme://host:port/path/?query#fragment (like redis:, and I’m sure others), but to me it seems that the WHATWG is putting its head in the sand and going “La la la la la la there are no other URLs la la la la la!” I work with gopher: and gemini: [2] URLs, and I would have to give those up if I used a WHATWG URL parser (or ignore the “opaque host” crap and somehow work around gopher: URLs with IPv4).

            [1] I want to read that as “WHAT Working Group” which in my opinion, seems appropriate.

            [2] It’s a URL type used in the wild, I’m just saying.

            1. 3

              that kind of thing is also an issue for curl, given the wide range of protocols it speaks that browsers don’t care about

            2. 2

              Great point, actually. Quite a blind spot on my part. The WHATWG standard is truly very dismissive of non-browser use cases, which includes non-browser protocols. Indeed, it is – and maybe should be – useless for other protocol schemes.

              I guess e.g., the gemini protocol should be defined in gemini land?

              1. 3

                I guess e.g., the gemini protocol should be defined in gemini land?

                It is. It’s based on RFC 3986.

              2. 1

                I think it makes sense: in essence the WHATWG spec claims the informal “URL” subset of URI – the schemes browsers care about – and it specifies that.

                Other schemes are simply out of its scope and purview. If you want to know how to parse a gopher URI, check with whoever specifies gopher.

                1. 7

                  Which is then a fairly strong reason not to want a strict WHATWG-following parser, and since the subthread started with asking why anyone would not want to use WHATWG…

                  Also afaik WHATWG claims their URL spec aims to be a replacement of the existing URL specs, not just describe a subset?

                  1. 1

                    afaik WHATWG claims their URL spec aims to be a replacement of the existing URL specs, not just describe a subset?

                    Yes, that’s stated in https://url.spec.whatwg.org/#goals and clarified in GitHub issue 703.

      2. 9

        One advantage would be that it’s a single exact standard, whereas WHATWG URLs aren’t, as TFA mentions. A standard being unmaintained isn’t a problem if it’s done (and all else being equal, a standard is better if it doesn’t change).

        What imperfections are there? Are they significant enough to warrant changing it?

      3. 5

        Over the years, Daniel has written a bunch about his issues with various URL standards and the attitudes behind their development.

    16. 63

      I love this website. First thing I see? Code. Amazing. It is insane how some languages hide their code from me.

      And then I scroll down and… a repl? OMG. Yes.

      And then I scroll down more and there’s more code??????? Did an actual developer build this website????

      I remember when the Rust website was good and actually showed you code. Now it’s useless. https://www.rust-lang.org/ versus https://web.archive.org/web/20150317024950/http://www.rust-lang.org/

      I feel like you’ve really hit the best of “simple, stylistic, and shows me the fucking programming language”.

      1. 31

        They’re clearly marketing to different audiences. Rust doesn’t need to sell as a technology to engineers anymore, it needs to sell as a product to businesses. Most tech marketing sites undergo this transition the more they grow, in my experience

        1. 11

          I think that’s not true at all. What does it even mean, selling as a product to businesses? The way you do that is by targeting their engineers, the people who make technology choices. Ground-up is by far the best strategy for a language; top-down feels ridiculous. If you are actually in a position where you’re talking to businesses about funding, you’ve already sold to their engineers and your website is irrelevant – at that point they’re going to want to hear about what the language needs in order to succeed, because the org is bought in.

          Beyond that, this website contains all of the things that the Rust site does, but also code. It’s proof that you don’t need a tradeoff here - literally just put a damn snippet of the language on the site, that’s it.

          The reality is that language decisions are made almost exclusively when the company starts. By engineers. Companies only ever constrain their internal support languages in the future, and it’s engineers who push for language support when that happens. Selling a language to a business through the website is just a nonsensical concept on its face.

          Roc makes it clear, in text, without code, “here is why you would choose Roc” but also it shows code, very concisely right at the top.

          1. 20

            Cool! I think it’s true at all :)

            The idea is that rust has already captured the interest of engineers. Word of mouth, blog posting, conferences etc. “Everyone” knows about rust at this point. Engineers are building hobby projects and FOSS in it already.

            What I mean by selling to a business is that ultimately management aka business needs to approve and buy into rust otherwise it will remain in hobby project obscurity.

            Some engineer or group of engineers says to their boss “we want to use rust for project X” and their boss loads up the landing page because, while they’ve heard of Java and python, they haven’t heard of rust. This is why the first words you see are “reliable”, “efficient”, “safe”, and “productive”. They see big recognizable company logos and testimonials by well known or high-positioned leaders in the industry.

            For what it’s worth I also think it’s a good idea to put code on the homepage. You can do both, but this is why you see tech marketing sites evolve to become less and less technical over time.

            1. 5

              The idea is that rust has already captured the interest of engineers.

              The reality is that Rust is something engineers have heard of but largely know nothing about, other than some high level stuff like “fast”. A small portion of companies actually use Rust, and a small portion of engineers have ever touched it.

              Some engineer or group of engineers says to their boss

              By far the easiest way to gain adoption is through startups or smaller companies where the engineers are already the ones who decide on language. After a few thousand employees, companies always end up with someone locking down the decision making process and tiering the language support levels – at that point, again, it’s up to the engineers to work through the process to gain support. I’ve never heard of a situation where a top level manager had to go to a language website and make the call to use or not use a language – isn’t that sort of an absurd idea? But, again, Roc shows that this is not an “either, or” situation.

              This is why the first words you see are

              Roc shows “fast, friendly, functional” - I’m not saying you can’t have words. Roc shows that you can have these great, high level blurbs here.

              They see big recognizable company logos and testimonials by well known or high-positioned leaders in the industry.

              Roc also shows this. Again, I am not against the idea of content other than code on the site, I’m saying it’s ridiculous to have literally no code.

              For what it’s worth I also think it’s a good idea to put code on the homepage.

              Then we agree on the main issue here.

              1. 10

                I’m not debating your tastes here? Just explaining why sites follow this trend. Hope this helps

                1. 3

                  OK and I’m explaining why the reasoning isn’t sound and it’s a bad idea.

              2. 5

                I think the point is that Rust is at a stage where it’s marketing itself as an industry-ready language and ecosystem that you can build actual products on without worrying about running into a horrible bug that dooms your start up. At that stage, showing code doesn’t make too much difference, because the “industry-ready” aspect of it has so much more going for it than the surface language syntax you can showcase in a landing page.

                Roc, however, is at the stage where it needs to capture the interest of curious engineers that might be interested in the language itself, without worrying about the practical applicability of it too much. (BTW, I don’t want to imply that Roc can’t be used for anything practical at this point, I don’t know, but clearly I wouldn’t bet my start up on it and I doubt anyone expects that at this stage).

          2. 14

            What does it that even mean, selling as a product to businesses? The way you do that is by targeting their engineers, the people who make technology choices. Ground up is by far the best strategy for a language, top down feels ridiculous.

            Man, as a person who sells languages (mostly TLA+) professionally, I wish that was the case. Selling to engineers is way more fun and feels more comfortable. But I’ve had cases where an entire team wanted to hire me, and cases where the team never heard of TLA+ but a manager or CTO wanted to hire me, and the second case way more often leads to business.

            1. 3

              Which is the norm. Because our peers unfortunately mostly have very little power to make decisions.

            2. 2

              I’m not saying that it literally never happens, I’m saying that by far language adoption at companies is driven by engineers - either through external forces (engineers building tooling/libraries in their free time, engineers starting companies with that language, etc) or through direct forces (engineers advocating for the language at that company).

              I’m sure there are plenty of extreme cases like Oracle convincing schools to sell Java, or whatever, where this wasn’t the case.

              It’s all moot since Roc’s website clearly shows that a site can have code and high-level information, and meld them together well.

              1. 1

                I’m not saying that it literally never happens, I’m saying that by far language adoption at companies is driven by engineers

                The guy literally said he sells languages professionally, and he’s telling us it’s driven by management, not engineers. Do you also sell languages? Or better yet, have some source of information which trumps his personal anecdotes?

                In what way is Oracle an extreme case?

                I don’t disagree that you can still include code, but even the high-level blurbs will be different when selling to businesses – at which point, if the copy isn’t selling to developers, you might as well use the space where the code would have been to sell to businesses more effectively (even literally as empty space, for a less cramped, more professional looking page).

                Engineers care more about “fast (runtime)” and certainly “friendly” than most businesses do, and businesses care a lot more about “safe” than a lot of programmers do; “fast (compile time)”, “reliable”, “efficient”, and “productive” may be mixed. Engineers also care about technical details like “functional” a hell of a lot more than businesses do, because they can evaluate the pros and cons of that for themselves.

                I guess it’s much more effective to be very targeted in who you’re selling to, unless both profiles have equal decision making power.

                1. 1

                  The guy literally said he sells languages professionally, and he’s telling us it’s driven by management, not engineers.

                  And? I know almost nothing about them. They’ve apparently been pushing TLA+, an incredibly niche, domain-specific language that the vast majority of developers never even hear of, and that targets software with extreme correctness requirements. It’s nothing at all like Rust. Maybe they also sell Python, or comparable languages? I don’t know, they didn’t say. Is it even evidence that selling TLA+ this way is the most effective way?

                  Or better yet, have some source of information which trumps his personal anecdotes?

                  Years of experience in this field as an engineer, advocating and getting a large company to include languages in their accepted policies, generally just being an engineer who has adopted languages.

                  unless both profiles have equal decision making power.

                  I do not buy, at all, that for a language like Rust (frankly, I doubt it’s the case for any language, but I can imagine languages like TLA+ going a different route due to their nature and lack of appeal to devs) that it’s not wildly favorable to target developers. I’ve justified this elsewhere.

          3. 5

            The current Rust website is not ideal, but I think the general principle of convincing devs vs businesses holds.

            Developers are picky about surface-level syntax. The urge to see the code is to judge how it fits their preferences. I see lots of comments about how Rust is plain ugly and has all the wrong brackets.

            Business-level decision makers don’t care what the code looks like. It can look like COBOL, and be written backwards. They ask what will happen if they spend resources on rewriting their code, and that had better improve their product’s performance/reliability/security or their devs’ productivity, not just make a neat-looking quicksort one-liner.

            1. 2

              Of course the one who cares about code readability is the one who is going to read the code. What’s wrong with that?

              1. 3

                Readability is subjective. Syntax must be functional and clear, but devs will care about details beyond that, and die on hills of which brackets it should use, and whether significant whitespace is brilliant or terrible.

                Compare “The app is slower and uses more memory, but the devs think the code is beautiful” vs “The app is faster and more reliable, but the devs think the code’s syntax looks stupid”. If you want to convince product owners, the syntax is an irrelevant distraction, and you must demonstrate outcomes instead.

                In web development there’s also a point of friction in developer experience vs user experience. Frameworks and abstraction layers have a performance cost (tons of JS), but not using them is harder and more tedious.

                1. 2

                  While readability is subjective to a certain degree, it stems from human psychology, which is based on how the real world works.

                  In the real world, things that are inside a container are smaller than the container, so if delimiters don’t visually encompass the text they contain, it looks counter-intuitive.

                  And things which are physically closer together are perceived as being more strongly associated, so a::b.c looks like it should be a::(b.c), while in fact it’s (a::b).c.

                      Performance and safety have nothing to do with syntax. It would be entirely possible (and not difficult) to create a language exactly like Rust, but with better syntax.

                  1. 4

                    I think you’re confirming what I’m saying. You’re focusing on a tiny nitpick about a part of the syntax that doesn’t cause any problems.

                    And yet you’re prepared to bikeshed the syntax, as if it was objectively wrong, rather than a complex tradeoff. In this case the syntax and precedence rules were copied from C++ to feel familiar to Rust’s target audience. If you just “fix” the precedence, it will require parens in common usage. If you change the separator symbols, you’ll get complaints that they’re weird and non-standard, and/or create ambiguities in item resolution.

                    Showing code for a language invites such shallow subjective opinions, and distracts from actually important aspects of the language.

                    1. 1

                      You’re focusing on a tiny nitpick about a part of the syntax that doesn’t cause any problems.

                      Code readability is a problem, especially in a language that frequently features complex declarations.

                      And yet you’re prepared to bikeshed the syntax, as if it was objectively wrong

                      It is objectively wrong. < and > are comparison signs, not brackets. (And Rust does also use them as comparison signs, creating even more confusion.)

                      If you change the separator symbols, you’ll get complaints that they’re weird and non-standard

                      I think using \ instead of :: would be fine in terms of familiarity. Everyone has seen a Windows file path.

                      1. 5

                        This is literally bikeshedding. Don’t you see the pointlessness of this? If you changed Rust to use your preferred sigil, it would still produce byte for byte identical executables. It wouldn’t change anything for users of Rust programs, and it wouldn’t even change how Rust programs are written (the compiler catches the <> ambiguity, tells you to use turbofish, and you move on).

                        Rust has many issues that actually affect programs written in it, e.g. bloat from overused monomorphisation, lack of copy/move constructors for C++ interop and self-referential types, Send being too restrictive for data inside generators and futures, memory model that makes mmapped data unsafe, blunt OOM handling, lack of DerefMove and pure Deref, lack of placement new for Box, etc. But these are hard problems that require deep understanding of language’s semantics, while everyone can have an opinion on which ASCII character is the best.

                        Even Rust’s syntax has more pressing issues, like lack of control over lifetimes and marker traits of async fn, or lack of syntax for higher-ranked lifetimes spanning multiple where clauses.

                  2. 1

                    You’re thinking of physics, not psychology. Concrete syntax is one of the least important parts of a programming environment, and we only place so much weight onto it because of the history of computers in industry, particularly the discussions preceding ALGOL and COBOL.

      2. 5

        And they actually explain what they claim with “What does X mean here?”.

      3. 3

        https://web.archive.org/web/20150317024950/http://www.rust-lang.org/

        There is also https://prev.rust-lang.org, linked (not prominently) from the bottom of https://www.rust-lang.org. It doesn’t have the “eating your laundry” footnote, but the code runner works, unlike in web.archive.org.

    17. 26

      Although the sysadmins couldn’t use the DST mechanism in Windows, people still wanted the computers’ clocks to match their wall clocks, so we had to actually move the machine’s internal clock.

      All this “We don’t do DST” sounded reasonable enough, until this. :-)

      1. 5

        Windows (at least 95 and 98) used to do this itself - modify the hardware clock to adjust for DST. That was super irritating in a dual-boot setup with Linux or BSD, where the kernel would expect the hardware clock to be stable.

        1. 2

          There was the classic Windows 95 bug that occurred if you left the computer on when the clocks go back: it got stuck in a time loop :-)

          1. 2

            heh, cool. Never ran into that one myself.

        2. 1

          Oh, this happened as late as windows vista, the last time I ran a dual-boot setup. It may even still be a thing.

        3. 1

          I’m pretty sure it’s still a thing, but you can turn it off in the registry. Makes dual booting very annoying!