Threads for bdesham

    1. 21

      The article states that “the unpredictability of UUID allows their public usage without disclosing sensitive internal information and statistics” but also acknowledges that UUID v7 “encodes a Unix timestamp with millisecond precision in the most significant 48 bits”.

      I can think of situations where the generation time of a UUID might provide sensitive information, especially when combined with other information.
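
      For reference, here’s a rough sketch (in Python) of the v7 layout the draft describes: the 48-bit millisecond timestamp up front, then version and variant bits, with the rest random. Field sizes follow the draft; this is just an illustration, not a production implementation.

      import os, time, uuid

      def uuidv7_sketch():
          # Draft layout: 48-bit Unix ms timestamp | 4-bit version | 12 random bits
          # | 2-bit variant | 62 random bits
          ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
          rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
          rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
          value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
          return uuid.UUID(int=value)

      print(uuidv7_sketch())  # the generation time is recoverable from the first 12 hex digits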

      1. 15

        Implementations SHOULD NOT assume that UUIDs are hard to guess. For example, they MUST NOT be used as security capabilities (identifiers whose mere possession grants access). Discovery of predictability in a random number source will result in a vulnerability.

        Timestamps embedded in the UUID do pose a very small attack surface. The timestamp in conjunction with an embedded counter does signal the order of creation for a given UUID and its corresponding data but does not define anything about the data itself or the application as a whole. If UUIDs are required for use with any security operation within an application context in any shape or form then UUIDv4, Section 5.4 SHOULD be utilized.

        draft-ietf-uuidrev-rfc4122bis-11 Security Considerations

      2. 2

        So there’s 74 bits of random data per UUID. You have a ~50% chance of collision after 2^37 operations. That’s ~137 billion operations. Since this is gating data over the internet, that seems sufficient. I didn’t take into account the fact that we’re assuming we can guess the timestamp here - that likely adds a few extra bits of entropy. Of course, since they’re UUIDs they’re probably not treated like cryptographic secrets and are therefore more likely to be missing timing-attack protections.

        All in all, I would echo the “don’t expose something that’s a secret to a user”. If you use capabilities for safety internally have an external mapping that treats the value as a proper secret and can manage the complexities therein - rotation, for example.
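
        As a back-of-the-envelope check of those numbers (a rough sketch using the usual birthday-bound approximation):

        import math

        random_bits = 74
        n = 2 ** (random_bits // 2)                      # 2**37 IDs
        print(f"{n:,}")                                  # 137,438,953,472 (~137 billion)
        # birthday approximation: p(collision) ≈ 1 - exp(-n^2 / 2^(bits+1))
        print(1 - math.exp(-n * n / 2 ** (random_bits + 1)))  # ≈ 0.39, same ballpark as "~50%"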

        1. 12

          With so many architectural leaks in hardware and timing attacks against cryptography, I’d be careful with exposing precise timestamps.

          IMHO the time precision here is excessive. You don’t need it to avoid collisions — that is always done better by random bits (they have maximum entropy, more than the timestamp). You only need time to avoid trashing btree-based database indexes.

          1. 1

            That’s an interesting problem I hadn’t heard of before. For UUIDs you intend to expose to the world, I wonder if you could mitigate this by adding a random offset to each timestamp, then rounding to the nearest second. You’d drop the birthday collision rate to 2**32 per second but that might be fine. (You could also fill the 10 newly-free bits with random, why not? The RFC committee can’t prove a thing!) :)

          2. 1

            I think it helps make time ordering fuzziness more acceptable.

        2. 3

          So there’s 74 bits of random data per UUID. You have a ~50% chance of collision after 2^37 operations. That’s ~137 billion operations.

          You’d have to do those 2^37 operations within the same millisecond. So assuming your clocks are running correctly, my guess is that in practice the birthday attack is basically nullified, and you’d have to wait fairly close to 2^73 operations on average to get an actual collision.

          1. 2

            If the value is used in a URL you don’t have a time limit, you just need to guess the timestamp.

            1. 1

              I wasn’t assuming an adversarial setting.

      3. 2

        Absolutely! Any sort of events marked with such an identifier can be used to very precisely determine event ordering and also implied run times. If you were to collect observations of such timestamps along with the events with which they are associated you could find things like race conditions, unstable ordering of concurrent processes, leaked data from business logic which does not have constant time guarantees, etc…

        It is folly to think that preventing someone from guessing what the next monotonic identifier will be is the same as not disclosing sensitive information.

      4. 1

        +1

        I appreciate that there is another option/scheme, but do not see many use cases for it.

        1. 10

          The main issue with incrementing identifiers is that they are predictable. There was a nice demo a few years back of creating a tweet that linked to a future tweet by guessing the ID of the next one. This was possible because tweet IDs were (roughly) simple counters. With UUIDv7, you still get nice sorting properties, but a user trying this kind of guessing has to guess the remaining 80 bits (74 of which are random).

          1. 3

            I’m not sure if it’s the same thing you’re referring to, but this post by Oisín Moran describes the process of creating a tweet whose content is its own URL. (This was considerably more difficult to do before the Edit button ;-) )

          2. 2

            Was this before Twitter introduced the Snowflake scheme (referenced in the article)? I do believe those are ordered in time but non-predictable.

            1. 2

              IIRC Snowflake IDs are more predictable, because the ID is a timestamp (since a custom epoch) + worker ID + count (for more than one tweet per ms).
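
              A rough sketch of that layout as it’s commonly described (the 41/10/12-bit split and the custom epoch below are the usually cited values, not taken from Twitter’s actual code):

              TWITTER_EPOCH_MS = 1288834974657  # commonly cited custom epoch

              def snowflake_id(ts_ms: int, worker_id: int, sequence: int) -> int:
                  # 41 bits of ms timestamp | 10 bits of worker id | 12 bits of per-ms sequence
                  return ((ts_ms - TWITTER_EPOCH_MS) << 22) | ((worker_id & 0x3FF) << 12) | (sequence & 0xFFF)

              print(snowflake_id(1700000000000, worker_id=7, sequence=0))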

            2. 1

              Yes, I think Snowflake solved the same problems.

      5. 1

        their public usage

        I’ve always been told to have separate IDs for the DB and the Application/API layer. Sequential integers or UUIDv7 could be fine for DB IDs, but for application/API IDs, I usually use a base64 encoded hash of the fields with a UNIQUE constraint. Something similar is done automatically with GraphQL Relay IIRC.

        1. 4

          This seems like an awful lot. What industry are you building stuff for?

          1. 2

            I worked on an application that did the same thing in the financial services industry.

          2. 1

            Not sure what you mean by “a lot”. I do not use all the UNIQUE fields in the “business ID”, just the relevant ones. The idea is to separate the technical identifier (used for sharding for example), and the business identifier (presented to the user, used for caching/refetching).

            Sometimes, it’s just as simple as base64("{tablename}_{record_id}"), but if you don’t want to leak the technical identifier (record_id), you use other data.

            I’ve seen this pattern in finance mostly.
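
            A minimal sketch of the simple variant (table and field names here are made up for illustration):

            import base64

            def public_id(table: str, record_id: int) -> str:
                # Opaque, URL-safe identifier for the API layer; the DB keeps its own key.
                raw = f"{table}_{record_id}".encode()
                return base64.urlsafe_b64encode(raw).decode().rstrip("=")

            def parse_public_id(token: str) -> tuple[str, int]:
                padded = token + "=" * (-len(token) % 4)
                table, _, record_id = base64.urlsafe_b64decode(padded).decode().rpartition("_")
                return table, int(record_id)

            print(public_id("accounts", 42))                    # 'YWNjb3VudHNfNDI'
            print(parse_public_id(public_id("accounts", 42)))   # ('accounts', 42)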

    2. 5

      The best feature of YAML is that you can ignore it and use JSON instead.

      I prefer software to be configured by simple formats, so if hand writing becomes cumbersome, I’m the one choosing which tool to use to generate it, not the software.

      1. 6

        As far as I know, this is false: 1 2. So you can’t assume valid JSON will always be valid YAML.

        1. 6

          This is because only YAML 1.2 is a true superset of JSON, so if you have a <1.2 YAML parser, it can fail to read valid JSON.
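
          For example, with PyYAML (which implements YAML 1.1), a perfectly valid JSON number can come back as a string; behavior will differ with a strict 1.2 parser:

          import json, yaml  # PyYAML implements YAML 1.1

          doc = '{"a": 1e2}'
          print(json.loads(doc))      # {'a': 100.0}
          print(yaml.safe_load(doc))  # {'a': '1e2'} -- the 1.1 resolver doesn't treat 1e2 as a number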

        2. 2

          Didn’t know about those differences, thanks for pointing them out.

          Now the usefulness of YAML has dropped to absolute zero.

      2. 3

        I have to say that this is the best approach. Just take JSON and let the user generate it using whatever tool or language they prefer. I can see some users being annoyed if they are maintaining the config by hand, but at least for me that isn’t an issue.

        1. 1

          Dhall has already been mentioned here, but I will point out that “compiling” to JSON and YAML is one of its core use cases.

          1. 2

            My point is don’t bake it into the thing you are configuring. Let the user decide which language/tool they like. They can use Dhall, but my service just accepts JSON. Another user can use YAML and another can use Lua.

            1. 2

              Oh, yes, I was agreeing with you :-) If you accept JSON, then people can write directly in JSON, or they can use Dhall, or they can emit JSON from any number of other libraries or programs.

    3. 12

      It’s an interesting question; some thoughts:

      JavaScript’s UTF-16 string representation also “infected” JSON, which infected other languages - https://www.oilshell.org/blog/2023/06/surrogate-pair.html

      • JSON strings represent neither the set of valid Unicode strings, nor the set of valid byte strings. It’s Unicode strings, plus some other errors that travel over the wire

      I found one way the 10-day sprint definitely did harm JavaScript: Brendan Eich didn’t have time to add a garbage collector, and later attempts to add that in added a bunch of security holes

      But they added it later without breaking compatibility. So it may have harmed it in the short term, not the long term

      The type coercion, and the unintended consequences of it, seem to be the biggest problem with JS. There are some variable scope issues that have arguably been fixed.


      But JavaScript’s not unique – many languages from this era share similar problems. In some cases they share the exact same ones!

      • Up until very recently, Lua shared the “all numbers are one type” problem (mentioned in this post wrt JS)
        • JS now has BigInt
      • Undefined vars in Lua also evaluate to nil (something like undefined) rather than raising an error; Python raises NameError, which I think everyone agrees is better
        • Same with nonexistent keys in Lua, whereas Python gives you KeyError and AttributeError

      and

      • I don’t have details off hand, but when I looked into it, both PHP and Perl had basic bugs in their containers. It’s not about the syntax – it’s about the compositionality of the containers, e.g. can you make a List[List[int]] or a Dict[Int, List[Int]]
      • Lua also has only one container, which most people find an odd, false economy (interested to hear an argument otherwise)

      The real problem is that language features interact, and it takes time to test all the interactions and gain experience with them. Empirically speaking, a year is better than 10 days, and probably 5 years is better than 1 year

      So even though it wasn’t just the 10-day thing, it’s clear that JS was rushed in many ways. Later it got much more attention, although then that also leads to the design-by-committee problem

      Language design is hard :)

      1. 13

        The fact that month 1 is February comes from Java

        Java in turn copied the behavior from C. Some guy wrote an investigation of that fact that I found interesting.

      2. 10

        What is annoying about JS is that obj.doesnotexist is fine but obj.doesnotexist.somethingelse blows up. It should either blow up the first time or never. Blowing up on the second dot is very annoying.

        1. 2

          And since you cannot have it blow up on the first for compatibility, you now have optional chaining (?.)

          1. 2

            Of a sort. Objective-C (1986) does this, in that messages to nil always return nil. Very useful.

            1. 2

              Sometimes useful, mostly hides bugs because it lets nils spread undetected until something finally blows up and you have a happy fun time tracking down where it came from.

              1. 2

                I almost brought that up. But in my experience it was the other way around: mostly useful, occasionally hid bugs.

      3. 5

        The fact that month 1 is February comes from Java

        The entire Date API was lifted more or less wholesale from Java.

        Which on the one hand is sad because it’s trash, but on the other hand I’m not sure there’s any good API from back then.

        But more than one of JS’s worst issues stems directly from trying to give it a Java coat of paint. Constructors are another one.

        1. 4

          I find it amusing to imagine that if we magically transported lobste.rs back in time to the late 90s, we could do a wholesale s/Rust/Java and preserve the tone entirely.

      4. 4

        both PHP and Perl had basic bugs in their containers. It’s not about the syntax – it’s about the compositionality of the containers, e.g. can you make a List[List[int]] or a Dict[Int, List[Int]]

        I don’t know about PHP, but what do you mean by bugs in containers in Perl?

        Perl basically only has 2 data structures, scalars and lists (hashes/dicts are a special case of lists). If you want something more complicated, references are your tool of choice.

        List[List[int]] is meaningless in Perl, as there’s no Int type, only scalars. But you can define

        my @list_of_ints = ([1,2,3],[4,5,6],[7,8,9]);
        

        and then (clunkily) get the scalar 9 via

        my $nine = $list_of_ints[2]->[2];
        

        The second example is much more idiomatic in Perl:

        my %ints_by_int = (1=>[1,2,3],4=>[4,5,6],7=>[7,8,9]);
        

        which is accessed by

        my $nine = $ints_by_int{7}->[2];
        

        These data structures are internally consistent, so rather than “bugs” they might be better described as “clunky” or as “warts”.

        1. 1

          Yeah I think that is similar to what I’m referring to

          • why is the delimiter () at the first level, and [] at the second level
          • why is the operator [2] at the first level, and ->[2] at the second level

          Now some people might say that’s just syntax, but I believe it messes up real algorithms, especially recursive ones.

          It’s similar to something I discovered in R – R has no distinction between scalar and vector, e.g. c(42) is the same thing as 42 in R.

          So it does not distinguish between dimension 0 and dimension 1. This also means it doesn’t distinguish between dimension N and N+1, because many operations are generic, and often increase or decrease the dimension.

          This realization came after fixing dim==1 bugs in real code. Basically I had to go and add if (dim == 1) in several places – the language design actually caused a bug in the program (which I didn’t write).


          The other quote I usually bring up is from - https://sites.google.com/site/steveyegge2/ancient-languages-perl

          Perl has references.

          Perl’s references are basically pointers. As in, C-style pointers. You know. Addresses. Machine addresses. What in the flip-flop are machine addresses doing in a “very high-level language (VHLL)”, might you ask? Well, gosh, what they’re doing is taking up about 30% of the space in all Perl documentation worldwide.

          Being nominally responsible for developer training here, you’d think I might tend to notice when some particular concept, such as Perl’s references, takes up 1/3 of the classroom time. And 1/3 of the discussion time. And 1/3 of my farging blog. And you’d be right, you would. They’re not pulling their weight. Not by a long shot.

          Languages are hard to design. They are. Really. Language acquisition is one of the most fundamental activities of humankind, one of the most researched, yet still among the least understood.

          There’s a similar issue with shell – it has this hack called nameref, via declare -n.

          Also it has a ref operator ${!ref}, which is not the same.

          To me these are signs of “scalar hacks”, i.e. not having orthogonal, recursive, garbage-collected containers, which can be passed uniformly in and out of functions.


          Anyway, I admit I don’t have an airtight argument off hand, but after working on the data structure and garbage collector for Oils – I realized that Python and JavaScript have “nicely shaped heaps”

          The data structures compose, and it is easy to write recursive algorithms

          Garbage collection works uniformly

          Whereas shell, PHP, and Perl are more similar. They have lots of special cases in their data structures, and are not uniform.

          Shell also shares the issue with Perl where the type is attached to a location for a value, not the value itself. In other words, it’s context-specific.

          Now I know some context-specific stuff is considered a feature in Perl 5, not a bug.

          I’m not really versed in the details right now, but I believe some of it was changed in Perl 6, which indicates that the designers thought it was a design bug in Perl 5.

          My impression is that the data structure/garbage collector design in Perl 6 is closer to JS/Python/Oils than it is to PHP/Perl 5/shell.

          I guess bottom line is that I claim you don’t need that much documentation to explain what Python and JS do, but you need a lot to explain what Perl 5, PHP, and shell do. Also, it’s easier to implement something like Python/JS than the latter, which is another way of looking at the documentation issue.

          Smaller amounts of VM code are easier to document. I have seen some evidence that Perl and PHP both started out as not fully recursive – not garbage collected – similar to how JS did not have a garbage collector to start. Garbage collection is hard to implement!

          Or maybe I will make a weaker claim – people are “used to” Python/JS structures now, which is just as good for Oils :)

          1. 3
            • why is the delimiter () at the first level, and [] at the second level
            • why is the operator [2] at the first level, and ->[2] at the second level

            Perl aggregates (arrays and hashes) are not references by default, which causes this nonuniformity.

            When I was writing Perl, if I had any nesting in my data structures I would use refs as uniformly as possible, so

            my $array_of_ints = [[1,2,3],[4,5,6],[7,8,9]];
            my $nine = $array_of_ints->[2]->[2];
            

            With the mild disadvantage that everything becomes a scalar so you lose static typing.

          2. 1

            Actually I just googled, and stuff like this is pretty strong evidence in the direction of what I’m saying

            https://news.ycombinator.com/item?id=10345728

            https://design.raku.org/Differences.html

            In particular, “references are gone, or everything is a reference”, and more changes like that. There are MANY fundamental changes in containers and their syntax.


            To me the litmus test is – would you choose the same design when implementing a new language, 30 years later?

            For Perl 5, the answer is no – Perl 6 didn’t make the same choices.

            For Python, the answer is roughly yes – Python 3 preserved the same choices (the only real difference was string/unicode, plus removing remnants of int vs. long, which were unified early in Python 2).

            So yeah I am doubling down on my claim that there’s a fairly hard line in terms of language design between Python/JS and PHP/Perl 5/shell – and it does appear to me that in important ways, Raku is closer to Python/JS.

            Oils is very close to Python/JS, cleaning up a few more warts

            • like removing “accidentally quadratic” behavior caused by the in operator being either O(1) on dicts, or O(n) on lists (and again I’ve fixed this bug in other people’s code more than once; a quick sketch below.)
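
            A quick sketch of that accidental quadratic (timings are indicative only):

            import time

            items = list(range(10_000))
            for name, container in (("list", items), ("set", set(items))):
                start = time.perf_counter()
                hits = sum(1 for x in items if x in container)  # 'in' is O(n) on a list, O(1) on a set
                print(name, hits, f"{time.perf_counter() - start:.3f}s")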

            And of course it has the whole compatible shell runtime, meaning it’s closer to the OS, and its idioms can be much faster


            I wrote that string-ish languages are a “stable design point”, and what I didn’t appreciate for a while is that Python/JS-style GC heaps are also a stable design point. And PHP/Perl/Shell don’t have those.

            https://www.oilshell.org/blog/2023/06/ysh-review.html#this-is-hard

            What I would say is that PHP, Perl, R, and shell are awesome tools. I wrote PHP for the first time in 2021, as a little test, and largely liked it. It was way easier to write a simple web app in PHP than in Python.

            But Python is a better language – in terms of expressing correct algorithms, without bugs in the corners.

      5. 3

        UTF-16 is the worst of all worlds, because it’s neither fixed size nor compact.

        Compactness depends on the language. For many languages it’s as compact as or more compact than UTF-8. Of course HTML is in large part ASCII markup, so it tends to tip the scale in favour of UTF-8 whatever the language.

        1. 6

          Yeah there may be some corner cases where UTF-16 does OK, but if you had to pick a default, it’s clear it would NOT be the winner

          Not even Java or JavaScript folks would argue that now

          UTF-8 has already been decided by people voting with their feet – Windows has even taken steps toward UTF-8, and acknowledged the mistake - https://www.oilshell.org/blog/2023/06/surrogate-pair.html#future-windows-and-python-are-moving-toward-utf-8

          It’s just that you can’t change the semantics of JS or Java, so we’re stuck with that legacy

          1. 7

            I’d argue that UTF-8’s killer feature is that it’s a superset of ASCII. Without that it would not have won so easily.

          2. 6

            UTF-16 is more compact than UTF-8 for CJK languages (most characters are two bytes, rather than three in UTF-8). That’s a huge set of people. Storing Chinese text in UTF-16 will have better cache usage, better network efficiency, and so on than UTF-8.
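
            A quick check of the size difference (BMP CJK characters are 3 bytes in UTF-8, 2 bytes in UTF-16):

            text = "統一碼字符編碼"               # 7 BMP CJK characters
            print(len(text.encode("utf-8")))      # 21 bytes
            print(len(text.encode("utf-16-le")))  # 14 bytes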

            That said, the reason for adopting UTF-16 was terrible. OpenStep used UCS-2 and Unicode strings everywhere. When Unicode grew, they redefined unichar as a UTF-16 code unit. The Java standard library was mostly designed by ex-OpenStep folks at Sun and picked up this design choice (I can’t remember if Unicode growing beyond 2^16 code points happened just after Java shipped or just before). JavaScript picked it up from Java.

            This made sense when you remember that one of the main selling points for JavaScript was scripting Java Applets. There was no direct way for applets to communicate originally, you put one on the page and it had some limited interaction with the browser. You could use JavaScript to tie together a bunch of widgets into a web app. Having the same string format as Java avoided encoding conversions at this boundary. I doubt more than a handful of sites ever ended up using JavaScript for this.

            1. 4

              Unicode draft 1990, Unicode proper 1991, UTF-8 1992, Java 1995, UTF-16 1996.

              1. 3

                Thanks. It looks as if Unicode 2.0, in 1996, was the first version to grow code points beyond 16 bits. So both Java and OpenStep predated this and it made a lot of sense for them to standardise on codepoint-wide character units.

                OpenStep did this somewhat better than Java by separating string storage from the interface. This lets it store characters internally as ASCII, some custom encoding, or UTF-{8,16,32} and expose them as sequences of unicode characters; it’s just a shame that they standardised the width of unicode characters before Unicode 2.0 and so picked 16-bit integers. Doing the same with 32-bit integers is probably the right thing to do.

                1. 2

                  Yup. Win32 and Java and JavaScript all appeared back when there was no UTF-16, and 2-byte chars were Unicode codepoints.

          3. 2

            The benefits of UTF-8 vs. UTF-16 aren’t clear to me. What is clear though is that both Java and JavaScript have Unicode strings, by default, and have had since inception, which is outstanding, given they were created in the 20th century. To give examples of languages that hadn’t adopted Unicode at that time: PHP, Python and Perl, so basically the P letter in LAMP.

            So, IDK, I’d rather have the Java legacy, than the C legacy. And Java’s String data type has been one of the best features of Java. But there’s simply no pleasing software developers.

      6. 1

        I think JS’s most unique mistake is not having integers. I really don’t understand this omission; sure, floats can represent many integers, but they have very unintuitive behavior towards the extremes. While the wraparound behavior of ints was/is also a big source of bugs, it is much easier to reason about.

        I think it was a very big price for very little benefit.
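
        That “unintuitive behavior towards the extremes” is the usual IEEE 754 double issue, and it’s easy to demonstrate with Python floats, which are the same 64-bit doubles as JS numbers:

        big = float(2 ** 53)    # 9007199254740992.0, the last point where every integer is exact
        print(big + 1 == big)   # True: the +1 is silently rounded away
        print(big + 2)          # 9007199254740994.0: representable values now step by 2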

        1. 4

          It saves time during the implementation! Otherwise you have to consider more (Float, operator, Int) cases. Which, notably, C got wrong with auto-coercion!

          As mentioned, Lua made the exact same choice as JavaScript. It saves time.

          When you do language design, you basically have an (M operators x N data types) problem. When you subtract 1 from N, it saves you M cases of work :) And that’s significant.

          I have a blog post draft explaining this concept, requires login - https://oilshell.zulipchat.com/#narrow/stream/266575-blog-ideas/topic/N.20Perlis-Thompson.20Problems.20in.20Language.20Design

          Examples:

          • float/int, or just float
          • signed/unsigned, or just signed
            • Java doesn’t have unsigned, which is painful to C programmers in the same way that having only floats is painful here
            • on the other hand, unsigned in C++ causes design mistakes! Bjarne and many other people agree that std::vector::size() returning an unsigned type is a mistake, even though it can’t be negative. These choices are hard.
          • string/unicode, or just string
          • dict vs. list, or just list (Scheme)
          • control flow: async/sync, or just sync
          • for YSH: procs/func, or just proc

          Basically what I think of as “Perlis-Thompson problems” are the same as Nystrom’s well known “What Color Is Your Function?” post.

          https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

          Once you introduce async and sync, now you have to compose them. Similarly, when you introduce both int and unsigned, now you have to compose them.

          Doing floats and ints isn’t necessarily hard – obviously Python, say, does it, and it’s similar to JS – but notably many languages have gotten it wrong! Which says it’s not trivial.

          Though Python used to have both long int and integer, and they were unified in a mostly compatible way around Python 2. Language design is hard! Lots of compromises.

          Also, it’s not true that choosing 2 types is always better than choosing 1. It’s a pretty deep compromise and tradeoff.

          Most people now seem to think Rust got the async/sync one wrong, etc.

          1. 2

            on the other hand, unsigned in C++ causes design mistakes! Bjarne and many other people agree that std::vector::size() returning an unsigned type is a mistake, even though it can’t be negative.

            This is pretty much entirely on C’s crazy integer promotion. Essentially every argument in “Subscripts and sizes should be signed” boils down to “integer promotion causes this to shit the bed”.

        2. 2

          It’s particularly irritating given that JavaScript is a descendant of Self, which is a descendant of Smalltalk, which was heavily inspired by Lisp and all of these had an integer type stored as a tagged value in a pointer and transparently promoted to a (heap-allocated) big integer object on overflow.

          1. 1

            I’m currently involved in a Scheme to WASM compiler so this is hitting hard. The lack of JS ints makes the Scheme<->WASM bridge less efficient than it would be otherwise.

        3. 1

          For my money the biggest mistake is having every nonsensical numerical operation return nan instead of faulting.

          JavaScript is the only language I know where unexpected nans are routine, and I don’t think I’ve ever been happy for it.

          1. 1

            I thought JavaScript is basically the same as C or Rust in this respect, ie, following IEEE 754. (Python floats are the same, I think?)

            One notable difference is that integer division by zero (or overflow) raises an exception, but because JavaScript doesn’t normally do integer division, you get INF instead.

            1. 1

              I thought JavaScript is basically the same as C or Rust in this respect, ie, following IEEE 754. (Python floats are the same, I think?)

              What I’m talking about is all the operations which generate nans from non-nans, not the existence of nans.

              Very few languages return a nan when you ask for 1 - 'fanf', even ignoring those who’ll yell that it’s a type error.

              1. 1

                Oh, yeah, that is weird. Possibly more sensible than Perl, though.

          2. 1

            They are part of the floating point standard, so it is probably the same in all languages; they just use integers most of the time — hence my questioning the decision to omit them.

            1. 2

              nans are part of the floating point standard. Array(16).join('hero' - 1) + "Batman"; is not.

    4. 6

      YAML has a lot of problems, but this website is also pretty silly. Octal and base-60 integers are just bullshit, sure. “No” being false seems dumb when it’s just bitten you, but it’s a feature that lots of people use, so it’s pretty obvious how it seemed like a good idea.

      The rest of these would be even worse if kubernetes or ansible or whatever CI provider used INI or JSON. YAML is a bad tool, but it’s also the wrong tool, and switching to a better wrong tool won’t help with that.

      1. 13

        “No” being false seems dumb when it’s just bitten you, but it’s a feature that lots of people use, so it’s pretty obvious how it seemed like a good idea.

        I guess I just want to use technologies where “I guess it seemed like a good idea” isn’t enough to get features included; is that too much to ask?

        1. 9

          I mean I don’t disagree, but YAML was created by humans trying to do a good job, and I guess I feel like we owe it to them to recognise that they weren’t just doing gratuitously stupid shit to spite us. It’s a mistake I think a lot of people would have made without the benefit of hindsight.

          1. 9

            I think it’s possible to acknowledge on the one hand that “YAML was created by humans trying to do a good job,” and on the other hand that YAML has some significant downsides that make it a bad choice in many situations. It’s really hard to create a good configuration file format. (I think the YAML authors did a better job than I would have!) But just because it’s a difficult problem doesn’t mean that every solution is equally worthy.

          2. 6

            Humans trying to do a good job can still end up doing a bad job. I don’t think anyone involved in the creation of YAML acted immorally for having done it; that said, I would prefer it if other pieces of software I use did not use YAML as their configuration format.

      2. 7

        “No” being false seems dumb when it’s just bitten you

        That isn’t the problem; the problem is unquoted strings. If strings don’t need to be quoted then you have a huge footgun, because anything that’s a single-token identifier is now not a valid string and must be quoted if you want to use it as such. You can’t ever add new tokens to the language because someone might have used them as a string.
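
        Concretely, with a YAML 1.1 parser such as PyYAML (a quick sketch; strict 1.2 parsers behave differently):

        import yaml  # PyYAML implements YAML 1.1

        print(yaml.safe_load("country: NO"))     # {'country': False}  -- the "Norway problem"
        print(yaml.safe_load("country: 'NO'"))   # {'country': 'NO'}   -- quoting restores the string
        print(yaml.safe_load("version: 1.20"))   # {'version': 1.2}    -- unquoted token read as a float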

    5. 31

      From running, enduring, and observing several rounds of hiring:

      • Keep it short, ideally a page front and back. Resume, not CV.
      • I will check that you have public code. No public code is–usually–negative signal.
      • I will check that your Github stuff (for example) is not just forks of other work.
      • Be specific in what you did in a role–I know how people write these things, and it’s a red flag to say “helped ship a project”. You could’ve been the coffee gopher and that statement would still be true.
      • Cover letter, especially if asked. Easy “can this person read and follow simple directions?” test.
      • For higher-end and executive positions, write thank-you notes. This one surprised me, but I’ve seen it cost some veep candidates their shot.
      • Paragraphs are not as helpful as concise bullets.
      • Dates and titles are helpful.
      • If you have one, mention your clearance.
      • Take a second to trim your mentioned experience to the job–if I’m hiring an EM, code experience is not quite as interesting to me. If I’m hiring an IC for a web thing, your school raytracing project is off-topic.
      • Don’t add any social media you wouldn’t want to be considered from a culture-fit standpoint. Your X account owning the libs or your Mastodon making fun of boomers may not have the effect you expect.
      • Spelling and grammar mistakes are extra bad. Easy problem to solve, and it makes you look sloppy and inattentive to detail…typically bad qualities in a candidate.
      • If you are applying for a job outside your skillset (say, MUMPS programming), include experience that emphasizes your adaptability.

      All of these have exceptions, of course–if you spent a few years at a defense contractor I’m not going to be too surprised if you don’t have a lot of public source code.

      1. 43

        No public code is–usually–negative signal.

        As you point out, there are many employers who

        • …don’t want any of their code to be published
        • …don’t want to jump through the (perceived or real) legal hoops of publishing code under a FOSS license
        • do allow their employees to publish code, but put a bunch of red tape in the way, so that it would be self-defeating for any employee to actually try to do it

        It’s not at all limited to defense contractors.

        The larger problem with using public code as a signal is that it puts people at a disadvantage if they don’t have the time or energy to publish projects outside of work. Lots of people have caregiving responsibilities that don’t leave them time for outside-of-work work, and a hiring process that values a well-stocked GitHub profile implicitly devalues parents and other groups.

        1. 8

          I read it charitably as “usually” and “signal” doing a lot of heavy lifting. I.e., no public code won’t instantly disqualify a candidate, but will be a nail in the coffin if there are other negative signals. Which I think is valid.

          1. 18

            Right, so in a head-to-head comparison between two candidates you’ll choose the one without kids? Or you’ll favor the young one over the older, because the older one “can’t show what code they’ve been writing because of having an actual job” whereas the young one can more easily point to work done in public recently?

            Like you understand “can have publicly listed code” is going to be significantly biased by age, right?

            Similarly, the way a lot of women are treated online means many intentionally limit their public presence, so I suspect you’ll get gender bias there as well.

            1. 3

              Sounds convenient!

          2. 6

            The problem with @friendlysock’s approach with regards to the public code is that a lack of a positive signal is not the same as a negative signal.

            Lacking a positive signal means that the things you could have learned (in this case: code quality, motivation to code off of work hours, etc) you have to learn from another way.

            A negative signal is something that is either an instant disqualification (a belligerent public online persona) or something that needs to be combatted by more positive signals (a spelling error on the resume might be mitigated by a long-standing blog that communicates clearly).

            For most companies/positions, a lack of a Github profile shouldn’t be considered a negative signal unless the position is something like “Open Source Developer Evangelist”.

            And I agree with @olliej’s reply below that a lack of a Github profile isn’t a great filtering measure, even if you are so flooded by resumes that you need some kind of mass filtering measure. Here are some reasons I wouldn’t use it as a first filtering mechanism:

            • It’s not a simple pass (everyone with a Github passes the screen)
            • It’s not a simple reject (“usually negative” means you need to weigh against something else anyway)
            • It’s subjective
            • It takes a significant amount of an engineer’s time to do it
            • You are trying to quickly evaluate code in an unfamiliar project or projects, and perhaps in an unfamiliar language, which will have big room for error
          3. 2

            Bingo. In practice, I almost always ask about it–some people just have private hosting or whatever, or have some other reason.

            The thing I also think a lot of people miss is: I had over a thousand (no joke, 1e3) applicants for a junior position I opened up. When you are trying to plow through that many applicants, applicants without easy code to show their talent are automatically lower priority in the heap than those with.

            1. 11

              … so you looked at all the code from those folks, or you just went “does/does not have a GitHub profile” as a filter?

              Again, this seems like a really good way to discriminate against people who are lower income, have families, etc. Not intentionally, just that that is the result of such filtering.

              For example, when I was at uni there was a real sharp divide between people who did open source work and those who did not, and it was super strongly correlated with wealth, and not “competence”. It’s far easier to do code beyond your assignments and whatnot if you don’t also have essentially a full-time job, or you don’t have children to care for, etc. The person that I would say was the single best developer in my uni’s CS department was also working pretty much every hour outside of uni for his entire time there. By your metric they would be worse than one of the people in my year, who I would argue was far below in competence but did have a lot of open source code and “community involvement” because his family was loaded.

              1. 6

                This reminds me of the discussions about how screening résumés with names removed to prevent bias still ends up failing because you can tell so much from other clues, like someone playing lacrosse in college, or that they went to a HBCU or an all-women’s college, etc.

                1. 6

                  Notes for people outside the USA, a HBCU is a “historically Black college or university”.

              2. 3

                Software development is a qualified job – you have to invest something (your time at first) before you can earn money. You read books, follow tutorials, discuss things with more experienced folks, study at university, do your own projects, study existing free software and contribute to it, get some junior job or internship etc. This is all part of preparing for a more qualified job.

                How does a university degree requirement differ from taking your own public projects into consideration? Both cost you your time. (Not to mention that a diploma is often a mandatory requirement while your own projects are just softly appreciated, and getting a diploma is a much larger investment than writing and publishing some code; the entry barrier in IT is very low compared to other fields.)

                If I ask a candidate: show me a photo of your bookshelf (or a list of your eBooks), tell me something about your favorite books that helped you grow professionally, or tell me about an article you read that opened your eyes… do you think that is also bad and discriminatory? Because not everyone has time to study books and read articles…

                Another aspect is enthusiasm. The abovementioned activities are not done intentionally to look good for a future employer, but because you like them and find them entertaining or enriching.

      2. 24

        I will check that you have public code. No public code is–usually–negative signal

        Then you’re rejecting a lot of excellent people for no good reason. Many (most?) jobs don’t let you publish your work code, put restrictions on your ability to contribute to OSS projects, and consider code developed by employees to be theirs (e.g. you need special permission to publish anything). This is in no way restricted to defense contractors; in my experience this is the norm for any case where your job is not explicitly working on OSS software. You may philosophically disagree with these employers’ policies, but that is still the reality for most developers.

        1. 3

          I agree with this. With MUMPS programming, for example, it’s not usual to publish the code because of the type of business.

      3. 21

        I will check that you have public code. No public code is–usually–negative signal. I will check that your Github stuff (for example) is not just forks of other work.

        The older I get the weirder this idea seems: evaluating someone for a paid position based on the quality and quantity of work they do outside of the time that they’re paid to do a job as a professional. Does any other profession work this way?

        1. 19

          Nobody asks accountants to show audits they’ve run or tax forms they’ve filed in their spare time for fun.

          Nobody asks civil engineers to have a portfolio of bridges they built as hobby projects.

          Nobody should ask developers to have a “GitHub résumé”.

          1. 17

            But if you’re hiring an accountant and there’s one who runs audits for fun and has a blog with the places where they caught major errors in the audits that they did for fun, you can bet they’d be near the top of the hiring pile.

            For a lot of other professions (especially arts and engineering), there’s a concept of a portfolio: a curated set of work that you bring to the interview to talk through, and which you may be asked to provide up front. With software engineering, it’s easy to make your portfolio public so it can be used earlier in the hiring process.

            1. 4

              Nobody has an expectation that accountants or many other professions will have professional-quality work done, for free, in one’s spare time, or suggests that the presence/absence of such should be a significant factor in hiring decisions.

              Also, it’s not “easy to make your portfolio public” in software. Out of all the companies I’ve worked for across my entire career, do you know how many of them even have a listing of their main repositories public on a site like GitHub? One, and that was Mozilla. Every other company has been private locked-down repos that nobody else can see. I can’t even see former employers’ repos.

              The only way to have a “portfolio” like you’re suggesting is thus to do unpaid work in one’s own free time. Which is not something we should expect of candidates and not something we should use as a way to compare them or decide between them.

              1. 1

                Also, it’s not “easy to make your portfolio public” in software.

                In the time it took me to write my comments in this thread, I could’ve signed up for Github (or Gitlab or Bitbucket or whatever) and opened a new repository with a basic Sinatra, Express, or even Bash script demonstrating some basic skill. Hundreds of thousands of developers, millions probably, have done this–and it’s near standard practice for any bootcamp graduate of the last decade.

                The only way to have a “portfolio” like you’re suggesting is thus to do unpaid work in one’s own free time. Which is not something we should expect of candidates and not something we should use as a way to compare them or decide between them.

                You don’t have to have a portfolio online. You don’t have to ever do any work that isn’t attached to a billable hour. Similarly, I also don’t have to take a risk on interviewing or hiring you when other people show more information.

                1. 5

                  Similarly, I also don’t have to take a risk on interviewing or hiring you when other people show more information.

                  This sounds more like a failure in your interviewing process than anything else.

                  So, look. I’ve run more interviews than I could count or care to remember. I’ve helped design interview processes at multiple companies. I’ve written about interviewing processes and given conference talks about interviewing processes. I am not lacking in experience with interviewing.

                  And this is just a gigantic red flag. As others keep telling you, what you’re doing is not hiring the best candidates. What you’re doing is artificially restricting your candidate pool in a way that excludes lots of perfectly qualified people who, for whatever reason – and the reason is none of your business and in many cases is something that, at least in civilized countries, you wouldn’t even legally be allowed to ask about in the interview – don’t have a bunch of hobby/open-source projects on GitHub.

                  1. 1

                    I feel I’ve explained my process (including many “this is not a hard-and-fast rule” qualifications) sufficiently well and accurately, and have given honest and conservative advice for people that I sincerely believe will help them get a job or at least improve their odds. If this is unsatisfactory to you, so be it.

                    I’m not interested in discussing this further with you, good day.

            2. 3

              More than that: introspective professionals are valuable. All paid coders should be able to write up some fun algorithms and discover them for a given need, but not all will go above and beyond in their understanding and mentorship.

              It’s a useful signal when present. It’s not a useful signal if absent. It’s a very negatively useful signal if all you have on your public commits is messages like “blah” and zero sanity in your repository layout.

              I tell people who are learning to code to get blame in other people’s projects, to learn good style and show some useful activity beyond forking a project and uploading commits of questionable value to the internet.

            3. 1

              I thought writing software was an art or a craft…

              This sounds a lot like people wanting to have it every which way whatever’s convenient.

          2. 7

            Nobody asks accountants… Nobody asks civil engineers…

            No, but they do require education and formal credentialing.

            1. 7

              Supposedly all these hoops we make people jump through in programming interviews are because the interviewers say they see too many people with degrees and impressive credentials who can’t write a for loop.

              1. 9

                If the software certification exams were anything like the CPA certification exams, we wouldn’t need to do nearly as many technical interviews. In other fields getting certified is an ordeal.

                1. 1

                  Sure. Now, come up with a standardized exam that everyone will agree covers what you need to be hireable as a programmer :)

                  1. 3

                    Other fields managed it: the CPA standardized exam takes 16 hours (not to study, to actually take) and the architecture ARE takes 22 hours.

                    Or we could not throw software engineers through that kind of meat grinder and stick with using other signals, like portfolios and technical interviews.

                    1. 3

                      If it were possible to build a single exam that actually did it, I don’t know if I’d mind just because it would end a lot of pointless discussions and avert a lot of horrible processes.

                      Meanwhile, asking for a “portfolio” or using it to decide between candidates has problems that are well-documented, including in this thread, and I don’t really think we should be perpetuating it. It’s one of those interview practices that just needs to go away.

              2. 1

                I’d say that’s not true for any graduate from my university. Many people from the non-CS faculties are even forced through a basic programming course.

          3. 5

            Nobody asks artists for a portfolio? Nobody asks engineers for previous specific projects, even if the details are obscured?

            The previous-projects one is way more than a single job role. The portfolio is often of paid work where there has been a release for portfolio use, or of work done outside the office.

            1. 3

              I’ve worked for multiple companies that used GitHub for their repositories. If I were applying for a job with you today, and you browsed my GitHub profile, you would not see any of the code I wrote at those companies, or even the names of the repositories.

              When people talk about a “portfolio” they always mean code written, unpaid, in one’s own spare time, unrelated to one’s current job, and many perfectly well-qualified programmers either do not do that or cannot do that due to not having the luxury of enough time to do so and make it look good.

          4. 3

            Nobody asks civil engineers to have a portfolio of bridges they built as hobby projects.

            Not true. Architects, for example, design many buildings during their studies or send proposals to architectural design competitions. Most of those buildings are never built and remain only on paper. And they were created in spare time. Guess what such an architect would discuss in a job interview… A portfolio of proposals and unrealized designs is very important.

            Doctors spend a long time in poorly paid or unpaid work before they gain enough experience. Journalists and even writers have to write pages and pages for nothing before they earn some money. Music bands, actors, painters, carpenters, joiners, blacksmiths, etc. etc. Actually it is quite a common pattern across society that you have to prove your skills before getting a good job.

            Maybe the world is „unfair“ and „cruel“, but if I compare IT with other fields… we don’t have much to complain about.

            1. 2

              Again, nobody expects a civil engineer to have a portfolio of actually completely-constructed full-scale physical real-world bridges built in their spare time for free as hobby projects.

              If you want to argue for apprenticeship as a form of education, feel free to, but apprenticeship is different from “do unpaid work on your own time”.

              1. 3

                Most open source code exists to ‘scratch an itch’. It’s written because the author had a problem that wasn’t solved by anything that existed on the market today. If you have never encountered a problem that can be solved by writing software in your life then you’re almost certainly in a tiny minority. If you’ve encountered such problems but not tried to solve them, that tells me something about you. If you’ve encountered them and not been able to solve them, that also tells me something.

                1. 3

                  If you’ve encountered such problems but not tried to solve them, that tells me something about you.

                  Yes, it tells you that they’ve encountered such problems but not tried to solve them. Nothing more. You can’t know why someone doesn’t spend their free time doing their day job again for fun. Maybe they just don’t enjoy doing their day job again, which would be terrible, somehow, according to this thread. But maybe they just have even more important things to do than that?

                  Why guess? What do you think you’re indirectly detecting and why can’t you just ask about it?

                  1. 1

                    As others have pointed out to you repeatedly in this thread, no one is saying don’t ask. But if people encounter problems that are within their power to fix, yet don’t fix them unless they consider it part of their job, then that’s definitely an attitude I’d like to discuss in some detail before I considered making a job offer.

                    1. 2

                      Nobody has pointed out anything to me on this thread before, repeatedly or otherwise.

                      Everyone encounters problems that are “within their power to fix” and doesn’t fix them, all the time. I don’t think that’s hyperbole. We could fix any of them, but we can’t fix all of them, because our problem-fixing resources are finite. I take your position to be that if they happen to prioritise the software in their life over any other kind of problem they might encounter, that means they are going to be better at their job. I think this is a bit silly.

                      For what it’s worth, I get home from my computer job most days somewhere on the mood spectrum between wanting to set fire to all computers and wanting to set fire to anyone who’s ever touched one. I’d love to get a job that doesn’t make me feel like that, and it’s rather frustrating to know that my job sucking all the joy out of computing for me also makes me unqualified to get a better one, at least in the eyes of quite a lot of people here.

              2. 2

                Your open source app doesn’t have to look good. It just kinda has to exist, and maybe have a readme. If it works, that’s even nicer.

          5. 1

            Accountants are certified. I have a CS degree from a brick and mortar university.

            Do you think we shouldn’t hire people without credentials?

            1. 2

              Which credentials do you intend to require?

        2. 7

          The signal exists. Should it be ignored because other industries don’t have an analogous signal?

      4. 7

        Resume, not CV

        What does this mean to you? They’re synonyms to me, so I’ve never really tried to define how they might differ.

        I will check that your Github stuff (for example) is not just forks of other work

        This seems a bit of a red herring to me. I include my GH to show that yes, I really know how to program, so we can skip the mutually embarrassing “are you a complete fraud using somebody else’s CV” stage, not to show that I own several interesting repos. I mean, there’s a few in there that I actually started and they used to be things people used. But 90+ percent of “my” repos are forks because that’s how you contribute to many existing projects.

        1. 3

          But 90+ percent of “my” repos are forks because that’s how you contribute to many existing projects.

          Two things you can do here that are useful:

          • Make the default branch one that contains code that you wrote. I will probably click on them. If I see branches that have raised PRs and good interactions between you and the upstream, that’s a very positive thing. Especially if the PRs are merged.
          • Pin repos that you want me to look at. GitHub gives you (I think) six repos to show in the profile screen. These should be the ones that you think best showcase your work.
        2. 2

          (answering you and @enn in same place)

          What does this mean to you?

          I’m used (rightly or wrongly) to resumes being shorter documents that are typically more focused for a particular job, especially in the US. CVs are typically longer, have a lot more detail including coursework, talks, presentations, publications, and other stuff. My understanding is that CVs are also more common in academia, which I’ve never hired for.

          But 90+ percent of “my” repos are forks because that’s how you contribute to many existing projects.

          Indeed, which is why I also tend to click-through to a few of the repos to see if people have commits or attempted commits in those projects.

          There are folks that, if you exclude forks, suddenly go from scores of repos to perhaps less than 10. There are folks I’ve seen who only have a few forks and no source repos of their own, but who have made significant contributions to those forks. My experience is that there are far more of the former than the latter, because the first order signalling is “how many repos do you have on Github” for people that care about such things and that’s how you spoof.

        3. 2

          It’s pretty common to use “CV” to mean a complete list of all prior work, education, and awards, and “resume” to mean a one page summary of relevant experience.

      5. 5

        I will check that your Github stuff (for example) is not just forks of other work.

        If those forked repos are there because the person is contributing to others’ open-source projects, I would argue that kind of work is probably more reflective of the skills that are useful in most professional programming jobs in industry than a bunch of solo projects, however impressive.

    6. 3

      In my experience, a lot of the widely-cited problems with YAML go away if you’re deserializing it into statically-typed data structures. Or maybe more precisely, if you have a schema that defines the data types, which you kind of get implicitly as part of deserialization depending on your language/environment. For example (Kotlin):

      import org.yaml.snakeyaml.Yaml
      
      val yamlDocument = "version: 1.20"
      
      // The target class declares "version" as a string
      data class YamlTest(var version: String? = null)
      
      val deserialized = Yaml().loadAs(yamlDocument, YamlTest::class.java)
      
      println(deserialized.version)
      

      prints 1.20, not 1.2. Not to say the complaints about YAML are without merit, but it’s easy to come away from articles like this with the incorrect conclusion that it’s outright impossible to avoid a lot of YAML’s problems.

      1. 1

        When you mentioned static typing I was expecting an example where the parser was expecting a String, but saw a numeric literal (1.20) and threw an error because of the mismatch. But the behavior of that library seems… worse than the alternative, almost? Like, 1.20 is unambiguously a number in YAML, right? Why does the library coerce it to a String for you?

        YAML’s behavior is surprising enough as it is; if some library does an additional surprising thing to try to be helpful then that seems like it would make the YAML ecosystem more surprising, not less.

    7. 63

      It is an indictment of our field that this opinion is even controversial. Of course XML is better than YAML. YAML is a pile of garbage that should be abandoned completely. XML at least has a lot of decent engineering behind it for specific purposes.

      1. 68

        Meh, these kinds of absolute statements don’t really shed any light on the problem

        Seems like fodder for self-righteous feelings

        1. 28

          You’re right. The principles should be laid out: Ease of reasoning about configuration file formats is vastly more important than conveniences for writing specific values. Implicit conversion among types beyond very basic lifting of integer types is a bad idea, especially for configuration file formats. Grammars for configuration file formats should be simple enough to write a complete, correct grammar as a one day project.

          XML is kind of a weird animal because it’s playing the role equivalent to text for JSON. The principles above apply to the DTD you apply to your XML schema.

          1. 1

            Where does YAML do implicit type conversions?

            1. 6

              The Norway problem is a good example of this.

              1. 2

                There is no implicit type conversion going on on the YAML side. no is a boolean in YAML, just like false is a boolean in JSON. If a YAML parser converts it to a string, that’s the parser’s problem.
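
                For illustration, a minimal Kotlin sketch (a hedged example assuming classic SnakeYAML, whose resolver follows YAML 1.1 implicit typing; a parser targeting the YAML 1.2 core schema would only treat true/false this way):

                import org.yaml.snakeyaml.Yaml

                fun main() {
                    val yaml = Yaml()

                    // Unquoted NO resolves to the boolean false under YAML 1.1-style implicit typing.
                    val unquoted: Map<String, Any> = yaml.load("country: NO")
                    println(unquoted["country"])   // false (a Boolean)

                    // Quoting the scalar keeps it a string.
                    val quoted: Map<String, Any> = yaml.load("country: \"NO\"")
                    println(quoted["country"])     // NO (a String)
                }

                Whichever side of “whose problem is it” you take, only the quoted form survives unchanged.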

                1. 3

                  Ha. I can tell you’ve never written a parser before!

                  1. 2

                    No, @xigoi is right, strictly speaking. The parser is where this conversion is going on. Only if it cannot read an unquoted literal as anything else does it read it as if it were a quoted string. Of course, to a user that is neither here nor there: the rules need to be memorized to be able to use unquoted literals correctly.

                    1. 6

                      the rules need to be memorized to be able to use unquoted literals correctly

                      You’ll have a better time if you just use quotes by default… I don’t understand the appeal of unquoted literals in YAML

                      This, for me, is the root of it. YAML is fine as long as you are explicit. Now what it takes to be explicit is going to be driven by what types you intend to use. It seems to me that the majority of YAML use cases intend to use only a handful of scalar types and a handful of collection types. That small set of types, not coincidentally, is basically the same as what you get in JSON, and properly formed JSON is always valid YAML. So I would assert that if you use YAML and explicitly quote string values, you are effectively getting a slightly looser JSON parser which happens to allow you to write a flavor of JSON that is much easier for human concerns; i.e., less picky about trailing commas, supporting comments, and easier on the eyes with some of its constructs.

                      Of course, we’ve got a whole shitload of options these days, so I wouldn’t be surprised if some other markup/serialization format is better in any given specific domain. Different tools for different jobs…

                      One thing I will absolutely agree with is that YAML is awful when used as a basis for pseudo-DSLs, as you see in things like Ansible and a lot of CI/CD systems.

                      1. 2

                        I think we basically agree, but in my opinion one should accept that people are lazy (or forgetful) and use shortcuts, or even copy/paste bad examples. This is like saying sloppiness in PHP or JS is not a problem because one can always use ===.

                        Most people don’t have the discipline to be explicit all the time (some don’t have the discipline to ever be explicit), therefore it’s probably safer to avoid tools with overly prominent inbuilt footguns entirely.

        2. 3

          TBH it seems that way because it almost feels pointless to reiterate the absurdity of YAML.

        3. 7

          Rubbish, the list of semantic surprises in YAML is long and established. The problems with XML boil down to “my fingies are sore fwom all the typing” and fashion.

          1. 21

            One of the most talented developers I know can only work for 2-3 hours a day on a good day because of RSI. I don’t think your patronising take carries the weight you think it does.

            1. 3

              That some people have physical difficulties does not at all impact the validity of the greater population’s supposed concerns about verbosity.

              1. 3

                Let’s also make websites inaccessible because most people don’t need screen readers, shall we?

                1. 1

                  You’re making my point. We have accessibility standards and specialised tools. We don’t demand web pages don’t have videos.

          2. 10

            There are other issues with XML. Handling entities is complex, as are the rules for namespacing. Writing an XML parser is complex, so most people use libxml2, which is a massive library that doesn’t have a great security track record. For most YAML use cases (where the input data is trusted) this doesn’t matter too much. Parsing YAML is also incredibly hard, so everyone uses the same YAML parser library.

            1. 1

              Problems in a specific parser can’t be called problems in the format itself. For what it’s worth YAML’s popular parsers have also had horrible security problems in the past.

              If you have a minute to go into detail, I’m interested in what I’ve missed that makes namespaces complicated; I found them pleasing when used correctly, and frankly they were used so infrequently that it hardly ever came up, outside of specific formats that used XML as a container, for example MXML. But this knowledge is old now in my case, so I probably just missed the use case that you’re referring to.

              The entity expansions should never have been a thing, that much I’m sure we can all agree on. DTDs were a mistake, but XSD cleaned most of that up; unless you were building general XML tooling you could in most cases ignore schemas and custom entities completely.

              What’s good about XML (aside from how much support and tooling it once had) is IMO:

              • The consistency with which the tree structure is defined. I don’t know why “modern” markups are all obsessed with the idea that the end of a node should be implied by what’s around it, rather than clearly marked, but I can’t stand it.
              • A clear separation of attributes and children.
              • Consistency in results, in that there are no “clever” re-interpretations of text.
              1. 2

                Consider this made up XML:

                <?xml version="1.0" encoding="UTF-8"?>
                <something>
                  <thing xmlns="mynamespace">
                    <item>An item.</item>
                  </thing>
                </something>
                

                Now, let’s query element item using XPath:

                //something/*[namespace-uri()='mynamespace' and local-name()='thing']/*[namespace-uri()='mynamespace' and local-name()='item']

                🤯

                And now imagine querying some element from a deeply nested XML that might contain more than one custom namespace.

                In my opinion XML namespaces just make it harder to work with the documents.

                1. 1

                  Dear XPath, please adopt Clark notation so we can do /something/{mynamespace}thing/item
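
                  In the meantime, the usual workaround is to bind your own prefix to the namespace URI in the XPath context, which keeps the query short. A hedged Kotlin sketch using the JDK’s javax.xml.xpath (the m prefix is an arbitrary choice, not anything the document declares):

                  import javax.xml.XMLConstants
                  import javax.xml.namespace.NamespaceContext
                  import javax.xml.parsers.DocumentBuilderFactory
                  import javax.xml.xpath.XPathConstants
                  import javax.xml.xpath.XPathFactory
                  import org.w3c.dom.Node

                  fun main() {
                      val xml = """
                          <something>
                            <thing xmlns="mynamespace">
                              <item>An item.</item>
                            </thing>
                          </something>
                      """.trimIndent()

                      // Namespace-aware parsing is required for namespaced XPath queries.
                      val doc = DocumentBuilderFactory.newInstance()
                          .apply { isNamespaceAware = true }
                          .newDocumentBuilder()
                          .parse(xml.byteInputStream())

                      val xpath = XPathFactory.newInstance().newXPath().apply {
                          // Bind the arbitrary prefix "m" to the document's default namespace.
                          namespaceContext = object : NamespaceContext {
                              override fun getNamespaceURI(prefix: String?): String =
                                  if (prefix == "m") "mynamespace" else XMLConstants.NULL_NS_URI
                              override fun getPrefix(namespaceURI: String?): String? = null
                              override fun getPrefixes(namespaceURI: String?): MutableIterator<String> =
                                  mutableListOf<String>().iterator()
                          }
                      }

                      // Much shorter than the namespace-uri()/local-name() workaround above.
                      val item = xpath.evaluate("/something/m:thing/m:item", doc, XPathConstants.NODE) as Node
                      println(item.textContent)  // An item.
                  }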

                2. 1

                  Yeah that’s rough as guts 🤣 I’ve never seen somebody override the current namespace in the middle of the document, I never even considered that as something you could do. Nobody should have done this, ever.

                  1. 2

                    As a real-world use case, <svg> elements within HTML documents often set the namespace.

                  2. 1

                    Probably not specifically in this way but I am sure you’ve worked with documents which use different namespaces with nested elements.

              2. 2

                It’s almost 20 years since I did anything serious with XML, but I seem to remember the namespace things let you define alternative names for tags to avoid conflicts, so you had to parse tags as their qualified name, their unqualified name in the current namespace (or the parents?) or their aliased name.

                A lot of the security issues of libxml2 were due to the inherent complexity of the format. There are a lot of JSON parsers because the format is simple. You can write a JSON parser in a couple of hundred lines of code if you have a decent string library. A compliant XML parser is at least one, probably two, orders of magnitude more complex. That significantly increases the probability that it will have bugs.

                I’m also not sure I agree on the ‘clear separation of attributes and children’ thing. XML formats that I’ve worked with have never managed to be completely consistent here. Attributes are unstructured key-value pairs, children are trees, but there are a lot of cases where it’s unclear whether you should put something in an attribute or a child. Things using XML for text markup have to follow the rule that cdata is text that is marked up by surrounding text, but things using XML as a structured data transport often end up accidentally leaking implementation details of their first implementation’s data structures into this decision.

          3. 10

            If you’re creating XML by hand, you’re doing it wrong.

      2. 20

        I have zero real world issues with YAML, honestly. I’ll take YAML over XML every day of the week for config files I have to edit manually. Do I prefer a subset like StrictYAML? Yep. Do I still prefer YAML over anything else? Also yep.

        1. 11

          The problem with YAML is that you believe you have no real world issues until you find out you do.

          1. 5

            This sounds like all the folks who have “no real issues” with MySQL, or PHP (or dare I say it, JavaScript). Somehow the issues with YAML seem more commonly accepted as “objectively” bad, whereas the others tend to get defended more fiercely. I wonder why!

          2. 1

            What is an example of a problem with YAML that syntax highlighting won’t immediately warn you about?

            1. 11

              At a previous employer we had a crazy bug that took a while to track down, and when we did, it turned out the root cause was YAML parsing something as a float rather than a string, even though the syntax highlighting treated it as a string.

              I wasn’t the developer on the case, so I don’t remember the exact specifics, but it boiled down to something like we used a object ID which was a hex string, and the ID in question was something along the lines of:

              oid: 123E456
              

              Which, according to the YAML spec, is valid scientific notation. Of course this could be chalked up to a bug in the syntax highlighting, or to a failure on our part to use quotation marks, but the result was the same: a difficult-to-track-down bug downstream.

        2. 6

          real-world problems i’ve had with yaml:

          • multiline strings (used the wrong kind)
          • the norway problem
          • no block end delimiter (chop a file in half arbitrarily and it still parses without error)
    8. 3
      body {
        min-height: 100vh;
      }
      

      What’s the purpose of this? The author only says that it “is pretty handy too, especially if you’re gonna be setting decorative elements,” which tells me nothing.

      1. 3

        I wouldn’t want it in a reset because it’s an opinionated choice, but it’s useful when you have short pages (potentially not a whole screenful) and you want the footer to be at the bottom of the screen without any blank space under it. The alternative is that the footer sits in the middle of the screen, followed by blank space, which can be ugly.

        Most sites however can fill a screen between just the header and the footer so it’s not always necessary. Also, and this is niche but it comes up a lot in the news industry, if you try to make a seamless iframe, it screws up the page height detection.

        1. 1

          To get a footer to stick to the bottom of the body, you need a lot more CSS.

          To me this rule alone is just asking for trouble with unwanted scrollbars (there are a half dozen different definitions of viewport height depending on toolbars, keyboard, dead areas, etc.)

          1. 1

            To be clear, it’s not for sticky footers, just growing the content area until the footer is at the viewport bottom. But I agree, it’s mostly unnecessary and it can be wrong if you need dvh or whatever instead.

    9. 5

      Support for image-set() is nice, but it does also feel like we now have a ridiculous number of ways to tell a browser, “I would like to display an image, and here is a list of image files that you could load based on the size or resolution at which the image will be displayed, or based on which image formats you support.” Specifically, we already had <img srcset> and <picture>. This CSS-Tricks article explains when and how to use them, but it’s pretty complicated! At least this new version is part of CSS and not HTML, I guess.

    10. 7

      Are there multiple interoperable implementations of the sqlite file format? Is the format specified somewhere? Does the format remain backwards compatible indefinitely?

      I don’t know the answers, but it feels like these are more important questions when considering a document format.

      1. 6

        I think your latter two questions are addressed right on the SQLite home page:

        The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through the year 2050.

        1. 2

          I’m probably just anxious after the sqlite2 -> sqlite3 breakage, though maybe that taught them the value of keeping things stable.

          1. 4

            Would you care to elaborate? Docs suggest that sqlite3 was released 2004-08-09; I have not read anything about instability or migration issues.

          2. 4

            That’s a long time to be anxious for, it may be time to let that go ;-)

      2. 2

        D. Richard Hipp addressed that in a comment on Hacker News: https://news.ycombinator.com/item?id=37558809

        SQLite file format spec: https://www.sqlite.org/fileformat2.html

        Complete version history: https://sqlite.org/docsrc/finfo/pages/fileformat2.in

        Note that there have been no breaking changes since the file format was designed in 2004. The changes shown in the version history above have all been one of (1) typo fixes, (2) clarifications, or (3) filling in the “reserved for future extensions” bits with descriptions of those extensions as they occurred.

    11. 10

      I understand the rationale for including the original emoji (unicode wants to be a superset of existing character sets) but they should have been put in a code space reserved for backwards compatibility with bad ideas, not made such a big part of unicode.

      At this point, there’s a strong argument for a new character set that is a subset of unicode that removes all of the things that are not text. We already have mechanisms for embedding images in text. Even in the ‘90s, instant messaging systems were able to avoid sending common images by having a pre-defined set of pictures that they referenced with short identifiers. This was a solved problem before Unicode got involved and it’s made text processing an increasingly complicated mess, shoving image rendering into text pipelines for no good reason.

      The web could have defined a URL encoding scheme for emoji from an agreed set, or even a shorthand tag with graceful fallback (e.g. <emoji img="gb-flag;flag;">Union Flag</emoji>, which would render a British flag if you have an image for gb-flag, a generic flag if you don’t, and have ‘Union Flag’ as the alt text or the fallback if you don’t support emoji). With the explicit description and fallback, you avoid things like ‘I’m going to shoot you with a 🔫’ being rendered as ‘I’m going to shoot you with a {image of a gun}’ or ‘I’m going to shoot you with a {image of a water pistol}’ depending on the platform: if you didn’t have the water-pistol image, you’d fall back to the text, not show the pistol image.

      1. 24

        Like it or not, emoji are a big part of culture now. They genuinely help convey emotion in a fairly intuitive manner through text, way better than obscure tone indicators. I mean, what’s more understandable?

        “Are you going to leave me stranded? 😛”

        “Are you going to leave me stranded? [/j]”

        It definitely changes the meaning of the text. They’re here to stay, and being in Unicode means they got standardized, and it wouldn’t have happened otherwise.

        Of course there’s issue with different icon sets having different designs (like how Samsung’s 😬 was completely different from everyone else’s), but those tend to get resolved eventually.

        1. 4

          Like it or not, emoji are a big part of culture now. They genuinely help convey emotion in a fairly intuitive manner through text, way better than obscure tone indicators.

          Except they don’t. Different in-groups assign different meanings to different ones. Try asking someone for an aubergine using emoji some time and see what happens.

          “Are you going to leave me stranded? 😛”

          This is culturally specific. It’s an extra set of things that people learning English need to learn. This meaning for sticking out your tongue is not even universal across European cultures. And that’s one of the top ten most common reaction emoji, once you get deeper into the hundreds of others the meaning is even further removed. How would you interpret the difference between 🐶 and 🐕 in a sentence?

          Of course there’s issue with different icon sets having different designs (like how Samsung’s 😬 was completely different from everyone else’s), but those tend to get resolved eventually.

          That’s an intrinsic property of using unicode code points. They are abstract identifiers that tell you how to find a glyph. The glyphs can be different. A Chalkboard A and a Times A are totally different pictures because that’s an intrinsic property of text. If Android has a gun and iOS has a water pistol for their pistol emoji, that’s totally fine for characters but a problem for images.

          1. 16

            😱 Sure, emojis are ambiguous, and different groups can use them differently. But that doesn’t mean they don’t convey meaning? The fact that they are so widely used should point towards them being useful, no? 😉

            1. 7

              I never said that embedding images in text is not useful. I said that they are not text, do not have the properties of text, and treating them as text causes more problems than it solves.

              1. 3

                Emoji are not alphabets, syllabaries, abugidas, or abjads. But they are ideograms, which qualifies them as a written script.

                1. 1

                  I disagree. At best, they are precursors of an ideographic script. For a writing system, there has to be some kind of broad consensus on semantics and there isn’t for most emoji beyond ‘that is a picture of X’.

                  1. 3

                    For a writing system, there has to be some kind of broad consensus on semantics

                    Please describe to me the semantics of the letter “р”.

                    1. 1

                      Please describe to me the semantics of the letter “р”.

                      For alphabetic writing systems, the semantics of individual letters is defined by their use in words. The letter ‘p’ is a component in many of the words in this post and yours.

                      1. 5

                        Thank you! (That was actually U+0440 CYRILLIC SMALL LETTER ER, which only featured once in both posts, but no matter.)

                        the semantics of individual letters is defined by their use in words

                        The thing is, I disagree. “e” as a letter itself doesn’t have ‘semantics’, only the words including it do[1]. What’s the semantics of the letter “e” in “lobster”? An answer to this question isn’t even wrong. It gets worse when different writing systems interpret the same characters differently: if I write “CCP”, am I referring to the games company CCP Games? Or was I abbreviating сoветская социалистическая республика? What is the semantics of a letter you cannot even identify the system of?

                        Emoji are given meaning of different complexity by their use in a way that begins to qualify them as logographic. Most other writing systems didn’t start out this way, but that doesn’t make them necessarily more valid.

                        [1]: The claim doesn’t even hold in traditional logographic writing systems which by all rights should favor your argument. What is the semantics of the character 湯? Of the second stroke of that character? Again, answers aren’t even wrong unless you defer to the writing system to begin with, in which case there’s no argument about (in)validity.

          2. 13

            Except they don’t. Different in groups assign different meanings to different ones.

            This is true of words as well.

            1. 3

              Yes, but their original point is that we should be able to compose emojis like we compose words, as in the old days of phpBB and instant messaging. :mrgreen:

              1. 13

                Just a nit: people do compose emojis - I see sequences of emojis all the time. People send messages entirely of emojis that other people (not necessarily me) understand.

                1. 9

                  The fact that an in-group can construct a shared language using emoji that’s basically opaque to outsiders is probably a big part of their appeal.

                  1. 8

                    Yeah, and also there’s nothing wrong with that, it’s something any group can and should be able to do. I have no entitlement to be able to understand what other people say to each other (you didn’t claim that, so this isn’t an attack on you. I am just opposed to the “I don’t like / use / understand emojis how other people use them therefore they are bad” sentiment that surfaces periodically).

              2. 4

                That’s fair, I’m just nitpicking a specific point (that happens to be a pet peeve of mine).

            2. 2

              This is true of words as well.

              But not of characters and rarely true even of ideographs in languages that use them (there are exceptions but a language is not useful unless there is broad agreement on meaning). It’s not even true of most words, for the same reason: you can’t use a language for communication unless people ascribe the same meaning to words. Things like slang and jargon rarely change more than a small fraction of the common vocabulary (Clockwork Orange aside).

              1. 6

                Without getting into the philosophy of what language is, I think this skit best illustrates what I mean (as an added bonus, emoji would have resolved the ambiguities in the text).

                Note I’m not arguing for emoji to be in Unicode, I’m just nitpicking the idea that the problem with them is ambiguity.

              2. 2

                Socrates would like to have a chat with you. I won’t go through the philosophical tooth-pulling that he would have enjoyed, but suffice it to say that most people are talking past each other and that most social constructions are not well-founded.

                I suspect that this is a matter of perspective; try formalizing something the size of English Prime (or, in my case, Lojban) and see how quickly your intuitions fail.

      2. 15

        I understand the rationale for including the original emoji (unicode wants to be a superset of existing character sets) but they should have been put in a code space reserved for backwards compatibility with bad ideas, not made such a big part of unicode.

        Except emoji have been absolutely stellar for Unicode: not only are they a huge driver of adoption of Unicode (and, through it, UTF-8) because they’re actively desirable to a large proportion of the population, they’ve also been a huge driver of improvements to all sorts of useful Unicode features which renderers otherwise tend to ignore despite their usefulness to the rendering of actual text, again because emoji are highly desirable and platforms which did not support them got complaints. I fully credit emoji with MySQL finally getting their heads out of their ass and releasing a non-broken UTF-8 (in 2010 or so). That’s why the Unicode consortium has been actively leveraging emoji to force support for more complex compositions.

        And the reality is there ain’t that much difference between “image rendering” and “text pipeline”. Rendering “an image” is much easier than properly rendering complex scripts like Arabic, Devanagari, or Burmese (or Z̸̠̽a̷͍̟̱͔͛͘̚ĺ̸͎̌̄̌g̷͓͈̗̓͌̏̉o̴̢̺̹̕), even ignoring that you can use text presentation if you don’t feel like adding colors to your pipeline.

        Even in the ‘90s, instant messaging systems were able to avoid sending common images by having a pre-defined set of pictures that they referenced with short identifiers.

        After all what’s better than one standard if not fifteen?

        This was a solved problem before Unicode got involved and it’s made text processing an increasingly complicated mess, shoving image rendering into text pipelines for no good reason.

        This problem was solved by adding icons in text. Dingbats are as old as printing, and the Zapf Dingbats which unicode inherited date back to the late 70s.

        The web

        Because nobody could ever want icons outside the web, obviously. As demonstrated by Lucida Icons having never existed.

      3. 10

        subset of unicode that removes all of the things that are not text

        It sounds like you disagree solidly with some of Unicode’s practices so maybe this is not so appealing, but FWIW the Unicode character properties would be very handy for defining the subset you’d like to include or exclude. Most languages seem to have a stdlib interface to them, so you could pretty easily promote an ideal of how user input like comment boxes should be sanitized and offer your ideal code for devs to pick up and reuse.

      4. 8

        new character set that is a subset of unicode that removes all of the things that are not text

        and who’d be the gatekeeper on what the text is and isn’t? What would they say about the ancient Egyptian hieroglyphs? Are they text? If yes, why, they are pictures. If no, why, they encode a language.

      It might be a shallow comparison, but people trying to tell others what forms of written text are worthy of being supported by text rendering pipelines gets me going.

        If the implementation is really so problematic, treat emojis as complicated ligatures and render them black and white.

        1. 3

          and who’d be the gatekeeper on what the text is and isn’t? What would they say about the ancient Egyptian hieroglyphs? Are they text? If yes, why, they are pictures. If no, why, they encode a language.

          Hieroglyphics encode a (dead) language. There are different variations on the glyphs depending on who drew them (and what century they lived in) and so they share the property that there is a tight(ish, modulo a few thousand years of drift) coupling between an abstract hieroglyph and meaning and a loose coupling between that abstract hieroglyph and a concrete image that represents it. Recording them as text is useful for processing them because you want to extract the abstract characters and process them.

          The same is true of Chinese (though traditional vs simplified made this a bit more complex and the unicode decisions to represent Kanji and Chinese text using the same code points has complicated things somewhat): you can draw the individual characters in different ways (within certain constraints) and convey the same meaning.

          In contrast, emoji do not convey abstract meaning, they are tightly coupled to the images that are used to represent them. This was demonstrated very clearly by the pistol debacle. Apple decided that a real pistol image was bad because it was used in harassment and decided to replace the image that they rendered with a water pistol. This led to the exact same string being represented by glyphs that conveyed totally different meaning. This is because the glyph not the character encodes meaning for emoji. If you parsed the string as text, there is no possible way of extracting meaning without also knowing the font that is used.

          Since the image is the meaningful bit, not the character, we should store these things as images and use any of the hundreds of images-and-text formats that we already have.

          More pragmatically: unicode represents writing schemes. If a set of images have acquired a significant semantic meaning over time, then they may count as a writing system and so can be included. Instead, things are being added in the emoji space as new things that no one is using yet, to try to define a writing scheme (largely for marketing reasons, so that ‘100 new emoji!’ can be a bullet point on new Android or iOS releases).

          It might be a shallow comparison, but people trying to tell others what forms of written text are worthy of being supported by text rendering pipelines gets me going.

          It’s not just (or even mostly) about the rendering pipelines (though it is annoying there because emoji are totally unlike anything else and have required entirely new features to be added to font formats to support them), it’s about all of the other things that process text. A core idea of unicode is that text has meaningful semantics distinct from the glyphs that represent it. Text is a serialisation of language and can be used to process that language in a somewhat abstract representation. What, aside from rendering, can you do with processing of emoji as text that is useful? Can you sort them according to the current locale meaningfully, for example (seriously, how should 🐕 and 🍆 be sorted - they’re in Unicode and so that has to be specified for every locale)? Can you translate them into a different language? Can you extract phonemes from them? Can you, in fact, do anything useful with them that you couldn’t do if you embedded them as images with alt text?

          1. 11

            Statistically, no-one cares about hieroglyphics, but lots of people care about being able to preserve emojis intact. So text rendering pipelines need to deal with emojis, which means we get proper hieroglyphics (and other Unicode) “for free”.

            Plus, being responsible for emoji gives the Unicode Consortium the sort of PR coverage most organizations spend billions to achieve. If this helps them get even more ancient writing systems implemented, it’s a net good.

          2. 2

            What, aside from rendering, can you do with processing of emoji as text that is useful?

            Today, I s/☑️/✅/g a text file.

            Can you sort them according to the current locale meaningfully, for example (seriously, how should 🐕 and 🍆 be sorted - they’re in Unicode and so that has to be specified for every locale)?

            Do I have the book for you!

            Can you translate them into a different language? Can you extract phonemes from them?

            We can’t even do that with a lot of text! 😝

      5. 8

        At this point, there’s a strong argument for a new character set that is a subset of unicode that removes all of the things that are not text.

        All that’s missing from this sentence to set off all the 🚩 🚩 🚩 is a word like “just” or “simply”.

        Others have started poking at your definition of “text”, and are correct to do so – are hieroglyphs “text”? how about ideograms? logograms? – but really the problem is that while you may feel you have a consistent rule for demarcating “text” from “images” (or any other “not text” things), standards require getting a bunch of other people to agree with your rule. And that’s going to be difficult, because any such rule will be arbitrary. Yours, for example, mostly seem to count certain very-image-like things as “text” if they’ve been around long enough (Chinese logograms, Egyptian hieroglyphics) while counting other newer ones as “not text” (emoji). So one might reasonably ask you where the line is: how old does the usage have to be in order to make the jump from “image” to “text”? And since you seem to be fixated on a requirement that emoji should always render the same on every platform, what are you going to do about all the variant letter and letter-like characters that are already in Unicode? Do we really need both U+03A9 GREEK LETTER CAPITAL OMEGA and U+2126 OHM SIGN?

        etc.

        1. 1

          So one might reasonably ask you where the line is: how old does the usage have to be in order to make the jump from “image” to “text”?

          Do they serialise language? They’re text. Emoji are not a writing system. They might be a precursor to a writing system (most ideographic writing systems started with pictures and were then formalised) but that doesn’t happen until people ascribe common meaning to them beyond ‘this is a picture of X’.

          And since you seem to be fixated on a requirement that emoji should always render the same on every platform, what are you going to do about all the variant letter and letter-like characters that are already in Unicode?

          That’s the opposite of my point. Unicode code points represent an abstraction. They are not supposed to require an exact glyph. There are some things in Unicode to allow lossless round tripping through existing character encodings that could be represented as sequences of combining diacritics. They’re not ideal in a pure-Unicode world but they are essential for Unicode’s purpose: being able to represent all text in a form amenable to processing.

          For each character, there is a large space of possible glyphs that a reader will recognise. The letter A might be anything from a monospaced block character to a curved illustrated drop character from an illuminated manuscript. The picture is not closely coupled to the meaning and changing the picture within that space does not alter the semantics. Emoji do not have that property. They cause confusion when slightly different glyphs are used. Buzzfeed and similar places are full of ‘funny’ exchanges from people interpreting emoji differently, often because they see slightly different glyphs.

          The way that emoji are used assumes that the receiver of a message will see exactly the same glyph that the sender sends. That isn’t necessary for any writing system. If I send Unicode of English, Greek, Icelandic, Chinese, or ancient Egyptian, the reader’s understanding will not change if they change fonts (as long as the fonts don’t omit glyphs for characters in that space). If someone sends a Unicode message containing emoji, they don’t have that guarantee because there is no abstract semantics associated with them. I send a picture of a dog, you see a different dog, I make a reference to a feature of that dog and that feature isn’t present in your font, you are confused. Non-geeks in my acquaintance refer to them as ‘little pictures’ and think of them in the same way as embedded GIFs. Treating them as characters causes problems but does not solve any problems.

          1. 2

            Do they serialise language? They’re text. Emoji are not a writing system. They might be a precursor to a writing system (most ideographic writing systems started with pictures and were then formalised) but that doesn’t happen until people ascribe common meaning to them beyond ‘this is a picture of X’.

            I think this is going to end up being a far deeper and more complex rabbit hole than the tone of your comment anticipates. Plenty of things that are in Unicode today, and that you undoubtedly would consider to be “text”, do not hold up to this criterion.

            For example, any character that has regional/dialect/language-specific variations in pronunciation seems to be right out by your rules. So consider, say, Spanish, where in some dialects the sound of something like the “c” in “Barcelona” is /s/ and in others it’s /θ/. It seems hard to say that speakers of different dialects agree on what that character stands for.

      6. 4

        At this point, I feel like the cat is out of the bag; people are used to being able to use emoji in almost any text-entry context. Text rendering pipelines are now stuck supporting these icons. With that being the case, wouldn’t it be way more complexity to layer another parsing scheme on top of Unicode in order to represent emoji? I can see the argument that they shouldn’t have been put in there in the first place, but it doesn’t seem like it would be worth it to try to remove them now that they’re already there.

    12. 2

      The GET request technically uses the parameter resource=acct:justin@mastodon.jgarr.net but with this static file example we only have one user on the domain so we’ll ignore that part. If you want to have multiple users on the same domain you will have to handle parameters on the server side. Meaning, you can’t do that with static files.

      Since there’s only one query parameter I think you can still do this with static files; you can construct a rewrite rule to turn https://server/.well-known/webfinger?resource=acct:user@domain into https://server/.well-known/webfinger/accts/user@domain which can just be one static file per user.

      In theory this is not compliant, because the server should treat ?ignore=this&resource=acct:user@domain the same as ?resource=acct:user@domain for arbitrary extra keys, which the rewrite rule won’t catch, but in practice this seems very unlikely to matter. There are probably smarter tricks than a regular rewrite that you could do with .htaccess to make this work, but I’m not a web server expert.

      1. 4

        Nginx has the ability to parse the query string for you so that your configuration can depend on specific parameters’ values. My server has something like

        location /.well-known/webfinger {
            if ($arg_resource = "acct:myname@mydomain.social") {
                rewrite .+ /my/static/webfinger/file.json last;
            }
        }
        

        Other parameters in the query string will simply be ignored.

    13. 2

      Looks neat, but you seem to think that an outer join and a left join are the same thing. That’s not true. (incorrect page: https://antonz.org/sql-cheatsheet/ )

      I did a tutorial today where I explained how joins worked to a student. You can view the whiteboard here for my explanation: https://jamboard.google.com/d/17V39ADf01zcRs3OIDtQqwc41d7To8LKjzOzIXHOsnd0/edit?usp=sharing

      1. 3

        Of course I don’t think so. Nor does the article say so.

        1. 8

          I think they’re saying it because the linked page has a heading “Outer JOIN (LEFT JOIN)” and some text “An outer (left) JOIN, on the other hand, says [definition]”. I don’t know if you intended it, but this parenthetical makes it look like left join is an alternate name for outer join, and having only one definition implies they have the same definition. If you don’t mean to say so, that would be a good place to add a sentence like “While outer join and left join are very similar, the difference is that…”

          1. 1

            Perhaps that’s confusing, but it is a completely standard use of parentheses in English.

            1. 4

              Do you have a source for it being standard English to use parentheses to note — if I understand correctly — alternative pieces of text that, if all substituted for the pieces of text before them (in some context-dependent way) would give an equally true sentence?

              As a native English user, the only ‘English’ writing in which I’ve ever seen that use of parentheses is the Encyclopedia of Mathematics, which was translated from Russian and somewhat often uses grammar that sounds odd in English. Some examples of this use of parentheses in that encyclopedia can be seen in https://encyclopediaofmath.org/wiki/Vector and https://encyclopediaofmath.org/wiki/Neighbourhood. Personally, I find it confusing, particularly when they use parentheses both in this way and in the more usual way of clarifying, rather than replacing, something preceding.

              1. 3

                Something similar is used commonly in mathematical writing in English, but only along with the word “respectively” or the abbreviation “resp.”. See https://linguaphiles.livejournal.com/2058743.html.

              2. 2

                I don’t have a source, but I’ve definitely seen it, although it’s not super common. E.g.: A right- (resp: left-) handed person is someone who writes with their right (left) hand, and is usually left- (right-) footed.

                1. 1

                  I wouldn’t consider that the same phenomenon, when it’s marked with “resp.” (after German “bzw.”), whereas here and in the EoM it’s unmarked and looks the same as the standard use of parentheses to clarify what was just said.

            2. 1

              When one is writing a piece like the OP, the goal is to be understood by the audience (whoever that may be). Conforming to standard syntax is not usually a goal, in and of itself; it’s desirable mostly inasmuch as using familiar grammar will help you to be understood. Saying “this text may be confusing, but it’s standard” misses the point: the goal was never to be standard—it was to be understandable!

        2. 2

          Fair enough, silly of me to think so. pushcx is right that I think “Outer (left) join” is confusing/misleading.

      2. 1

        Looking at both Postgres and SQL Server, LEFT JOIN and LEFT OUTER JOIN are equivalent as are JOIN and INNER JOIN. Both INNER and OUTER are optional keywords, included for compatibility.

        Am I missing something?

        1. 2

          A left join is an outer join, but not all outer joins are left joins.

          In the tutorial whiteboard, I called a full outer join simply “outer join”, which is maybe confusing you, and was probably a bad idea on my part.

    14. 24

      Okay, but it’s not a Mastodon instance. It’s a static ActivityPub instance that’s extremely Mastodon-compatible…

      1. 14

        It’s actually a lot closer to what I want: a way of publishing a static blog that makes it easy for Mastodon users to follow.

        The follower count is interesting. The article notes that Mastodon doesn’t check it but misses the fact that Mastodon can’t possibly check it. In a federated system, a single instance shouldn’t let untrusted instances see who on that system is following people, for privacy reasons (and it may also want to avoid sharing aggregated information if the relevant anonymity sets are sufficiently small). Even if it did, there’s nothing stopping me from creating a Mastodon instance with 10,000,000 accounts all following one person.

        It might be interesting to do some kind of aggregated thing of ‘these 20 instances all have 100+ followers of this account’, presenting something signed by each of those servers’ private keys, which would suggest that a person has a broad set of followers. You could trivially fake this by having a single machine host 20 domains, but then other instances can rely on their own reputation rankings for the following servers and so ignore ones where no one on your instance is following them and so on.

        I guess ‘number of followers’ has become important for ‘influencers’ and so Mastodon needs to present this number, but hopefully the ease with which it can be faked will make it fall out of use if Fediverse things take off.

        It’s a shame that there isn’t something in the various Fediverse protocols for privacy-preserving aggregated analytics, where small instances can nominate one or more systems that will track their reputation and share it with other instances.

        1. 6

          ActivityPub really feels like a step back from RSS/Atom if your use case is hosting a static blog. Instead of just publishing a static XML file, now you need a server that can respond dynamically according to a much more complicated protocol. The server also needs to persist data (the list of followers) and it needs to hit a bunch of other servers whenever the blog is updated.

          This model seems way worse for privacy, too. Following an RSS feed is inherently semi-anonymous, and you can specify your own User-Agent header and use a proxy/VPN if you want to blend into the crowd even more. Following an ActivityPub-based blog without exposing your identity would require even more work than is described in the OP.

          1. 4

            Agreed. I publish a blog, like a normal person, and I wrote a janky Perl script to echo the RSS content to a fedi account. If you know Python you can even dispense with the jank!

            Trying to force SSG blogging into the mold of current ActivityPub just feels like a roundabout way to get a dead blog in a couple of months.

          2. 3

            Following an ActivityPub-based blog without exposing your identity would require even more work than is described in the OP.

            It’s not that much harder, you can just keep polling the outbox like you’d do with an RSS feed. And on the plus side you don’t have to deal with XML and you can paginate instead of getting cut off at an arbitrary number of items.

            curl -s https://universeodon.com/users/georgetakei/outbox\?page\=true | jq -r '.orderedItems[]|select(.type=="Create")|[.published,.object.content]|@tsv'|sed -e 's/<[^>]*>//g'

            1. 2

              Now that you mention it, I think I remember that Mastodon offers a ready-made RSS feed of a user’s outbox?

              Still, though, (1) that statement is true for Mastodon, but probably not for most other ActivityPub software, and (2) even if readers can consume RSS just like they would for “static site” blogs, there’s still a lot of complexity for the publisher.

        2. 2

          Personally, what I’m interested in, is what you wrote (easy for Mastodon users to follow), plus comments via Mastodon. Meaning, I’d like to make it easy for Mastodon users to post replies to my blogposts, which would then be automatically appended into the HTML of the blogpost (that they replied to) as comments. Such that afterwards the page with the comments could still be served as a static HTML.

          Perhaps obviously, it would also need some HTML sanitization, rate-/capacity-limiting, and moderation tools. For moderation, hopefully some kind of a static blocklist of users & servers, checked before appending the comment, might be enough? (Plus a way to purge any matching already-appended comments when I add new entries to the blocklist.) Then, hopefully, the capacity limit would be just a failsafe until I add the offending spammer to the blocklist.

    15. 6

      Every one of them was via Apple Pay, which does not do the typo check as Apple tells us the email directly.

      That means these people have Apple IDs in the CON domain. I’m sure that’s working out really well for them.

      When you use Apple Pay to pay for something, there’s no requirement that you use the email address associated with your Apple ID. By default, the interface offers you the email addresses that are on your contact card; you can also type in a different one.

      (Someone in the blog comments makes a similarly uninformed guess about how Apple Pay works, earning them an acerbic reply from jwz. Irony!)

    16. 2

      Looks like Pagefind isn’t available through Nix. It seems like the build product is (more or less) a single binary, so conceptually it should be easy enough to package with Nix, but some include_bytes calls during the build are failing to find the files they’re supposed to.

        1. 2

          Ooh, thank you!

    17. 5

      On iOS safari, this site renders every differently-styled word with a huge amount of horizontal space surrounding it. I’d love to read these posts but this is so distracting that I’m having a hard time doing it in the situations when I’m browsing lobsters.

      1. 4

        Looks like the culprit is the rule

        @media only screen and (max-width: 1000px) {
            main > * > * {
                padding: 0 4rem;
            }
        }
        

        for whatever that’s worth.

      2. 3

        Ohp, thanks! I’ll get this to the team and try and get that solved.

      3. 2

        Oh wow, then don’t look at the newer one about schemas (fired up the iPhone because I was curious), this one is very readable :P

    18. 45

      for signing events and requests to work, matrix expects the json to be in canonical form, except the spec doesn’t actually define what the canonical json form is strictly

      I’m astonished by how often this mistake is repeated. I’ve been yelling into the void about it for what feels like an eternity, but I’ll yell once more, here and now: JSON doesn’t define, specify, guarantee, or even in practice reliably offer any kind of stable, deterministic, or (ha!) bijective encoding. Which means any signature you make on a JSON payload is never gonna be sound. You can’t sign JSON.

      If you want to enforce some kind of canonicalization of JSON bytes, that’s fine!! and you can (maybe) sign those bytes. But that means that those bytes are no longer JSON! They’re a separate protocol, or type, or whatever, which is subject to the rules of your canonical spec. You can’t send them over HTTP with Content-Type: application/json, you can’t parse them with a JSON parser, etc. etc. with the assumption that the payload will be stable over time and space.

      1. 10

        Oh god, I thought we had learned our lesson from Secure Scuttlebutt. Come on people.

        1. 8

          For anyone else who didn’t know what this referred to, a bit of searching led me to this post, which I did a Find for “JSON” in.

          Edit: adding quotes around “JSON”.

      2. 9

        canonical json is actually pretty well defined in some matrix spec appendix if i recall?

        1. 9

          Matrix Specification - Appendices § 3.1. Canonical JSON. I haven’t reviewed to see just how “canonical” it is/whether it truly excludes all but one interpretations/productions of a given object etc., but that’s been part of the spec since no later than v1.1 (November 2021), maybe earlier.

      3. 6

        Doesn’t https://www.rfc-editor.org/rfc/rfc8785 specify a good enough canonical form?

        1. 18

          It’s a perfectly lovely canonical form, but it’s not mandatory. JSON parsers will still happily accept any other non-canonical form, as long as it remains spec-compliant. Which means the JSON payloads {"a":1} and { "a": 1 } represent exactly the same value, and that parsers must treat them as equivalent.

          If you want well-defined and deterministic encoding, which produces payloads that can be e.g. signed, then you need guarantees at the spec level, like what e.g. CBOR provides. There are others. (Protobuf is explicitly not one!!)
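
          As a minimal illustration (a sketch using nothing beyond the JDK; a SHA-256 hash stands in for whatever signature scheme you’d actually use):

          import java.security.MessageDigest

          fun sha256Hex(bytes: ByteArray): String =
              MessageDigest.getInstance("SHA-256").digest(bytes)
                  .joinToString("") { "%02x".format(it) }

          fun main() {
              // Two byte sequences that every JSON parser must treat as the same value...
              val a = """{"a":1}""".toByteArray()
              val b = """{ "a": 1 }""".toByteArray()

              // ...hash (and therefore sign and verify) differently, because the
              // signature covers bytes, not the abstract JSON value.
              println(sha256Hex(a) == sha256Hex(b))  // false
          }

          Any re-encoding along the way (a proxy compacting whitespace, a library reordering keys) breaks the signature even though the JSON value is untouched.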

          1. 8

            Of course other forms are equivalent, but only one is canonical. That’s what the word means.

            1. 7

              Sure, if you want to parse received JSON payloads and then re-encode them in your canonical form, you can trust that output to be stable. Just as long as you don’t sign the payload you received directly!

          2. 4

            That reminds me, I wonder what the status is on low bandwidth Matrix, which uses CBOR.

            When I read about it, I was wondering why a high bandwidth Matrix would be the default, if they do the same thing. Now I wonder for more reasons.

          3. 1

            Can you say more about Protobuf not guaranteeing a deterministic encoding at the spec level? Is it that the encoding is deterministic for a given library, but this is left to the implementation rather than the spec? Does the spec say something about purposely leaving this open?

            1. 2

              Protobuf encoding is explicitly defined to be nondeterministic and unstable.

              https://protobuf.dev/programming-guides/encoding/#order

              When a message is serialized, there is no guaranteed order for how its known or unknown fields will be written. Serialization order is an implementation detail, and the details of any particular implementation may change in the future.

              Do not assume the byte output of a serialized message is stable.

              By default, repeated invocations of serialization methods on the same protocol buffer message instance may not produce the same byte output. That is, the default serialization is not deterministic.

              https://protobuf.dev/programming-guides/dos-donts/#serialization-stability

              Never Rely on Serialization Stability

              Needless to say, you should never sign a protobuf payload :)

              1. 1

                Thanks I think I’m gonna use Rivest’s S-expressions.

              2. 1

                Serialization order is an implementation detail, and the details of any particular implementation may change in the future.

                The implementation is explicitly allowed to have a deterministic order.

                At the spec level it’s undefined.

                At the implementation level, it may be defined. That’s typical for all such technologies across the industry.

                Never Rely on Serialization Stability Across Builds

                An important detail was omitted.

                1. 1

                  The implementation is explicitly allowed to have a deterministic order.

                  At the spec level it’s undefined.

                  Yes, which is my point — unless you’re operating in a hermetically sealed environment, senders can’t assume anything about the implementation of receivers, and vice versa. You can maybe rely on the same implementation in an e.g. unit test, but not in a running process. The only guarantees that can be assumed in general are those established by the spec.

                  Across Builds

                  Exact same thing here — modulo hermetically sealed environments, senders can’t assume anything about the build used by receivers, and vice versa.

        2. 1

          Tangentially, reading the spec,

          Sorting of Object Properties […] formatted as arrays of UTF-16

          JSON is UTF-8, not UTF-16.

          That spec should specify Unicode code point order (Unicode, ASCII, UTF-8, and UTF-32 all share the same sorting order), not UTF-16, as UTF-16 sorts some code points out of order. That was one of the reasons UTF-8 was created.

          Also, we don’t sort JSON object keys for cryptography. Order is inherited from the UTF-8 serialization for verification. Afterward, the object may be unmarshalled however seen fit. This allows arbitrary order.

      4. 6

        One does not sign JSON, one signs a bytearray. That multiple JSON serializations can have the same content does not matter. One could even argue that it’s a feature: the hash of the bytearray is less predictable which makes it more secure.

        I do not get the hangup on canonicalization. Just keep the original bytearray with the signature: done.

        Lower in this thread a base64 encoding is proposed. Nonsense, just use the bytearray of the message. What the internal format is, is irrelevant. It might be JSON-LD, RDF/XML, Turtle, it does not matter for the validity of the signature. The signature applies to the bytearray: this specific serialization.

        Trying to deal with canonicalization is a non-productive intellectual hobby that makes specifications far too long, complex and error prone. It hinders adoption of digital signatures.

        1. 5

          Nonsense, just use the bytearray of the message.

          A JSON payload (byte array) is explicitly not guaranteed to be consistent between sender and receiver.

          What the internal format is, is irrelevant. It might be JSON-LD, RDF/XML, Turtle, it does not matter for the validity of the signature. The signature applies to the bytearray: this specific serialization.

          This is very difficult to enforce in practice, for JSON payloads particularly.

          1. 5

            Of course a bytearray is consistent. There’s a bytearray. It has a hash. The bytearray can be digitally signed. Perhaps the bytearray can be parsed as a JSON document. That makes it a digitally signed JSON document. It’s very simple.

            Data sent from sender to receiver is sent as a bytearray. The signature will remain valid for the bytearray. Just don’t try to parse and serialize it and hope to get back the same bytearray. That’s a pointless exercise. Why would you do that? If you know it will not work, don’t do it. Keep the bytearray.

            What is hard to enforce? When I send someone a bytearray with a digital signature, they can check the signature. If they want to play some convoluted exercise of parsing, normalizing, serializing and hoping for the same bytearray, you can do so, but don’t write such silliness in specifications. It just makes them fragile.

            Sending bytearrays is not hard to do, it’s all that computers do. Even in browsers, there is access to the bytearray.

            Canonicalization is premature optimization.

            1. 5

              Of course a bytearray is consistent. There’s a bytearray. It has a hash. The bytearray can be digitally signed. Perhaps the bytearray can be parsed as a JSON document. That makes it a digitally signed JSON document. It’s very simple.

              If you send that byte array in an HTTP body with e.g. Content-Type: octet-stream, yes — that marks the bytes as opaque, and prevents middleboxes from parsing and manipulating them. But with Content-Type: application/json, it’s a different story — that marks the bytes as representing a JSON object, which means they’re free to be parsed and re-encoded by any middlebox that satisfies the rules laid out by JSON. This is not uncommon; CDNs will sometimes compact JSON as an optimization. And it’s this case I’m mostly speaking about.

              I’m not trying to be difficult, or speculating about theoreticals, or looking for any kind of argument. I’m speaking from experience, this is real stuff that actually happens and breaks critical assumptions made by a lot of software.

              If you sign a JSON encoding of something, and include the bytes you signed directly alongside the signature as opaque bytes — i.e. explicitly not as a sibling or child object in the JSON message that includes the signature — then no problem at all.

              tl;dr: sending signatures with JSON gotta be like {"sig":"XXX", "msg":"XXX"}
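
              For example, a rough Go sketch of that shape (hypothetical envelope type; Ed25519 just for brevity): the signed bytes travel as an opaque base64 string, so intermediaries can reflow the outer JSON however they like without touching what was actually signed.

              package main

              import (
                  "crypto/ed25519"
                  "encoding/base64"
                  "encoding/json"
                  "fmt"
              )

              // envelope is a hypothetical wrapper: msg holds the exact signed bytes,
              // base64-encoded so JSON treats them as an opaque string.
              type envelope struct {
                  Sig string `json:"sig"`
                  Msg string `json:"msg"`
              }

              func main() {
                  pub, priv, _ := ed25519.GenerateKey(nil) // nil means crypto/rand

                  payload := []byte(`{"hello":"world"}`) // whatever bytes the sender produced
                  env := envelope{
                      Sig: base64.RawURLEncoding.EncodeToString(ed25519.Sign(priv, payload)),
                      Msg: base64.RawURLEncoding.EncodeToString(payload),
                  }
                  wire, _ := json.Marshal(env)

                  // Receiver: verify over the exact msg bytes, then (and only then)
                  // parse them as JSON if needed.
                  var got envelope
                  json.Unmarshal(wire, &got)
                  msg, _ := base64.RawURLEncoding.DecodeString(got.Msg)
                  sig, _ := base64.RawURLEncoding.DecodeString(got.Sig)
                  fmt.Println("verified:", ed25519.Verify(pub, msg, sig)) // true
              }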

              1. 5

                Such CDNs would break Subresource Integrity and ETag caching. Compression is a much more powerful optimization than removing a bit of whitespace, so minifying in transit is both broken and inefficient. Changing the content in any way based on a mimetype is dangerous. If a publisher uses a CDN with such features, they should know to disable them when the integrity of the content matters.

                I’m sending all my mails with a digital signature (RFC 4880 and 3156). That signature is not applied to a canonicalized form of the mail apart from having standardized line endings. It’s applied to the bytes. Mail servers should not touch the content other than adding headers.

                1. 3

                  Changing the content in any way based on a mimetype is dangerous.

                  Dangerous or not, if something says it’s JSON, it’s subject to the rules defined by JSON. A proxy that transforms the payload according to those rules might have to intermediate on lower-level concerns, like Etag (as you mention). But doing so would be perfectly valid.

                  And it’s not limited to CDNs. If I write a program that sends or receives JSON over HTTP, any third-party middleware I wire into my stack can do the same kind of thing, often without my knowledge.

                  I’m sending all my mails with a digital signature (RFC 4880 and 3156). That signature is not applied to a canonicalized form of the mail apart from having standardized line endings. It’s applied to the bytes. Mail servers should not touch the content other than adding headers.

                  Yes, sure. But AFAIK there is no concept of a “mail object” that’s analogous to a JSON object, is there?

                  1. 2

                    Dangerous or not, if something says it’s JSON, it’s subject to the rules defined by JSON.

                    A digital signature does not apply to JSON. It applies to a bytearray. If an intermediary is in a position to modify the data it transmits and does not pass along a bytearray unchanged, it’s broken for the purpose of passing on data reliably and should not be used.

                    Canonicalization cannot work sustainably because as soon as it does, some new ruleset is thought up by people who enjoy designing puzzles more than creating useful software. Canonicalization has a use when you want to compare documents, but it is a liability in the context of digital signatures.

                    A digital signature is meant to prove that a bytearray was endorsed by an entity with a private key.

                    If any intermediary mangles the bytearray, the signature becomes useless and the intermediary should be avoided. An algorithm that tries to undo the damage done by broken intermediaries is not the solution. Either the signature matches the bytearray or it does not.

                    1. 2

                      A digital signature does not apply to JSON. It applies to a bytearray.

                      100% agreement.

                      If an intermediary is in a position to modify the data it transmits and does not pass along a bytearray unchanged, it’s broken for the purpose of passing on data reliably and should not be used.

                      Again 100% agreement, which supports my point that you can’t sign JSON payloads, because JSON explicitly does not guarantee that any encoded form will be preserved reliably over any transport!

                      1. 2

                        JSON explicitly does not guarantee that any encoded form will be preserved reliably over any transport!

                        Citation needed. I can read nothing about this in RFC 8259. Perhaps your observation is a fatalist attitude that springs from working with broken software. Once you allow this for JSON, what’s next? Re-encoding JPEGs, adding tracking watermarks to documents? No transport should modify the payload that it is transporting. If it does, it’s broken.

                        There is no guarantee about the behavior of transports in the JSON RFC 8259. There is also no text that allows the serialization to change for certain transports.

                        1. 1

                          Once you allow this for JSON, what’s next? Re-encoding JPEGs, adding tracking watermarks to documents?

                          Yes, sure. If the payloads are tagged as specific things with defined specs, intermediaries are free to modify them in any way that doesn’t violate the spec. This isn’t my speculation, or fatalism, it’s direct real-world experience.

                          No transport should modify the payload that it is transporting. If it does, it’s broken.

                          If you want to ensure that your payload bytes aren’t modified, then you need to make sure they’re opaque. If you want to send such bytes in a JSON payload, you need to mark the payload as something other than JSON, or encode those bytes in a JSON string.

        2. 4

          You might be missing the core info about why many signed JSON APIs are trash: they include the signature in the same JSON document as the thing they sign:

          {
              "username": "Colin",
              "message": "Hi!",
              "signature": "some base 64 string"
          }
          

          The signature is calculated for a JSON serialization of a dict with, in this example, the keys username and message, then the signature key is added to the dict. This modified dict is serialised again and sent over the network.

          This means that the client doesn’t have the original byte array. It needs to parse the JSON it was given, remove the signature key, and then serialize again in some way that generates exactly the same bytes, and only then can it verify the signature over those bytes to validate the message.

          This is clearly completely bonkers, but several protocols do variations on this, including Matrix, Secure Scuttlebutt, and whatever this is https://cyberphone.github.io/doc/security/jsf.html#Sample_Object

          The PayPal APIs do the thing you’re thinking of: they generate some bytes (which you can parse to JSON) and provide the signature as a separate value (as an HTTP header, I think).

          @peterbourgon’s suggestion also avoids the core issue and additionally protects against middle boxes messing with the bytes (which I agree they shouldn’t do, but they do so 🤷) and makes the easiest way of validating the signature also the correct way.

          (If the application developer’s web framework automatically parses JSON then you just know that some of them are going to remove the signature key, reserialise and hash that (I’ve seen several people on GitHub try to do this with the JSON PayPal produces))

          The PayPal way is fine, but you then get into the question of how to transmit two values instead of one. You can use HTTP headers or multipart encoding, but now your protocol is tied to HTTP and users need to understand those things as well as JSON. Peter’s suggestion requires users only to understand JSON and some encoding like base64.

          A final practical point: webservers sometimes want to consume the request body and throw it away if they can parse it into another format (elixir phoenix does this, for efficiency, they say), so your users may need to provide a custom middleware for your protocol and get it to run before the default JSON middleware, which is likely to be more difficult for them than turning a base64 string back into JSON.

      5. 5

        likewise, it really frustrates me. I’m not surprised, just annoyed, because it’s an aspect of things that always gets fixed as an afterthought in cryptography-related standards…

        nobody likes ASN.1, especially not the experts in it, but it exists for a reason. text-based serialization formats don’t canonicalize easily and specifying a canonicalization is extra work. even some binary formats, such as protocol buffers, don’t necessarily use a canonical form (varints are the culprit there).

        1. 5

          ASN.1 does not help with canonicalization either. It has loads of different wire encodings, e.g. BER, PER. For cryptographic purposes you must use DER, which is BER with extra rules to say which of the many alternative forms in BER must be used, e.g. forbidding encodings of integers with leading zeroes.

          1. 2

            yes, that’s fair.

      6. 4

        Huge Cosmos SDK vibes.

        Signing messages was an entire procedure involving ordering JSON fields alphanumerically, minifying and then signing the hash.

        So many hours have been spent because a client, typically not written in Go, would order a field differently, yielding a different hash.

        Good times.

        1. 3

          Brother, I’ve got some stories. I’ve actually filed a CVE to the Cosmos SDK for a signing-related issue. (Spoiler: closed without action.)

          1. 3

            Yup, sounds like an SDK episode to me.

            I think I remember seeing your name on a GitHub issue conversation, with the same couple of “adversaries” justifying their actions lol.

            I distanced myself from that ecosystem both professionally and hobby-wise because I did not like how the tech stack was implemented, and how the governance behaved.

            Although most of the bad decisions have been inherited from a rather… peculiar previous leadership.

      7. 2

        A solution that I like for this is base64 encoding the json, and signing the base64 blob.

        Which is a roundabout way to agree: don’t sign json.

        1. 9

          …but this has the same problem? If you reorder the keys in an object in the JSON, you’re going to get a different base64 string.

          1. 7

            No. The point is that you get a different base64 string. It makes it obvious that the message was tampered with.

            The problem is that when canonicalizing json, there are multiple json byte sequences that can be validated with a given signature.

            A bug in canonicalizing may lead to accepting a message that should not have been accepted. For example, you may have duplicate fields. One json parser may take the first duplicate, one may take the last, and if you canonicalized after parsing and passed the message along, now you can inject malicious values:

            {
               "signed-field": "good-value",
               "signed-field": "malicious-payload"
            }
            

            You may say “but if you follow the RFC, don’t use the stock json libraries that try to make things convenient, and are really careful, you’re protected”. You’d be right, but it’s a tall order.

            With base64, there’s only one message that will validate with a given signature (birthday attacks aside). It’s much harder to get wrong.
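
            To make that concrete, a small Go sketch (stdlib only): encoding/json happens to keep the last duplicate, while a parser that keeps the first would see a different value for the very field that was signed.

            package main

            import (
                "encoding/json"
                "fmt"
            )

            func main() {
                // Most parsers accept duplicate keys; the spec only calls the
                // resulting behavior unpredictable, so implementations disagree.
                raw := []byte(`{"signed-field":"good-value","signed-field":"malicious-payload"}`)

                var m map[string]string
                if err := json.Unmarshal(raw, &m); err != nil {
                    panic(err)
                }
                fmt.Println(m["signed-field"]) // malicious-payload (Go keeps the last one)
            }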

          2. 4

            Well, not exactly. {"a":1} and { "a": 1 } are different byte sequences, and equivalent JSON payloads. But the base64 encodings of those payloads are different byte sequences, and different base64 payloads – base64 is bijective. (Or, at least, some versions of base64.)

            1. 5

              Another way to phrase this is that it makes it hard to shoot yourself in the foot. If you get straight JSON over the wire, what do you do? You need to parse it in order to canonicalize it, but your JSON parser probably doesn’t parse it the way you need it to in order to canonicalize it for verification, so now you have to do a bunch of weird stuff to try and parse it yourself, and maybe serialize a canonicalized version again just for verification, etc.

              The advantage of using base64 or something like it (e.g. straight hex encoding as mentioned in your sibling comment) is that it makes it obvious that you should stop pretending that you can reasonably sign a format that can’t be treated as “just a stream of bytes” (because you can’t - a signature over a stream of bytes is the only cryptographic primitive we have, so what you’re actually doing by “canonicalizing JSON” is turning the JSON into a stream of bytes, poorly) and just sign something that is directly and solely a stream of bytes.

              Edit: the problem with this is that you’ve now doubled your storage cost. The advantage of signing JSON is that you can deserialize, store that in a database alongside the signature, and reconstruct functionally the same thing if you need to retransmit the original message (for example to sync a room up to a newly-joined Matrix server). If you’re signing base64/hex-encoded blobs, you now need to store the original message that was signed, rather than being able to reconstruct it on-the-fly. But a stream of bits isn’t conducive to e.g. database searches, so you still have to store the deserialized version too. Hence: 2x storage.

              1. 3

                Another way to phrase this is that it makes it hard to shoot yourself in the foot. If you get straight JSON over the wire, what do you do? You need to parse it in order to canonicalize it,

                Even doing that much I would consider to be a success!

                One, it’s rare that a canonical form is even defined, and rarer still that it’s defined in a way that’s actually unambiguous. I’m dubious that Matrix’s canonical JSON spec (linked elsewhere) qualifies.

                Two, even if you have those rules, it’s rare that I’ve ever seen code that follows them. Usually a project will assume the straight JSON from the wire is canonical, and sign/verify those wire bytes directly. Or, it might parse the wire bytes into a value, but then it will sign/verify the bytes produced by the language default JSON encoder, assuming those bytes will be canonical.

            2. 4

              I don’t understand why a distinction between reordering keys and changing whitespace needs to be made. Are they treated differently in the JSON RFC?

              equivalent JSON payloads

              Equivalent according to whom? The JSON RFC doesn’t define equality.

              Are you simply saying that defining a canonical key ordering wouldn’t be sufficient since you’d need to define canonical whitespace too? If so, I don’t understand why it contradicts bdesham’s comment, since they just gave a single example of what base64 doesn’t canonicalize.

              1. 4

                I don’t understand why a distinction between reordering keys and changing whitespace needs to be made. Are they treated differently in the JSON RFC?

                I didn’t mean to distinguish key order and whitespace. Both are equally and explicitly defined to be arbitrary by the JSON spec.

                Equivalent according to whom? The JSON RFC doesn’t define equality.

                Let me rephrase: {"a":1,"b":2} and {"b":2,"a":1} and { "a": 1, "b": 2 } are all different byte sequences, but represent exactly the same JSON object. The RFC specifies JSON object equality to at least this degree — we’ll ignore stuff like IEEE float precision 😉 If you defined a canonical encoding, your parser would reject non-canonical input, which isn’t permitted by the JSON spec, and means you’re no longer speaking JSON.

                1. 4

                  The RFC specifies JSON object equality to at least this degree

                  I don’t think so. At least RFC 8259 doesn’t identify any (!) of those terms. (It can’t for at least two reasons: it doesn’t know how to compare strings, and it explicitly says ordering of kv pairs may be exposed as semantically meaningful to consumers.)

                  JSON is semantically hopeless.

                  1. 2

                    RFC 8259 … explicitly says ordering of kv pairs may be exposed as semantically meaningful to consumers

                    Where? I searched for “order” and didn’t find anything that would imply this conclusion, AFAICT.

                    Here’s what I did find:

                    An object is an unordered collection of zero or more name/value pairs, where a name is a string and a value is a string, number, boolean, null, object, or array.

                    and

                    JSON parsing libraries … differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences

                    which to me seems to pretty clearly say that order can’t matter to implementations. Maybe I’m misreading.

                    JSON is semantically hopeless.

                    JSON is an encoding format that’s human-readable, basically ubiquitous, and more or less able to express what most people need to express. These benefits hugely outweigh the semantic hopelessness you point out, I think.

                    1. 4

                      I think you did misread it, I’m afraid.

                      Those are the quotes I mean, particularly the latter one:

                      JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible to calling software. Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.

                      Left unsaid is that implementations that do depend on or expose member ordering may not be interoperable in that sense. And we know they are still implementations of JSON because of the first sentence there. (“Left unsaid” in that one can infer that anything goes from the first sentence taken with the contrapositive of the second.) Slightly weaselly language like this exists throughout the RFC, including in areas related to string and number comparison. If I understand correctly, while many of those involved wanted to pin down JSON’s semantics somewhat, they could not reach agreement.

                      JSON is an encoding format that’s human-readable, basically ubiquitous, and more or less able to express what most people need to express. These benefits hugely outweigh the semantic hopelessness you point out, I think.

                      You might be right. That “more or less” gives me the heebie-jeebies though, because without semantics, the well-known security and interoperability problems will just keep happening. People never really just use JSON, there’s always some often-unspoken understanding about a semantics for JSON involved. Otherwise they couldn’t communicate at all. (The JSON texts would have to remain uninterpreted blobs.) And where parties differ in the fine detail of that understanding, they will reliably miscommunicate.

                      1. 1

                        Implementations whose behavior does not depend on member ordering will be interoperable in the sense that they will not be affected by these differences.

                        I read this as supporting my interpretation, rather than refuting it. I read it as saying that implementations must be interoperable (i.e. produce equivalent outcomes) regardless of ordering.

                        Slightly weaselly language like this exists throughout the RFC, including in areas related to string and number comparison.

                        Totally agreed! And in these cases, implementations have no choice but to treat the full range of possibilities as possibilities, they can’t make narrower assumptions while still remaining compliant with the spec as written.

                        1. 2

                          Implementations whose behavior does not depend on member ordering […] will not be affected by these differences.

                          It’s a tautology. If you don’t depend on the ordering, you won’t be affected by the ordering. It doesn’t anywhere say that an implementation must not depend on the ordering.

                          The wording is very similar to the wording in sections regarding string comparison, which if I understand you correctly, you believe is an underdefined area. From section 8.3:

                          Implementations that [pick a certain strategy] are interoperable in the sense that implementations will agree in all cases on equality or inequality of two strings

                          Again unsaid: those that don’t may not so agree.

                          1. 1

                            It’s a tautology. If you don’t depend on the ordering, you won’t be affected by the ordering. It doesn’t anywhere say that an implementation must not depend on the ordering.

                            It says that

                            An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings.

                            Meaning, as long as object keys are unique, two JSON payloads with the same set of name-value mappings must be “interoperable” (i.e. semantically equivalent JSON objects) regardless of key order or whitespace, etc.

                            1. 2

                              No, it says they’ll agree on the name-value mappings. It doesn’t say anything there about whether they can observe or will agree on the ordering - that’s the purpose of the following paragraph, talking about ordering.

                              1. 1

                                Agreeing on name-value mappings is necessarily order-invariant. If this weren’t the case, then the object represented by {"a":1,"b":2} wouldn’t be interoperable with (i.e. equivalent to) the object represented by {"b":2,"a":1} — which is explicitly not the case.

                                1. 2

                                  Where does it say those objects are equivalent?

                                I put it to you that the RFC does not equate those objects, but says that JSON implementations that choose certain additional constraints - order-independence, a method of comparing strings, a method of comparing numbers - not required by the specification will equate those objects.

                                  The RFC is very carefully written to avoid giving an equivalence relation over objects.

                                  1. 1

                                    I understand “interoperable” to mean “[semantically] equivalent”.

                                    If this weren’t the case, then JSON would be practically useless, AFAICT.

                                    It’s not so complicated. The JSON payloads {"a":1,"b":2} and {"b":2,"a":1} must be parsed by every valid implementation into JSON objects which are equivalent. I hope (!) this isn’t controversial.

                                    1. 2

                                      The JSON payloads {“a”:1,“b”:2} and {“b”:2,“a”:1} must be parsed by every valid implementation into JSON objects which are equivalent

                                      Does JavaScript include a valid implementation of JSON? How would we test your assertion above in JavaScript?

                                      My proposal for testing this assertion would be this:

                                      const x = "{\"a\":1,\"b\":2}";
                                      const y = "{\"b\":2,\"a\":1}";
                                      if (JSON.parse(x) == JSON.parse(y)) {
                                        console.log("JSON implementation valid");
                                      } else {
                                        console.log("JSON implementation invalid");
                                      }
                                      

                                      Would you agree that this constitutes a valid test of the assertion?

                                      1. 1

                                        I’m no Javascript expert, so there may be details or corner cases at play in this specific bit of code. But, to generalize to pseudocode

                                        const x = `{"a":1,"b":2}`
                                        const y = `{"b":2,"a":1}`
                                        if parse(x) == parse(y) {
                                            log("valid")
                                        } else {
                                            log("invalid")
                                        }
                                        

                                        then yes I’d say this is exactly what I mean.

                                        edit: Yeah, of course JS defines == and === and etc. equality in very narrow terms, so those specific operators would say “false” and therefore wouldn’t apply. I’m referring to semantic equality, which I guess is particularly tricky in JS.

                                    2. 2

                                      I understand “interoperable” to mean “[semantically] equivalent”. If this weren’t the case, then JSON would be practically useless, AFAICT

                                      Exactly! Me too. I’m saying that every example of interoperability the spec talks about is couched in terms of “if your implementation chooses to do this, …”, i.e. adherence to the letter of the spec alone isn’t enough to get that interoperability. And the practical uselessness - yes, that’s what I believe. It’s fine when parties explicitly contract into a semantics overlaying the syntax of the RFC but all bets are off in cases of middleboxes, databases, query languages etc as far as the standard is concerned.

                                      The JSON payloads {“a”:1,“b”:2} and {“b”:2,“a”:1} must be parsed by every valid implementation into JSON objects which are equivalent. I hope (!) this isn’t controversial.

                                      This is of course a very sensible position, but it goes beyond the requirements of the RFC.

                                      1. 1

                                        This is of course a very sensible position, but it goes beyond the requirements of the RFC.

                                        I read the RFC as very unambiguously requiring the thing that I said, so if we don’t agree on that point, I guess we’ll agree to disagree.

                                2. 2

                                  A nitpick - if we wrote an encoding of a map as [[“a”,1],[“b”,2]] and another with the elements swapped I hope we should agree that the two lists contain the same set of name value mappings. Agreeing on the mappings when keys are disjoint (as required by the spec) is a different relation than equivalence of terms (carefully not defined by the spec), is what I’m trying to say.

                                  1. 2

                                    if we wrote an encoding of a map as [[“a”,1],[“b”,2]] and another with the elements swapped I hope we should agree that the two lists contain the same set of name value mappings.

                                    No, why would they? A name/value mapping clearly describes key: value pairs in an object, e.g. {"name":"value"}, nothing else.

                                    Maps (objects) are unordered by definition; arrays (lists, etc.) are ordered by definition. [["a",1],["b",2]] and [["b",2],["a",1]] are distinct; {"a":1,"b":2} and {"b":2,"a":1} are equivalent.

                                    1. 2

                                      They should be equivalent, on that we agree; but the standard on its own does not establish their equivalence. It explicitly allows for them to be distinguished.

                                      1. 1

                                        The RFC says that implementations must parse {"a":1,"b":2} and {"b":2,"a":1} to values which are interoperable. Of course implementations can keep the raw bytes and use them to differentiate the one from the other on that basis, but that’s unrelated to interoperability as expressed by the RFC. You know this isn’t really an interesting point to get into the weeds on, so I’ll bow out.

                                        edit: that’s from

                                        An object whose names are all unique is interoperable in the sense that all software implementations receiving that object will agree on the name-value mappings.

                                        1. 2

                                          I wish you’d point me to where in the RFC it says it “must” parse them identically, but fair enough.

        2. 3

          Yeah, something like this is necessary, but unfortunately there are multiple base64 encoding schemes 🥲 I like straight up hex encoding for this reason. No ambiguity, and not really that much bigger than base64, especially given that this stuff is almost always going through a gzipped HTTP pipe, anyway.

          1. 2

            I’ve done a lot of work in the area of base conversion (for example).

            For projects implementing a base 64, we suggest b64ut which is shorthand for RFC 4648 base 64 URI canonical with padding truncated.

            Base 64 is ~33% smaller than Hex. That savings was the chief motivating factor for Coze to migrate away from the less efficient Hex to base64. To address the issues with base 64, the stricter b64ut was defined.

            Here’s a small Go library that uses b64ut.

            base64 encoding schemes

            Here’s some notes comparing Hex and base 64 and the rationale justifying b64ut, and a GitHub issue concerning non-canonical base 64

            A little more on b64ut

            b64ut (RFC 4648 base 64 URI canonical with padding truncated) is:

            1. RFC 4648 uses bucket conversion and not iterative divide by radix conversion.
            2. The RFC specifies two alphabets, URI unsafe and URI safe, respectively: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ and ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. b64ut uses the safe alphabet.

            2.1. On a tangent, the RFC’s alphabets are “out of order”. A more natural order, from a number perspective but also an ASCII perspective, is to start with 0, so e.g. 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz would have been a more natural alphabet. Regardless, one of the two RFC alphabets is employed by b64ut. I use the more natural alphabet for all my bases when not using RFC base 64.
            3. b64ut does not use padding characters, but since the encoding method adds padding, they are subsequently “truncated”.
            4. b64ut uses canonical encoding. There is only a single valid canonical encoding and decoding, and they align. For example, non-canonical systems may interpret hOk and hOl as the same value. Canonical decoding errors on the non-canonical encoding.

            There are multiple RFC 4648 encoding schemes, and RFC 4648 only uses a single conversion method that we’ve termed the “bucket conversion” method. There is also the natural base conversion, which is produced by the “iterative divide by radix” method. Thankfully, natural and bucket conversion align when “buckets” (another technical term) are full and alphabets are in order. Otherwise, they do not align and encodings are mismatched.

            I made a tool to play with natural base conversions, and the RFC is available under the “extras” tab.
            https://convert.zamicol.com

            Here’s an example converting a binary string to a non-RFC 4648 base 64: https://convert.zamicol.com/#?inAlph=01&in=10111010100010111010&outAlph=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!%2523
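
            For what it’s worth, Go’s standard library gets close to the canonical-decoding point above (a rough sketch, not the Coze implementation; just stdlib encoding/base64): RawURLEncoding is the RFC 4648 URI-safe alphabet with padding dropped, and Strict() rejects encodings whose trailing padding bits are non-zero, e.g. hOl vs hOk.

            package main

            import (
                "encoding/base64"
                "fmt"
            )

            func main() {
                // RFC 4648 URI-safe alphabet, padding dropped, strict decoding:
                // roughly the b64ut shape described above.
                enc := base64.RawURLEncoding.Strict()

                fmt.Println(enc.DecodeString("hOk")) // [132 233] <nil>  (canonical)
                fmt.Println(enc.DecodeString("hOl")) // non-nil error: trailing bits aren't zero
            }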

            1. 2

              To my eyes, the two alphabets in point 2 in your comment look identical. What am I missing?

              1. 2

                You’re right!

                1: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_

                2: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

                1 being URI safe and 2 URI unsafe.

            2. 1

              Base 64 is ~33% smaller than Hex.

              And what’s the difference when those strings are gzipped (as effectively every such string will be)?

              1. 2

                Gzip isn’t always available and when it is it requires extra processing. Yes, I’d try to use gzip when available.

        3. 1

          JSON is marshaled into UTF-8 which is easily signed or verified.

          1. 1

            Again, UTF-8 doesn’t guarantee what you’re suggesting, here. UTF-8 guarantees properties of individual runes (characters), not anything about the specific order of runes in a string.

            1. 1

              UTF-8 by definition is a series of ordered bytes.

              1. 1

                Yes, for individual characters (runes). And a UTF-8 string is a sequence of zero or more valid UTF-8 characters (runes). But the order of those runes in a string is not relevant to the UTF-8 validity of that string.

                1. 1

                  Validity is tangential; the point is order, and UTF-8 is a series of ordered bytes.

                  I believe the following abstraction layer diagram fairly characterizes your view:

                  bits (ordered) -> 
                  byte (ordered) -> 
                  string/bytes (ordered) -> 
                  ASCII/UTF-8 (ordered) -> 
                  JSON (ordered except for object keys) 
                  

                  The jump from UTF-8 to JSON is where some order information may be considered to be lost in the narrow scope of object keys, while acknowledging all the rest of JSON is still ordered, including the explicitly ordered arrays.

                  Order information is present and is passed along this abstraction chain. Order information can only be considered absent after the UTF-8 abstraction layer. At the UTF-8 layer, all relevant order information is fully present.

                  1. 1

                    UTF-8 is a series of ordered bytes

                    This isn’t true in the sense that you mean. UTF-8 is an encoding format that guarantees a valid series of ordered bytes for individual characters (i.e. runes) — it doesn’t guarantee anything about the order of valid runes in a sequence of valid runes (i.e. a string).

                    At the UTF-8 layer, all relevant order information is fully present.

                    Within each individual character (rune), yes. Across multiple characters (runes) that form the string, no. That a string is UTF-8 provides guarantees about individual elements of that string only, it doesn’t provide any guarantee about the string as a whole, beyond that each element of the string is a valid UTF-8 character (rune).

                    Sending JSON payload bytes {"a":1} does not guarantee the receiver will receive bytes {"a":1} exactly; they can just as well receive { "a": 1 }, and the receiver must treat those payloads the same.

                    edit: This sub-thread is a great example of what I meant in my OP, for the record 😞

                    1. 1

                      UTF-8 is a series of ordered bytes. UTF-8 contains order information by definition.

                      That is the point: Order is present for UTF-8. Only after UTF-8 can order information finally start to be subtracted. Omitting order information at the UTF-8 abstraction layer is against UTF-8’s specification and is simply not permitted. Order information can only be subtracted after UTF-8.

                      JSON, by specification, marshals to and from UTF-8. In the very least, we have to acknowledge order information is available at the UTF-8 layer even if it is subtracted for JSON objects.

                      1. 1

                        UTF-8 is a series of ordered bytes. UTF-8 contains order information by definition.

                        You keep repeating this, but it isn’t true in the sense that you mean.

                        See

                        UTF-8 is an encoding for individual characters (runes). It defines a set of valid byte sequences for valid runes, and contains order information for the bytes comprising those valid runes. It does not define or guarantee or assert any kind of order information for strings, except insofar as a UTF-8 string is comprised of valid UTF-8 runes.

                        That JSON marshals to a UTF-8 encoded byte sequence does not mean that UTF-8 somehow enforces the order of all of the bytes in that byte sequence. Bytes in individual runes, yes; all the bytes in the complete byte sequence, no.

                        Order is present for UTF-8. Only after UTF-8 can order information finally start to be subtracted. Omitting order information at the UTF-8 abstraction layer is against UTF-8’s specification and is simply not permitted. Order information can only be subtracted after UTF-8.

                        I’m not sure what this means. UTF-8 asserts “order information” at the level of individual runes, not complete strings.

                        In the very least, we have to acknowledge order information is available at the UTF-8 layer even if it is subtracted for JSON objects.

                        UTF-8 does not provide any order information which is relevant to JSON payloads, except insofar that JSON payloads can reliably assume their keys and values are valid UTF-8 byte sequences.

                        1. 1

                          If UTF-8 was not ordered, the letters in this sentence would be out of order as this sentence itself is encoded in UTF-8.

                          UTF-8 by definition is ordered. This is a fundamental aspect of UTF-8. There’s nothing simpler that can be said because fundamental properties are the simplest bits of truth: UTF-8 is ordered. UTF-8 strings are a series of ordered bytes.

                          UTF-8 is a string. Order is significant for all strings. All strings are a series of ordered bytes.

                          UTF-8 does not provide any order information which is relevant to JSON payloads

                          Yes, it has order information.

                          JSON inherits order, especially arrays, from the previous abstraction layer, in this case, UTF-8. If this were not the case, how is order information known to JSON arrays, which are ordered? Where is the order information inherited from if not from the previous abstraction layer?

                          Edit:

                          UTF-8 asserts “order information” at the level of individual runes, not complete strings.

                          That is incorrect. UTF-8 by definition is a series of ordered bytes, which is the definition of a string. UTF-8 already exists in that paradigm. It does not need to further confine a property it already inherits. UTF-8 is a string encoding format.

                          1. 2

                            UTF-8 is a string encoding format.

                            https://en.wikipedia.org/wiki/UTF-8

                            UTF-8 is a variable-length character encoding standard used for electronic communication.

                            JSON inherits order, especially arrays, from the previous abstraction layer, in this case, UTF-8. If this were not the case, how is order information known to JSON arrays, which are ordered? Where is the order information inherited from if not from the previous abstraction layer?

                            The order of JSON arrays is part of the JSON specification. It’s completely unrelated to how JSON objects are marshaled to bytes, whether that’s in UTF-8 or any other encoding format.

                            Is the order of fields in a CSV file “inherited from” the encoding of that file?

                            If UTF-8 was not ordered, the letters in this sentence would be out of order as this sentence itself is encoded in UTF-8.

                            At this point I’m not sure how to respond in a way that will be productive. Apologies, and good luck.

                            1. 1

                              character encoding standard

                              Is in the context of strings. JSON doesn’t define UTF-8 as its encoding format for a single character. JSON defines UTF-8 as the character encoding format for strings. Strings are ordered. The entirety of UTF-8 is defined in the context of string encoding.

                              The order of JSON arrays is part of the JSON specification

                              When parsing a JSON array, where is the array’s order information known from? Of course, the source string contains the order. JSON parsers must store this order information for arrays as required by the spec. JSON inherits order from the incoming string.

                              1. 2

                                JSON defines arrays as ordered, and objects as unordered. The specific order of array elements in a JSON payload is meaningful (per the spec) and is guaranteed to be preserved, but the specific order of object keys is not meaningful and is not guaranteed to be preserved.

                                1. 1

                                  When JSON is unmarshalled from a string, where does an array’s order information come from? Does it come from the incoming string?

                                  1. 2

                                    When JSON is unmarshalled from a string, where does an array’s order information come from? Does it come from the incoming string?

                                    Yes, it does. But the important detail here is that JSON arrays have an ordering, whereas JSON maps don’t have an ordering. So when you encode (or transcode) a JSON payload, you have to preserve the order of values in arrays, but you don’t have to preserve the order of keys in objects.

                                    If you unmarshal the JSON payload {"a":[1,2]} to some value x, and the JSON payload {"a":[2,1]} to some value y of the same type, then x != y. But if you unmarshal the JSON payload {"a":1,"b":2} to some value x, and the JSON payload {"b":2,"a":1} to some value y of the same type, then x == y.

                                    Coze models the Pay field as a json.RawMessage, which is just the raw bytes as received. It also produces hashes over those bytes directly. But that means different pay object key order produces different hashes, which means key order impacts equivalence, which is no bueno.
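
                                    A minimal sketch of that equivalence claim, assuming stdlib encoding/json (unmarshal into interface{} and compare structurally):

                                    package main

                                    import (
                                        "encoding/json"
                                        "fmt"
                                        "reflect"
                                    )

                                    func main() {
                                        var x, y interface{}
                                        json.Unmarshal([]byte(`{"a":1,"b":2}`), &x)
                                        json.Unmarshal([]byte(`{"b":2,"a":1}`), &y)
                                        fmt.Println(reflect.DeepEqual(x, y)) // true: object key order doesn't matter

                                        var v, w interface{}
                                        json.Unmarshal([]byte(`{"a":[1,2]}`), &v)
                                        json.Unmarshal([]byte(`{"a":[2,1]}`), &w)
                                        fmt.Println(reflect.DeepEqual(v, w)) // false: array order does matter
                                    }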

                                    1. 1

                                      You can’t have it both ways. You can’t argue for JSON being both the pure abstract form and also a concrete string. JSON is not a string, JSON is an abstraction that’s serialized into a string; I agree with that. The abstract JSON is parsed from a concrete string, and strings carry order information. Obviously JSON is inheriting order from the abstraction layer above, which in this case is string (UTF-8). The order is there, as shown by arrays being ordered.

                                      When JSON is parsed from UTF-8, it is now in an abstract JSON form. When it’s serialized into UTF-8, it’s not the abstract JSON, it is now a string. It’s not both. I don’t see any issue categorizing JSON as a pure abstraction, however, the abstraction is solidified when serialized.

                                      JOSE, Matrix, Coze, PASETO all use UTF-8 ordering, and not only does it work well, but it is idiomatic.

                                      These tools do not verify or sign JSON; they sign and verify strings, a critical distinction. After that processing, the result may then be interpreted into JSON. These tools are a logical layer around JSON, and the JSON these tools process is JSON. In the example of Coze, not all JSON is Coze, but all Coze is JSON. That’s a logical hierarchy without a hint of logical conflict. As I like to say, that makes too much sense.

                                      I fully acknowledge your “JSON objects are unordered” standpoint, but after all this time I have no hesitation saying it’s without merit. Even if that were the case, in that viewpoint these tools are not signing JSON, they’re signing strings. All cryptographic primitives sign strings, not abstract unserialized formats. And that too is no problem; better still, JSON defines the exact serialization format. That’s the idiomatic bridge permitting signing. It’s logical, idiomatic, ergonomic, it works, but most of all, it’s pragmatic.

                                      If JSON said in its spec, “JSON is an abstract data format that prohibits serialization”, this would be a problem. But what use would such a tool be? If JSON said, “JSON objects are unordered and the JSON spec prohibits any order information being transmitted in its serialized form”, that too would be a problem, but why would it ever have such a silly prohibition? To say “can’t sign JSON because it’s unordered” is exactly that silly prohibition.

                                      1. 1

                                        When JSON is parsed from UTF-8, it is now in an abstract JSON form. When it’s serialized into UTF-8, it’s not the abstract JSON, it is now a string. It’s not both. I don’t see any issue categorizing JSON as a pure abstraction, however, the abstraction is solidified when serialized.

                                        My understanding of your position is: if user A serializes a JSON object to a specific sequence of (let’s say UTF-8 encoded) bytes (or, as you say, a string) and sends those bytes to user B, then — no matter how they are sent — the bytes that are received by B can be safely assumed to be identical to the bytes that were sent by A.

                                        Is that accurate?

                                        This assumption is true most of the time, but it’s not true always. How the bytes are sent is relevant. Bytes are not just bytes, they’re interpreted at every step along the way, based on one thing or another.

                                        If JSON serialized bytes are sent via a ZeroMQ connection without annotation, or over raw TCP, or whatever, then sure, it’s reasonable to assume they are opaque and won’t be modified.

                                        But if they’re sent as the body of an HTTP request with a Content-Type of application/json, then those bytes are no longer opaque, they are explicitly designated as JSON, and that changes the rules. Any intermediary is free to transform those bytes in any way which doesn’t violate the JSON spec and results in a payload which represents an equivalent abstract JSON object.

                                        These transformations are perfectly valid and acceptable and common, and they’re effectively impossible to detect or prevent by either the sender or the receiver.

                                        JOSE, Matrix, Coze, PASETO all use UTF-8 ordering, and not only does it work well, but it is idiomatic.

                                        The JSON form defined by JOSE represents signed/verifiable payloads as base64 encoded strings in the JSON object, not as JSON objects directly. This is a valid approach which I’m advocating for.

                                        Matrix says

                                        Signing an object … requires it to be encoded … using Canonical JSON, computing the signature for that sequence and then adding the signature to the original JSON object.

                                        Which means signatures are not made (or verified) over the raw JSON bytes produced by a stdlib encoder or received from the wire. Instead, those raw wire bytes are parsed into an abstract JSON object, that object is serialized via the canonical encoding by every signer/verifier, and those canonical serialized bytes are signed/verified. That’s another valid approach that I’m advocating for.

                                        The problem is when you treat the raw bytes from the wire as canonical, and sign/verify them directly. That isn’t valid, because those bytes are not stable.

                                        1. 1

                                          Coze speaks to Coze. Coze is JSON, JSON is not necessarily Coze. Coze is a superset, not a subset. Coze explicitly says that if a JSON parser ignores Coze, and does a Coze-invalid transformation, that Coze may be invalid.

                                          This is true for JOSE, Matrix, Coze, PASETO

                                          https://i.imgur.com/JYS7SFI.png

                                          The JSON form defined by JOSE represents signed/verifiable payloads as base64 encoded strings in the JSON object,

                                          Incorrect. There’s no logical difference between encoding to UTF-8 or base 64.

                                          This is exactly the mismatch. Since “JSON objects don’t define order”, any JWT implementation may serialize payloads in any order. Base 64 isn’t a magic fix for this.

                                          Of course, all implementations serialize into an order. That’s what serialization does by definition. And it doesn’t matter what the serialization encoding is, by definition, any serialization performs exactly this operation.

                                          It’s so obvious, so foundational, so implicitly taken for granted, that the fact is being overlooked.

      8. 1

        Regarding signing JSON, Peter and I have had a discussion going since March of this year.

        I think it’s fair to say that Peter’s position is that he’s concerned about signing JSON.

        Our position is that signing JSON is not problematic at all. We sign JSON (Coze) without incident using simple canonicalization, which is straightforward and easy to implement (Go implementation and Javascript implementation).

      9. 1

        Do you have a recommendation for a (relatively) painless serialization format that is bijective without having to jump through too many hoops?

        1. 5

          Doesn’t CBOR provide this, as mentioned in this comment by @peterbourgon ?

          https://lobste.rs/s/wvi9xw/why_not_matrix#c_eh9ogd

          1. 1

            Yeah, it’s probably as good as it gets. I guess I still need to sort maps manually, and be careful which types I use, in order to get the same output for equivalent input data, but I might be misremembering things. I’ll have another look at the details, I remember that dag-cbor was pretty close to what I needed when I looked last time, but it only allows a very limited set of types.

        2. 4

          It’s really hard! Bijectivity itself is easy, just take the in-memory representation of a value, dump the bytes to a hex string, and Bob’s your uncle. But that assumes two things (at least) which probably aren’t gonna fly.

          First, that in-memory representation is probably only useful in the language you produced it from — and maybe even the specific version of that language you were using at the time. That makes it impractical to do any kind of SDK in any other language.

          Second, if you extend or refactor your type in any way, backwards compatibility (newer versions can use older values) requires an adapter for that original type. Annoying, but feasible. But forwards compatibility (older versions can use newer values) is only possible if you plan for it from the beginning.

          There are plenty of serialization formats which solve these problems: Thrift, Protobuf, Avro, even JSON (if you squint), many others. But throw in bijective as another requirement, and I think CBOR is the only one that comes to mind. I would love to learn about some others, if anyone knows of some!

          But it’s a properly hard problem. So hard, in fact, that any security-sensitive projects worth its salt will solve it by not having it in the first place. If you produce the signed (msg) bytes with a stable and deterministic encoder, and — critically — you send those bytes directly alongside the signature (sig) bytes as values in your messages, then there’s no ambiguity about which bytes have been signed, or which bytes need to be verified. Which means you can use whatever encoder you want for the messages themselves — JSON can re-order fields, insert or remove whitespace between elements, etc., but it can’t change the value of a (properly-encoded) string. And because you don’t need to decode the msg bytes in order to verify the sig, you don’t need full bijectivity, in either encoder.

        3. 2

          https://preserves.dev/ (Disclaimer: it’s something I started)

          1. 2

            Thanks! This looks quite interesting! I’ll have a play with the Rust bindings and see what it can do. I haven’t looked in detail yet, but it looks like it plugs into serde, so it should be easy and cheap to try it out.

        4. 1

          I consider Coze’s approach simple.

      10. 1

        We sign JSON and it works just fine.

        Coze uses strict base 64 encoding and canonicalization. That’s all that’s needed to make JSON and signing work.

        In Coze, the canonical form is generated by three steps:

        1. Omit fields not present in canon.
        2. Order fields by canon.
        3. Omit insignificant whitespace.

        That’s it.

        JSON + Canonicalization allows signing/verification. Canonicalization is the key.
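
        As an illustration of those three steps (a sketch only, not the actual Coze library; among other things it doesn’t recurse into nested objects), something like this in Go:

            package main

            import (
                "bytes"
                "encoding/json"
                "fmt"
            )

            // Canonical applies the three steps above to a JSON object: keep only the fields
            // named in canon, emit them in canon order, and emit no insignificant whitespace.
            func Canonical(raw []byte, canon []string) ([]byte, error) {
                var obj map[string]json.RawMessage
                if err := json.Unmarshal(raw, &obj); err != nil {
                    return nil, err
                }
                var buf bytes.Buffer
                buf.WriteByte('{')
                first := true
                for _, name := range canon {
                    val, ok := obj[name]
                    if !ok {
                        continue // absent field; a real implementation might error instead
                    }
                    if !first {
                        buf.WriteByte(',')
                    }
                    first = false
                    key, _ := json.Marshal(name) // quoted JSON string for the key
                    buf.Write(key)
                    buf.WriteByte(':')
                    var v bytes.Buffer
                    _ = json.Compact(&v, val) // strip insignificant whitespace from the value
                    buf.Write(v.Bytes())
                }
                buf.WriteByte('}')
                return buf.Bytes(), nil
            }

            func main() {
                raw := []byte(`{ "msg": "hi", "extra": true, "alg": "ES256" }`)
                c, _ := Canonical(raw, []string{"alg", "msg"})
                fmt.Println(string(c)) // {"alg":"ES256","msg":"hi"}
            }

        Two implementations that agree on the canon then produce byte-identical output to hash, sign, and verify.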

    19. 20

      Why not punctuation? (/snark)

      I found Matrix usability to be quite frustrating.

      • Relentless spam in channels, that is hard to combat with people being blocked and coming back repeatedly.
      • A brutal user experience for using encrypted channels on multiple devices - the upstream matrix developers are aware of this. But if I login to Matrix, I get nagged about encrypted channels, and the user interface for accessing them is complete bobbins.
      • Bridges that only benefit the Matrix side of the bridge, and make all other communication tools objectively worse when a Matrix bridge is added. Every IRC, Telegram, Discord, or whatever channel I’ve been on has been degraded when Matrix bridges are added. Forcing people to leave, or forcing people over to Matrix.
      1. 9

        Relentless spam in channels, that is hard to combat with people being blocked and coming back repeatedly.

        I find this unsurprising but interesting to hear, after Mozilla’s stated big reason for shutting down its IRC server in favor of Matrix was that they thought Matrix would be easier to moderate.

        Bridges that only benefit the Matrix side of the bridge, and make all other communication tools objectively worse when a Matrix bridge is added. Every IRC, Telegram, Discord, or whatever channel I’ve been on has been degraded when Matrix bridges are added.

        I’m an IRC holdout in channels that are bridged with Matrix. Mostly it seems to work fine, and I’ve only ever seen it cause problems for the Matrix users and not for us IRC holdouts. ¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

        1. 5

          I find this unsurprising but interesting to hear, after Mozilla’s stated big reason for shutting down its IRC server in favor of Matrix was that they thought Matrix would be easier to moderate.

          I can’t speak to Mozilla’s reasoning but IME the spam is more pernicious in Matrix. One issue is that you block people by user@homeserver whereas on IRC you can +b *!*@someaddr to target a specific person. Sure, IP blocks are imperfect but in practice it seems easier for a malicious human to get another open signup on some homeserver than a new IP address.

          Surprisingly enough there isn’t a room permission concerning who is allowed to upload images or attachments, so anybody can do it. Shock images just show right up rather than hiding behind a link. Zip files whose filenames allege that they contain very illegal content will have links just sitting there, hosted not on some dodgy domain but proxied via your own homeserver. If I had to pick, I’d take the IRC spam.

          1. 5

            Surprisingly enough there isn’t a room permission concerning who is allowed to upload images or attachments, so anybody can do it.

            You can’t even block them altogether. I wrote a spec to do that, but my implementation of the backend part is stuck in code review since March, and I haven’t even tried to write the frontend yet :/

          2. 1

            Knocking is in the spec and soon in Element clients, if I’ve understood correctly. That can help some with spam.

      2. 7

        I don’t see a lack of punctuation, just capital letters - a common cause of RSI. Carolingian minuscule only used capital letters for arbitrary emphasis: the “rules” are baseless and ignoring them therefore valid. I only use capitals here because otherwise people bias against what you’ve written (I actually stopped upcasing letters and correcting typos when I noticed Ken Thompson, Rob Pike et al. didn’t bother in emails and such. If they don’t care, why should I?).

        1. 16

          the “rules” are baseless and ignoring them therefore valid. I only [follow the rules] because otherwise people bias against [you]

          This is pretty much the fundamental principle of human communication. The rules may all be made up, but if you are unable or unwilling to follow them then you shouldn’t be too surprised when people don’t want to communicate with you.

          1. 4

            While I agree that effective communication requires (some) consensus, I don’t see how choice of capitalization affects understanding. I may find all-lowercase prose aesthetically objectionable, but that doesn’t mean that I can’t read something written in lowercase.

            I feel like a lot of prescriptivist arguments take for granted the notion that prose which defies convention is inherently less understandable. That part of their high ground is that you must do things As They Should Be Done to be understood. On the contrary, I find prescriptivist arguments tend to concern prose that is understood, but looks bad. Only rarely do you get things that matter in some edge cases, e.g. the Oxford comma.

            1. 17

              The article provides some immediate examples - sentences like:

              first of all, a quick primer on what matrix actually is. even though element market matrix as the foundation of a chat app, it’s far more complicated than that.

              I literally had to go back and read multiple times because:

              • with no capital on “even” I mistook the full stop for a comma and misparsed the second sentence.
              • no capitals on proper nouns like Matrix and Element meant I had to do extra work to realise these were proper nouns. I literally got lost in what looked like a sea of normal nouns: “element market matrix”.

              Of course, capitals don’t remove all ambiguities. But every little helps.

              I then decided that reading the article wasn’t worth it. There are other things to read that don’t make me do this extra work.

              1. 4

                I’m pretty sure that’s a typo; “element markets matrix” would make it clear as day.

                1. 8

                  It could be treating Element as a plural entity - https://english.stackexchange.com/questions/133105/organisation-singular-or-plural - as a Brit, “Element market Matrix” doesn’t seem wrong and wouldn’t have tripped me up.

                2. 3

                  An element markets matrix sounds like something a quantitative analyst at a hedge fund might use in their code.

              2. 2

                Thank you! I was wondering what “element market matrix” meant. I assumed it was a typo.

            2. 4

              Oh, sure. I wasn’t attacking the choice of all-lowercase specifically, and I wasn’t trying to be prescriptive; I meant to point out that, pragmatically, if your communication style differs enough from others’ then you may have trouble getting them to listen to you. It’s irrelevant whether or not there’s an objective basis for the communication rules. These things evolve organically; it’s some kind of category error to treat them as if there’s an RFC somewhere listing the things that you MUST do when communicating in English.

              1. 4

                I think we agree on all points (presumably including aesthetics given how we’re writing these comments). I mostly am trying to say that as a society we tend to overvalue the importance of aesthetics in communication when the message can still be well understood. And I think we can do better.

                It would be nice if we could separate the message from the format so to speak, rather than tutting quietly when we see comic sans in a presentation.

                That said, I am being a little hypocritical here. I am guilty of disliking all-lowercase prose, word art, the use of the phrase “comprised of,” and so on.

            3. 3

              I mostly agree with you, but I do think it’s true that defiance of convention is inherently paying an understandability cost, at least among those who are used to the convention. Of course it can make up for that cost in other ways, including by being easier to write, or by being easier to read by those who are not used to the convention.

              Understandability isn’t a binary thing; interpretive labor is required for any communication. You can make your writing easier or harder for certain audiences to understand, partly by making choices about which conventions to follow.

              Այս մեկնաբանության մնացած մասը տեղադրել եմ հայերեն: Տարօրինակ ընտրություն է, բայց դա ընտրություն է, որը ես ազատ եմ կատարել: Եթե միալեզու հայերեն խոսող լինեիք, գուցե երախտապարտ կլինեիք, որ հեշտացրի ձեզ կարդալ սա:

              1. 4

                Այս մեկնաբանության […]

                I had to use Google Translate’s autodetect feature to figure out that this was Armenian….

                It reminds me of the time when Armenia decided to change their DST schedule (to align with the big Russian Federation shift to permanent summer time) and posted it as an official government pronouncement in Armenian, with no translation. IIRC it took a while for the TZ maintainers to cotton on that the change was coming.

        2. 9

          Besides English, I also read Hindi and a bit of Punjabi. The scripts used to write Hindi and Punjabi – Devanagari and Gurumukhi, respectively – have no notion of letter casing. Most of the time this is not a problem. However, I sometimes come across unfamiliar words in Hindi/Punjabi that I can’t figure out how to parse, most often in technical or political writing, and it’s here that not having a notion of letter casing hurts my reading comprehension.

          Is that unfamiliar word the name of a person? A place? A brand? A language? Somebody’s title? The name of a day, month, or time period? Or just a word I haven’t come across before?

          So I think that in modern English written using the Latin script, letter cases carry a lot of meaning and aid comprehension. The rules might seem arbitrary if English is all you read, but they make a lot of sense. I love Hindi and Punjabi as much as I love English – and in some cases those other languages have more sensible defaults compared to English – but letter casing is one area where English has them beat.

          1. 2

            Another kind of semantic typography is using italics for foreign words, tho as I am writing this comment it occurs to me that it’s less common than it used to be, maybe? I know almost nothing about Japanese, but I gather that katakana is used in a similar way to italics.

            One of the things I don’t like about HTML is replacing the <i> tag with the <em> tag, because not all uses of italics are for emphasis.

            1. 3

              not all uses of italics are for emphasis.

              That is precisely why <i> was never removed from HTML. It was never even deprecated.

              1. 1

                True, though there was a big debate about it http://lachy.id.au/log/2007/05/b-and-i

            2. 2

              Funny you should mention that. Hindi and Punjabi don’t do italics either. You can italicize your words if you’re typing on a computer, but you never see italics in print.

            3. 1

              Surely you can still use <i> even if it’s frowned upon semantically?

              (Cue someone demanding semantic markup for titles, foreign words etc etc)

              1. 4

                <i> is the semantically correct element for titles and foreign words. It’s not frowned upon at all.

                1. 1

                  Certainly this is true for foreign words, but the element for titles of books etc. is cite. (Titles of persons have no semantic markup.)

              2. 2

                That is, in fact, what I do :-) if I am actually writing raw HTML which is pretty rare. Tho similar halfarsed “semantic” markup tends to occur in other formats too, sigh.

        3. 9

          i have a similar story!

          i read a lot of poetry, and always noticed that poems felt warmer and less formal when written in all-lowercase - some of my favorite poets frequently write in lowercase. bukowski, ee cummings.

          it requires more careful composition, because Capital Letters serve as sentence indicators, but i find that breaking up text and speaking concisely in lowercase offers a superior rhythm. it also enhances capitalization’s use as an emphasis tool, rather than a tool used out of strict adherence to rule.

          writing is a fluid process & choosing a style is valid. i (a ~30 year old person) have been mistaken for a teenager many times because of my writing style. perhaps writing in lowercase is a font of youth! ;)

          1. 4

            perhaps writing in lowercase is a font of youth!

            This is accurate. If you type with capital letters and complete punctuation and it hasn’t been set in stone as your style of typing, zoomers will think you are weird, or angry at them.

            1. 5

              Even if it is your style of typing they may think you’re angry at them, I’ve found… and I’m not even much older than them.

              I go back and forth, and it’s often context-dependent. On IRC I’ve never heard of a shift key, ever. Over SMS and Messenger and Matrix and etc. it often depends on what device I’m writing from (phone auto-capitalizes, desktop of course does not) and what mood I’m in/how “proper” of a message I’m writing. In Slack it depends: DMs are always lowercased, company-wide blasts in public channels are written more like blog posts where I use Full Punctuation Minus This Style Of “Proper Noun”-esque Capitalization as emphasis.

              And that, friends, is how you really weird out the folks under about 25 /s (I don’t actually mind their all-lowercase ways; I sit dead-smack on the millennial-genz cusp and associate mostly with the older side of it, but whatever, do what makes you happy. That said I still don’t understand much of the slang vernacular of the folks who solidly fit into the Zoomer years - like age 21 and under. I’ve finally reached “get off my lawn” years, eh?)

              1. 11

                To me, it’s simple politeness. Humans read by doing pattern recognition through a neural network. The rules for sentence structure and grammar are somewhat arbitrary (in English more so than many human languages), but having a set of rules improves the pattern matching. If you use uncommon abbreviations or slang, don’t punctuate properly, or don’t capitalise, then you are saving yourself some effort at the expense of the reader. That is the fundamental core of bad manners: deciding that saving you time is worth costing someone else time. This is far worse in broadcast or multicast communication media, where the cost to the reader is multiplied by a large number of readers.

                If someone doesn’t put the effort in then this tells me either that they don’t care about the impact that it has on their readers or that they haven’t thought about it. Neither reflects well on them. There have been a lot of studies on the impact of punctuation, capitalisation and correct grammar on reading speed and comprehension, so it’s not like this is an area where you can just say that it doesn’t matter. The impact is often much larger for non-native speakers and people on various neurodiversity axes.

              2. 3

                As a zoomer I can relate to that. I mean, look at me now forming Proper Sentences with Proper Punctuation.

                Capital letters serve as anchor points in large bodies of text, improving readability. Hence why they’re good in Lobsters comments or blog posts, where you usually elaborate on a topic in many, many sentences with some amount of fluff to make yourself seem more serious. (yes, this is a self-deprecating joke.)

                In my treehouse (website) I tend to avoid capital letters when starting sentences to create a more informal and friendly atmosphere. They wouldn’t improve readability that much anyways due to the tree-like structure, with each branch being maybe a sentence or two long.

                What I can add to the discussion is that I’m bilingual: I speak Polish and English, and Polish has diacritics. Just like capital letters, I tend to omit those when DMing friends (or when I’m too lazy to tap-and-hold letters on my phone; I don’t use autocorrect or auto-capitalization because it annoys me.) In Polish this can create some ambiguities - in zoomer circles we often joke about “sąd” vs “sad” which is “court” and “orchard” respectively. “spotkamy sie w sadzie” - did they mean “we’ll meet in court” or “we’ll meet in the orchard”?

                I do use diacritics at work though, because effective, unambiguous communication there is much more important than when DMing friends about silly things.

            2. 3

              zoomers

              As a millennial, we acted the same way with AIM, ICQ, and YIM over capitalization + punctuation. I would venture that as you mature this changes, since a) you have to type with coworkers, bosses, and non-friendlies, so it forms a habit, & b) it makes you realize that what you say is easier for a broader audience to understand.

              A possible addition/alternative could be that phone keyboards generally capitalize automatically & you got used to the aesthetic.

        4. 4

          Many scripts aren’t bicameral either. Bicameral scripts are limited to Latin, Cyrillic, Greek, Coptic, Armenian, Glagolitic, Adlam, Warang Citi, Cherokee, Garay, Zaghawa, Osage, Vithkuqi, Old Hungarian, and Deseret.

          Interestingly, a lot of programming languages differentiate code based on casing, which is fine for reading in these scripts but prevents the use of other scripts in code that should otherwise be considered valid (which is to say, a language like PureScript can allow you to have variable names only in the aforementioned scripts; otherwise, without case, it can’t know whether it’s a variable foo versus a type Foo).

          In written languages tho, English’s capitalization of proper nouns quickly lets you know it’s not vocab you need to know but just a name; when I read Thai, I have to ask what words mean just to be told, “it’s just a name”, which does make reading harder. Andit’seventrickiertoreadsincethereisneitherspacesbetweenwordsnorpunctuation (altho folks are supposed to use zero-width spaces, almost no one does since you can’t see them, and they mainly help spelling correction, newlining, etc.; spaces separate clauses).

          1. 1

            Interesting. Thank you.

            a language like PureScript can allow you to have variable names only in the aforementioned scripts; otherwise, without case, it can’t know whether it’s a variable foo versus a type Foo

            Doesn’t it have a type-level subsystem that’s Turing complete? So then you don’t need variables.

      3. 3

        imo matrix introducing end-to-end encryption was a huge mistake. building a distributed modern chat platform is hard enough without having to worry about the insane amount of complexity e2ee represents. that complexity is also passed down to clients, which must implement the e2e spec: https://spec.matrix.org/v1.8/client-server-api/#end-to-end-encryption

        read through the spec - it’s huge and plum full of caveats. and in the end, your matrix server provider could easily drop a little javascript into your client session to exfiltrate your messages anyways.

        it reminds me of “end to end encrypted email”, which is mostly a sham too. all of the security in the world doesn’t matter when you’re ultimately trusting a service provider. the only way to assure real end-to-end security is to do it yourself (i use age these days, but you might use PGP if you’re a masochist).

        EDIT: i say this as an avid user & maintainer of the cyberia.club matrix system for ~4 years. i like matrix & want it to flourish.

        1. 4

          Are there actually multiple implementations of E2EE matrix clients? Last I checked there was one half-baked-looking weechat plugin that was impossible to compile, and Element.

          1. 6

            It all just works for me between Cinny, FluffyChat and Fractal.

            1. 3

              same with nheko on top

    20. 1

      It’s wild to me that the best option for managing your authentication on the Internet is to give your credentials to a small company like 1Password or LastPass. How did that happen? I would sure be more comfortable if it were Google, Microsoft, Mozilla, or Apple managing them for me. Those companies have their problems but I trust they have a better chance at making a product secure.

      (I know the answer: it’s business questions. Going back to Passport, no one trusts the big companies with all the keys to the kingdom. But the current state of things, where some tiny random company has them, isn’t good either.)

      1. 5

        It’s wild to me that the best option for managing your authentication on the Internet is to give your credentials to a small company like 1Password or LastPass. How did that happen?

        But it’s not? There’s BitWarden, where you can self-host either their server, or VaultWarden. There’s pass, there’s KeePassXC, just to name a few.

        (Btw, LastPass ain’t a small company. It’s developed by GoTo, who had over 3k employees and over 1.2 billion USD revenue in 2019 - I imagine both grew substantially since. That is to say, it is bigger and has more revenue than Twitter currently. Granted, they have more products in their portfolio than LastPass, but it’s no small company by any means.)

        I would sure be more comfortable if it were Google, Microsoft, Mozilla, or Apple managing them for me. Those companies have their problems but I trust they have a better chance at making a product secure.

        I’d rather not trust any of them. My passwords are safest when they’re not in the cloud at all.

      2. 3

        I would sure be more comfortable if it were […] Apple managing them for me.

        If you’re all-in on Apple’s operating systems, this is pretty much possible. iCloud can store your passwords (and optionally also TOTP data) in the cloud, end-to-end encrypted.

        1. 1

          Don’t you turn 2FA back into 1FA if you keep TOTP together with passwords?

          1. 1

            Yeah, in some ways. It’s a trade-off between security and convenience. There’s an ongoing debate about it, on which I don’t have a strong opinion. I was mostly trying to make the point that iCloud Keychain is relatively full-featured and can replace apps like LastPass for many users.

          2. 1

            The way I see it, it converts your static 1FA login into a time-sensitive 1FA login, which is still strictly better than ONLY a password.

            It does mean you have to protect your PC against infiltration as otherwise someone could sniff the TOTP password… but frankly that’s already true when you rely on a password manager.

          3. 1

            It depends on the threat model. The threat model for TOTP is normally that the endpoint is compromised. For example, if there’s a browser vulnerability then someone could exfiltrate your password, but if they steal a TOTP password then they need to use it immediately or it stops working. This becomes 1FA if an attacker who can compromise the bit of the channel that holds the password can also compromise the TOTP codes.

            I wasn’t aware Keychain worked with TOTP, but if it does then it depends somewhat on how they implement it. For WebAuthn, the keys never leave the secure element (well, not as plaintext; the iCloud thing does key exchange between two SEs and sends the encrypted keys from one to another), so someone who compromises the browser can at most perform signing as the user, though I believe each request has to be accompanied by a fingerprint or face scan and there’s a secure path from the fingerprint reader to the SE. If TOTP is handled there as well, it’s probably fine. If it’s handled in the keychain daemon then someone with root access on the client can get the TOTP codes as well as the password.

      3. 2

        I think Google’s and Apple’s “solution” to this problem was that companies should just let them handle the authentication, AKA Login with Google/Apple. Considering how trigger-happy both are about banning users, I wouldn’t trust that either (not to mention the privacy concerns).

        The big players are uninterested because it’s all risk, low reward. One screw-up and the entire brand’s security reputation is permanently trashed (like LastPass), and you’ll probably get class-actioned or arbitrated to hell and back. Do it well and you get a slew of customer-support burden and a small stream of cash at best (compared to their main money makers).

        1. 3

          Apple’s iCloud Keychain works (even on Windows, and under Chrome as an extension) as a general-purpose password manager. I don’t use it, so I can’t comment on how well it works, but it probably works just fine.

        2. 3

          Passkeys don’t require a third-party gatekeeper. They are just private keys stored on your client (ideally in some form of HSM). The bit that does require Apple currently is syncing them between devices. This is quite hard to do securely because you want one HSM to provide an encrypted copy of the keys to another HSM (encrypted with a public key corresponding to the other HSM’s private key), but you don’t want that to be something an attacker can do if they have temporary access to your device. Apple does some key exchange via the iCloud infrastructure and requires you to be logged in on both devices, so at least an attacker would need to compromise both and persuade you to go through the biometric auth steps on both. I’d love to see an open standard for this, but I have no idea how to design something that is both usable and secure.

        3. 2

          I think Googles, and Apples “solution” to this problem was that companies should just let them handle the authentication AKA Login with Google/Apple.

          Maybe it’s that… but they’re also both pushing passkeys IIRC.

      4. 1

        I would sure be more comfortable if it were Google, Microsoft, Mozilla, or Apple managing them for me. Those companies have their problems but I trust they have a better chance at making a product secure.

        Others have noted that Apple does offer a password manager, at least “If you’re all-in on Apple”. I’ll note that so does Google (if one uses Chrome everywhere), I think Mozilla at least used to (although I wouldn’t put them in a list of big companies with strong security), and I assume Microsoft at least has something for enterprise use.