1. 29
  1. 4

    As browsers expose more APIs, exposing the Unicode data they already need for their own regexp implementations, sorting, etc. seems like a reasonable thing to want. (There are already substantial i18n APIs.)

    Many wasm programs need to process text, and all of them need the same data to do so. Sometimes following the browser’s idea of Unicode is an added benefit: an old program gets ‘free’ Unicode version updates, or, conversely, its grapheme clustering matches the browser’s even when the browser is not up to date.

    1. 2

      (Author here) I would love this. I think that even just improving the existing Intl.Collator API could be good enough. The problem with it currently, from what I have seen, is that its actual behavior is not formally specified. That is, it works well if you’re, say, trying to sort strings in the user’s locale - which is what the API was designed to do - but if you expect those results to be consistent across browsers you’re out of luck and will quickly run into issues like [0].

      If the behavior was consistent across browsers, I think that’d be good enough to rely on the Intl APIs in general. But yeah, differing behavior for say a regexp engine across browsers is.. well.. not desirable, to say the least.

      [0] https://stackoverflow.com/questions/33919257/sorting-strings-with-punctuation-using-intl-collator-is-inconsistent-across-brow
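
      A minimal sketch of the contrast, not taken from the linked question: the word list and the `compareCodeUnits` helper are made up for illustration. `Intl.Collator` ordering comes from whatever collation data the engine ships, so where punctuation lands can differ between browsers; a plain UTF-16 code unit comparison is fixed by the language spec, but it is not linguistically meaningful.

      ```typescript
      // Illustrative input mixing punctuation and letter case.
      const words: string[] = ["_b", "a", "-c", "B"];

      // Locale-aware comparison. The resulting order depends on the collation
      // data bundled with the engine, so it may vary across browsers.
      const collator = new Intl.Collator("en");
      const localeSorted: string[] = words.slice().sort(collator.compare);

      // Hypothetical helper: compare by UTF-16 code units. The relational
      // operators on strings are defined by ECMAScript itself, so every
      // conforming engine gives the same answer.
      function compareCodeUnits(a: string, b: string): number {
        return a < b ? -1 : a > b ? 1 : 0;
      }
      const stableSorted: string[] = words.slice().sort(compareCodeUnits);

      console.log(localeSorted); // engine-dependent order
      console.log(stableSorted); // ["-c", "B", "_b", "a"]
      ```

      The second ordering is reproducible everywhere, but it puts "B" before "a", which is exactly why you would want a collator with specified behavior instead.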

      1. 1

        Oh, wow, Collator being almost right is kind of a maddening snatching-defeat-from-the-jaws-of-victory situation.

        Zorex looks cool! A regex-like language that does things Everyone Knows Regexes Can’t Do seems like it could be useful in a lot of places.

    2. 2

      I’m really curious how well one of the neural network compression tools would work for something like this. It feels like exactly the kind of thing that they’re designed for.