So the TL;DR is that emacs-lsp sends invalid UTF-16 character offsets on text changes: it sends UTF-8 offsets, when the LSP spec says UTF-16. rust-analyzer tries to use these invalid offsets to manipulate internal data and panics, because it ends up trying to change things inside a multibyte character. That only happens on certain multibyte input, for obvious reasons (the bottom emoji). The fix would be UTF-8 opt-in (a new LSP option), or sending correct UTF-16 offsets. Fixes exist, but are in upstreaming limbo. Hope I got this right.
That is correct.
Also fun from the article: Unicode normalization/denormalization (“é” can be one or two codepoints, if you like combining diacritics). Big emojis are small emojis zero-width-joined together. ECMAScript can’t decide between UTF-16 code units and Unicode codepoints (indexing vs. iteration).
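To make those three claims concrete, here’s a minimal Rust sketch (mine, not from the article) counting the same text three different ways:

```rust
fn main() {
    // "é" precomposed (U+00E9) vs. decomposed ("e" + combining acute, U+0301):
    let nfc = "\u{e9}";
    let nfd = "e\u{301}";
    assert_ne!(nfc, nfd);               // different code point sequences...
    assert_eq!(nfc.chars().count(), 1); // ...one code point here,
    assert_eq!(nfd.chars().count(), 2); // ...two there, same rendered glyph.

    // "Man facepalming" is a ZWJ sequence: facepalm + ZWJ + male sign + VS-16.
    let emoji = "\u{1F926}\u{200D}\u{2642}\u{FE0F}";
    assert_eq!(emoji.chars().count(), 4);        // Unicode code points
    assert_eq!(emoji.encode_utf16().count(), 5); // UTF-16 code units (JS .length)
    assert_eq!(emoji.len(), 13);                 // UTF-8 bytes
}
```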
lsp-mode’s auto-download doesn’t seem to work if you use rustic, at least, and it falls back to RLS, which is completely deprecated. Emacs considers “characters” to be 22 bits wide: Unicode’s 21 bits, plus 1 bit to make room for “raw bytes”.
There’s also a lightning review of Zig hidden in plain sight.
I really liked the Unicode/UTF-8/UTF-16 visualizations. I wish they existed as some kind of website where you could look up a character and see how it’s represented in the different Unicode encodings.
I definitely noticed the Zig intermezzo – the code required is very verbose. It reminded me of your “Twitch” article mentioning burnout, and made me think that you’re probably also “locked in” to Rust content, as the thing that made your articles self-sustaining.
I haven’t really used LSP, but the gist I get is that everyone is vaguely embarrassed by the fact that the spec says everything has to be UTF-16, and instead of fixing it, most people just kind of pretend it’s already been fixed, because the alternative of actually having to think about non-UTF-8 text gives people traumatic flashbacks to the 1990s and no one has time for that?
Like… the specified behavior is also hilariously the wrong thing, so no one wants to actually defend it and insist on following the spec?
[edit: I hadn’t made it to the end of the article, where it says the spec now allows negotiation on this, as of a few months ago]
Not sure about the “instead of fixing it” bit. rust-analyzer (following clangd) has supported sane UTF-8 offsets since forever. I’d personally be perfectly happy if clients supported only UTF-8 (which would be a de jure violation of the spec), as long as they properly advertised UTF-8 support.
That’s what I meant by pretending it’s already been fixed: ignoring the spec and doing the right thing anyway by counting codepoints instead of using UTF-16 offsets. IMO that’s the right thing to do, but at the same time “please ignore the spec because it’s very stupid” is a hard argument to make.
I actually made that argument a while ago, when every LSP implementation started to realize that the spec mandated UTF-16 offsets (while mandating UTF-8 document encoding…). At that point in time most implementations were either using UTF-8 byte offsets or codepoint offsets and we could have unified on something sensible while pressuring Microsoft to fix the spec to what implementations were actually doing instead of what happened to be convenient to them. Unfortunately that did not happen and every LSP implementation now has to contain the same unnecessary complexity to be compliant. The recent change to LSP added support for alternate offset encoding but still mandates that UTF-16 must be supported.
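To make that “unnecessary complexity” concrete: a server that stores text as UTF-8 has to do a dance like the following for every incoming position (a sketch with a hypothetical helper name, not rust-analyzer’s actual code):

```rust
/// Convert an LSP column given in UTF-16 code units into a byte offset
/// into `line`, which we store as UTF-8. Returns None when the offset
/// points inside a code point (mid surrogate pair) or past the end --
/// exactly the kind of bogus offset that made rust-analyzer panic.
fn utf16_col_to_byte_offset(line: &str, utf16_col: usize) -> Option<usize> {
    let mut units = 0;
    for (byte_idx, ch) in line.char_indices() {
        if units == utf16_col {
            return Some(byte_idx);
        }
        units += ch.len_utf16();
    }
    // The offset may also point exactly at the end of the line.
    (units == utf16_col).then_some(line.len())
}
```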
It’s a bit hard to fix the spec, as it isn’t collectively maintained and the upstream isn’t particularly responsive. Historically, just doing the thing ahead of the spec and arguing “this is already how things work in the wild” was the most effective way to move forward. Which is sort of what happened with position-encoding as well!
Sounds like the LSP protocol itself is another horror show of outdated definitions and reality vs spec fights.
Well it was invented by Microsoft.
UTF-16 code units are still how quite a lot of things in the web platform are specified, largely because that’s JavaScript’s closest thing to a “character” type, and JS is the native language of that platform. So things like setting a max length of 10 on an HTML form input means “10 UTF-16 code units”, not “10 bytes” or “10 code points” or “10 graphemes”.
Though curiously some parsing algorithms are still specified in terms of code points and not code units, which means implementing them in a “just use UTF-8” language can be trickier than expected. For example, the HTML5 legacy color parsing algorithm (the one that can turn any string, even “chucknorris”, into an RGB color value) requires identifying code points at specific positions as well as being able to perform replacements based on code point values.
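A toy illustration of the mismatch: in a “just use UTF-8” language, “the code point at position i” is a linear scan rather than an array index (hypothetical helper, for illustration only):

```rust
// UTF-8 is variable-width, so byte indexing can land mid-character and
// O(1) code point indexing doesn't exist; you have to walk the string.
fn code_point_at(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i) // O(n) scan
}

fn main() {
    assert_eq!(code_point_at("chucknorris", 5), Some('n'));
}
```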
“UTF-16 code units are still how quite a lot of things in the web platform are specified”

And that would be relevant if the context had anything at all to do with the web, rather than a supposedly language-agnostic protocol! But they made a big mistake and let implementation details leak into the spec, and it’s taken them years to admit it.
As the post explains, LSP is based on JSON-RPC, which ties it firmly back to the web domain (JSON-RPC being built around JSON, which in turn is built on top of JavaScript). Plus LSP itself was originally developed for VS Code, an Electron app, which likely had some influence on the selection of things like JSON, JS, etc.
That’s what “let implementation details leak into the spec” means.
I’m not at all sure it is an “implementation detail”, unless JSON itself is considered a detail that shouldn’t leak into the spec. Which would be weird, since the data format is usually supposed to be part of a spec.
(where I’m going with this, ultimately, is that JSON is still more complex than people generally realize – even though the 2017 JSON RFC waved its hands in the direction of saying JSON ought to be UTF-8, it did so in a way that left loopholes for protocols like LSP to fall back to “JSON is JS is UTF-16”; plus, the RFC itself still has UTF-16-isms in it, notably in the escape syntax, which requires surrogate pairs for non-BMP code points)
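That escape-syntax point is easy to demo: JSON’s \u escape names UTF-16 code units, so anything outside the BMP has to be spelled as a surrogate pair even in UTF-8-encoded JSON. A quick check (using serde_json purely as an illustration):

```rust
// "\uD83E\uDD26" is the surrogate pair for U+1F926 (person facepalming);
// JSON has no direct escape for code points above U+FFFF.
fn main() {
    let v: serde_json::Value = serde_json::from_str(r#""\uD83E\uDD26""#).unwrap();
    assert_eq!(v.as_str(), Some("\u{1F926}"));
}
```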
LSP doesn’t encode JSON in UTF-16, though. It uses UTF-8 on the wire.
I phrased that badly – the idea was just to point out that JSON explicitly allows a protocol to use UTF-16 (or, really, any other encoding), and that JSON’s origin in the web domain means “the web domain is irrelevant” is wrong.
LSP is caring about UTF-16-y things in a different way.
But this is turning into a super-deep tangent just for what I meant as a throwaway comment about how UTF-16 isn’t as dead as people seem to want to think it is.
The spec is supposed to let you switch between UTF-8 and UTF-16, but it’s an LSP extension and both ends have to support it.
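Concretely (per LSP 3.17, if I’m reading the spec right): the client lists the encodings it can handle in its general.positionEncodings capability, the server picks one and announces it as positionEncoding, and UTF-16 stays the mandatory fallback. Sketched with serde_json just to show the shape of the handshake:

```rust
use serde_json::json;

fn main() {
    // Client capability: encodings the editor supports, in order of preference.
    let client_capabilities = json!({
        "general": { "positionEncodings": ["utf-8", "utf-16"] }
    });

    // Server's initialize result: the single encoding it chose for this session.
    let server_capabilities = json!({
        "capabilities": { "positionEncoding": "utf-8" }
    });

    println!("{client_capabilities}\n{server_capabilities}");
}
```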
I can’t speak to the Rust side of this post, but as far as Emacs goes, two really useful key combos to know are:

C-x 8 RET (insert-char)
C-u C-x = (prefixed what-cursor-position)

The former lets you insert various Unicode characters by either name or hex code. When you hit it, it will prompt you for which character, and you can type something like SMI and hit tab to see a list of smileys.

The latter will pop up a help buffer showing you tons of information about the character the point is on: the Unicode codepoint, name, and categories, the encoding, the font the displayed glyph was actually pulled from, any properties like syntax highlighting, etc. If you do it without the C-u prefix and hit only C-x =, then you just get a short one-line summary in the modeline.

Another thing I use surprisingly often is a convenience function that runs M-x occur RET [[:nonascii:]] RET, to quickly check for and locate things like stray curly quotes that have been inadvertently pasted into a file.

What’s surprising (or maybe expected, but still interesting) is that there isn’t a PR against lsp-mode which fixes the issue. The issue isn’t something super-involved algorithmically, a bunch of people hit it, and this is a scratch-your-own-itch dev tool, not some obscure, complicated SDK. It feels like an abstract someone should’ve fixed it years ago.
It feels like there’s a huge gap between “uses Emacs and can cobble together an init.el based on the wiki” and “can contribute non-trivial patches to packages”. I am certainly only in the former category!
There is such a PR, it just hasn’t been merged yet. I imagine it’s not optimal in some way, otherwise it would’ve been an easy merge — but this is speculation, I’ve read it several times and am still trying to fully make sense of it: https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210
It’s not a PR, it’s a random WIP commit in the maintainer’s branch. The difference is huge.
There are cases where maintainers are unresponsive to work being done elsewhere for good/bad/life reasons, but I wouldn’t classify this case as such.
Both are merged now.
The first half of this article is basically why I scorn the use of rust-analyzer or any other language-server thing with nvim. The amount of work you have to put into it to get the actual features you’d want out of it is just not any fun, and the entire workflow breaks and has to be re-derived at least every 6 months.

Neovim 0.8 added some new features aimed at making “vanilla” (i.e. sans plugins) LSP setup a bit easier: https://zignar.net/2022/10/01/new-lsp-features-in-neovim-08/
Good to know, though I also tried out Helix and I’m a little taken with it so far. We’ll see if I stick with it.
I’m still using vim, but adding LSP servers there is pretty trivial. I installed the ALE plugin, then added one line of config telling it where the binary for the server is and another telling it which file types to use that server for. After that, I get error reporting, autocompletion, and so on, in vim. My entire .vimrc is a bit too big to fit in a single default terminal window without scrolling, but only because I share it across multiple machines on different platforms and it has a bunch of conditionals to deal with that (and I want it to use a different clangd for CHERIoT).
I have hit this bug too. So annoying. I did not, however, write an epic blog post about it, because oh god.