Ah, UTF-16.
Some years ago (2019, it turns out!) I noticed my mail client was mangling payroll emails from Xero. Turns out, they were specifying UTF-8 text in the MIME type, then cheerfully delivering UTF-16 text, presumably from a .NET service. Took a while but that eventually got escalated to the devs where it was (I assume) a simple fix.
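A minimal sketch of what that kind of mismatch looks like in practice (the payload string is invented): bytes that are really UTF-16LE, decoded with the UTF-8 charset the headers promised, come out riddled with NUL characters.

```typescript
// Sketch: a body that is really UTF-16LE, decoded with the charset the MIME
// headers claimed (UTF-8). The standard TextEncoder/TextDecoder APIs are enough
// to show the mangling; the payroll string itself is made up.
const original = "Payslip for March";

// Encode as UTF-16LE by hand: each code unit becomes two little-endian bytes.
const utf16le = new Uint8Array(original.length * 2);
for (let i = 0; i < original.length; i++) {
  const unit = original.charCodeAt(i);
  utf16le[i * 2] = unit & 0xff;
  utf16le[i * 2 + 1] = unit >> 8;
}

// What a mail client trusting `charset=utf-8` displays: every other byte is NUL.
console.log(JSON.stringify(new TextDecoder("utf-8").decode(utf16le)));
// "P\u0000a\u0000y\u0000s\u0000l\u0000i\u0000p\u0000 ..."

// What it should have done, ignoring the declared charset:
console.log(new TextDecoder("utf-16le").decode(utf16le)); // "Payslip for March"
```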
I suspect this must happen a lot at Microsoft shops when interfacing with the real world ;)
I think this has nothing to do with Windows and Microsoft: the root cause here is that JS strings are UTF-16, and VS Code uses Electron — not something pioneered at Microsoft.
Yeah, but everyone else acknowledges it as a mistake you need to work around as soon as possible and would never allow a design bug like that to make it into the final spec.
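For context, LSP positions count the column in UTF-16 code units by default, so a server that indexes text as UTF-8 bytes or code points has to convert at every boundary. A small sketch of the disagreement (the line of Rust is just an example string):

```typescript
// Three answers to "what column is the closing quote at?" for the same line.
// LSP's Position.character wants UTF-16 code units by default; servers often
// store UTF-8 bytes or code points internally, so they must convert.
const line = 'let s = "🦀 crab";';

const before = line.slice(0, line.indexOf('";'));           // text before the closing quote
const codeUnits = before.length;                            // UTF-16 code units (what LSP expects)
const codePoints = [...before].length;                      // Unicode code points
const utf8Bytes = new TextEncoder().encode(before).length;  // UTF-8 bytes

console.log({ codeUnits, codePoints, utf8Bytes });
// 🦀 is one code point, two UTF-16 code units, and four UTF-8 bytes, so the
// three counts disagree as soon as one such character appears on the line.
```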
If I wanted to solve the general state synchronization problem, I’d spend some time studying https://github.com/JetBrains/rd in detail. My prior is high that it’s good stuff.
Come join the work on BABLR, the first serious attempt to replace LSP with something much, much (much) better.
Let’s review how BABLR is better:
It addresses the problem by providing the tools needed to build editor experiences which can be powered by semantic and syntactic understanding of arbitrary languages
It ensures that a language only needs to be defined once to be able to power many if not most language-aware experiences
It focuses on defining semantics over presentation, because specific presentations are arbitrary and unstable
It is a truly open project
It has a small and highly focused specification
It uses immutable data structures as its source of truth, making concurrency easy (see the sketch after this list)
State synchronization is not needed at all
Text encoding is compatible with Javascript
It enables an incredibly rich interaction model that will make VS Code look like a child’s toy
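A minimal sketch of what the immutability point above buys (the shapes here are hypothetical, not BABLR’s actual API): every edit produces a new root that shares untouched subtrees, so readers never have to be synchronized with writers.

```typescript
// Hypothetical immutable syntax tree; not BABLR's real data structures.
interface TreeNode {
  readonly type: string;
  readonly children: readonly (TreeNode | string)[];
}

// "Replace the i-th child" builds a new node; nothing is mutated in place.
function withChild(node: TreeNode, i: number, child: TreeNode | string): TreeNode {
  const children = node.children.slice();
  children[i] = child;
  return { type: node.type, children };
}

const v1: TreeNode = {
  type: "Call",
  children: [
    { type: "Identifier", children: ["print"] },
    { type: "String", children: ['"hi"'] },
  ],
};

const v2 = withChild(v1, 1, { type: "String", children: ['"hello"'] });

console.log(v1.children[0] === v2.children[0]); // true: untouched subtrees are shared
console.log(v1.children[1] === v2.children[1]); // false: only the edited path is new
```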
It’s interesting that you say it’s better to model the semantics of the language. The LSP consensus seems to be that it’s better to model what the editor displays.
Yeah, I wish I was wrong on this one, but I am skeptical about the usefulness of cross-language semantic models.
This is based on the success of LLVM in building a universal backend and on the un-success of JetBrains in building a universal front-end. There are a couple of semantic abstractions in IntelliJ, and they work well for any language, as long as that language is Java.
That’s the thing: most languages are the same from the back end, but fundamental differences in front-ends are precisely the reason we have different languages.
I wish LSP was less of a least common denominator, and that it was easier to extend it with language-specific and editor-specific features. Which is the case technically, but not organizationally.
Basically, someone unaffiliated with Microsoft should maintain an lsp-extras repository, which all servers and clients use to document their custom extensions, and which also surfaces common extensions shared by multiple parties.
This isn’t happening for a neutral reason (solving a coordination problem is hard) and for a bad reason (one would think that one could just contribute upstream).
I’ve just realized the irony that, technically, LSP is extremely well positioned to get embraced, extended and extinguished, but that the OSS community is incapable of taking advantage of that (I am of course one of the responsible non-doers here)
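As a sketch of what an lsp-extras entry could look like, written the way the LSP spec documents its own messages; the method name and types below are hypothetical, not an existing extension:

```typescript
// Hypothetical entry for an lsp-extras registry. A server advertises the custom
// method under the spec's catch-all `experimental` server capability, e.g.
//   { "capabilities": { "experimental": { "exampleServer/expandMacro": true } } }

// Method: "exampleServer/expandMacro" (direction: client -> server)
interface ExpandMacroParams {
  textDocument: { uri: string };
  position: { line: number; character: number }; // UTF-16 code units, as in core LSP
}

interface ExpandMacroResult {
  name: string;      // the macro being expanded
  expansion: string; // the expanded source text
}

// What the request looks like on the wire:
const exampleRequest: { jsonrpc: "2.0"; id: number; method: string; params: ExpandMacroParams } = {
  jsonrpc: "2.0",
  id: 1,
  method: "exampleServer/expandMacro",
  params: {
    textDocument: { uri: "file:///src/lib.rs" },
    position: { line: 3, character: 8 },
  },
};
```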
There are a lot of things that are common across languages. Is this token a keyword or an identifier? If it’s an identifier, does it name a type or a value? If it names a type, your language might have different categories of type, what are they?
I want to be able to configure my editor to say ‘keywords are red, comments are green’ (for example) and have the language-specific logic annotate tokens with enough information that the editor can put them in these generic categories and expose some UI that says ‘this language has this extra category of thing. It’s a special kind of type, how do you want to render that?’. By default, maybe Objective-C classes and ML records are both highlighted as types, but the LSP-equivalent for each language could tell the front end that it has these extra kinds of type and maybe I want to have finer-grained colouring.
If it’s an identifier, does it name a type or a value?
And then you have Zig, where the answer is “yes”, and the question you want to ask is “is this comptime?”.
I want to be able to configure my editor to say ‘keywords are red, comments are green’
That’s presentation, and that’s exactly how LSP works (and how, I argue, a similar thing should work). There isn’t a request that returns an annotated syntax tree. There’s a request that returns syntax highlighting information, which tells you that, for the purposes of syntax highlighting, this token should be considered a keyword. This makes it very cheap to model extensions, like having moves underlined in your editor. Modeling moves at the syntax highlighting level is easy: you say this range is annotated with a custom tag rust.move, and then a rust-knowing color theme can pick it up. Modeling moves in a language-agnostic semantic manner is much harder, because the semantic model likely just doesn’t have an obvious place to describe what a move is.
That’s the thing: you model syntax highlighting, and not the AST, in the protocol.
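Concretely, this is roughly how the custom-tag approach plays out with LSP semantic tokens; the legend below is illustrative, and the rust.move-style modifier is not one of the predefined names:

```typescript
// The server declares a legend; anything beyond the predefined names (keyword,
// comment, ...) is just another string that a theme may or may not know about.
const legend = {
  tokenTypes: ["keyword", "comment", "variable"],
  tokenModifiers: ["declaration", "rust.move"], // custom modifier a Rust-aware theme can pick up
};

// Tokens are encoded as flat runs of five integers:
// [deltaLine, deltaStartChar, length, tokenTypeIndex, tokenModifierBitset]
const data = [
  0, 0, 3, 0, 0,    // line 0, col 0, "let"       -> keyword
  0, 4, 1, 2, 0b01, // line 0, col 4, "s"         -> variable, declaration
  2, 8, 1, 2, 0b10, // two lines down, col 8, "s" -> variable, rust.move
];

// The client maps indices back through the legend; a theme can then say
// "underline anything tagged rust.move" without the protocol ever defining
// what a move is.
console.log(legend.tokenTypes[data[3]]); // "keyword"
```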
Isn’t the syntax highlighting model kind of just an alternative representation of the AST? And we treat that data as kind of untrustworthy and second-rate. The user is allowed to see it but not interact with it. Comments are highlighted grey, but you can’t do a code search that omits results in comments.
I believe there are reasonable solutions that involve only a single model, and I think there are some very strong benefits that come from offering APIs that deal directly in data structures. The alternative is to keep all the sources of truth behind a facade or REST API, which creates the kinds of systems that have nightmarish state-syncing concerns.
Not really:
syntax highlighting needs semantics, it’s not syntax driven.
there’s not necessarily a single well-defined AST for a particular region of text. Three cases in Rust where you have several ASTs:
the same file could be included several times into compilation, each giving a separate AST
a documentation comment, containing markdown, containing Rust code (a single token here belongs to three different ASTs)
an argument of a macro is both a token tree (just a balanced sequence of parentheses), and whatever syntax node it is parsed as in the expansion of the macro.
My gripe with syntax highlighting as a feature is that it takes the complicated kinds of situations you’re talking about and flattens them all out. When you have no semantic model you’re left taking text ranges and trying to map them onto something (e.g. a character -> color), which is generally going to be an oversimplification for exactly the reasons you are mentioning.
When you do have a semantic model it’s not any trouble for the model to consist of a Rust tree parsed from stringish content stored in a markdown tree which itself is parsed from stringish doc comment content in a Rust tree.
If you insist that it isn’t possible to highlight without semantics you’re asserting that the best possible editor would be too stupid to understand literally the first thing about syntax or code, which doesn’t square with the most basic common sense. If I’m fluent in a language I won’t have to keep telephoning my friend Steve to say, “HEY STEVE, COULD YOU REMIND ME AGAIN IS ‘GET’ A VERB OR A NOUN IN ENGLISH”
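A hypothetical sketch (not any tool’s real schema) of the nested shape described above, where the text of a Rust doc comment parses as markdown and the fenced block inside it parses as Rust again:

```typescript
// One region of text, three trees. `embedded` holds a re-parse of a node's
// text in another language; nothing is flattened to "a color per character".
interface Tree {
  language: "rust" | "markdown";
  type: string;
  text?: string;     // raw text for leaves
  children?: Tree[]; // child nodes
  embedded?: Tree;   // re-parse of this node's text in another language
}

const docComment: Tree = {
  language: "rust",
  type: "DocComment",
  text: "/// Adds one:\n/// ```\n/// let y = x + 1;\n/// ```",
  embedded: {
    language: "markdown",
    type: "Document",
    children: [
      { language: "markdown", type: "Paragraph", text: "Adds one:" },
      {
        language: "markdown",
        type: "FencedCodeBlock",
        text: "let y = x + 1;",
        embedded: {
          language: "rust",
          type: "LetStatement",
          children: [
            { language: "rust", type: "Identifier", text: "y" },
            { language: "rust", type: "BinaryExpr", text: "x + 1" },
          ],
        },
      },
    ],
  },
};

console.log(docComment.embedded?.children?.[1]?.embedded?.type); // "LetStatement"
```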
We are having this conversation on the internet, in a web browser – the universal-est frontend of them all.
This is why I am not convinced by the argument that the intricacies present within the huge tree of families and subfamilies of related languages should mean that it is not possible to build any reusable semantic tools. The only real question is how reusable any given tool would be. In a workshop full of tools it is not strange that many of them have use only in specific circumstances, or are even perhaps tools that are only useful in the context of interacting with other purpose-specific tools.
The power of something like the web frontend is the tremendous multicultural ecosystem of tools and frameworks of every kind that are all able to interoperate as part of the same ecosystem because of their shared definition in terms of HTML and JS (and many of which contain their own thriving internal ecosystems).
But the browser only works as a universal frontend because site authors have near-complete control over presentation.
I agree, but my point is that I am also copying that part of the patterning. I give near complete control of the parser VM to the authors of parsers, and I give near complete control of the data structure to the tools which need to hold data.
They do that with the Debug Adapter Protocol as well, and it’s awful. Instead of formalizing the mechanisms actually used by debugger-provided features, you get some distorted view that seems directed at building specifically what VS Code needs from some UI perspective (eval expression hover?!). This leaves obviously glaring holes, like the ability to readMemory but no way to query what is mapped, which symbols are available for a scope, etc. Signals, what are those? Subprocess controls? Nope, no need for that in MsWin, etc. You end up writing debugger-specific ‘repl’ workarounds for basic things, and the point of a protocol is moot.
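For reference, this is roughly what the DAP readMemory request looks like on the wire; the point above is that there is no sibling request for listing which regions are mapped or which symbols are visible in a scope:

```typescript
// You can read bytes at a memory reference you already hold...
const readMemoryRequest = {
  seq: 41,
  type: "request",
  command: "readMemory",
  arguments: {
    memoryReference: "0x7ffee4c010a0", // handed to you by some earlier response
    offset: 0,
    count: 64,
  },
};

// ...and get back a base64 blob, but there is no "list memory regions" or
// "enumerate symbols in scope" request to discover these references yourself.
interface ReadMemoryResult {
  address: string;
  unreadableBytes?: number;
  data?: string; // base64-encoded bytes
}
```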
I think that aspect of it rather benefits Microsoft, don’t you? It essentially means that nobody can use their technology to build a tool that looks or works substantially different than theirs does, and thus which would be a potential competitor.
If your goal was to create an ecosystem of rich competition among tooling developers your only logical course of action would be to expose to them as much accurate detail and as much accurate abstraction as you could possibly provide. This would facilitate the building of rich, arbitrary experiences which could give users tools that understand whatever they can understand about a language, thus enabling them to do any kind of thing they could imagine being able to do – answer any query, specify any change…
For insignificant me, the barrier to replacing LSP is very high. I liked the idea from the start, liked the documentation and was impressed by the end result where a language server I wrote worked excellently in Vim and VS Code.
It is very high!
Clearing such a high barrier requires actually being able to build the better system and show people better tools. I will have done my job correctly if you want to come implement your language on my system because of all the amazing benefits you’ll get in terms of having your language supported in many BABLR-powered tools.
Do you have mirrors? Otherwise we’re still stuck on a Microsoft service to talk & code about adoption of the non-Microsoft language protocol.
Most of the chatter happens in a Discord server. I could probably host a mirror of the current git repos, but my bigger priority is to move the source control past git.
Oof. Another closed platform – assuming that is hooked up to a gateway?
I’d totally support using a simpler version control tho. I’ve been having a good time with Darcs recently & still want to follow the Pijul story if the forge situations & other tooling happen to mature a bit more.
No gateway either yet, sorry. Some aspects of the engineering I take a very principled approach to, and the rest of the time I just use what works and is in reach. What would you want to see a gateway to?
And yes, I too am closely following the Pijul story : )
https://git.sr.ht/~laumann/ani
Caught my eye today. Since I don’t see C bindings from the Rust side, if this project can succeed it would be more portable, since every language has C FFI/bindings.
I concur as a maintainer of a language server and a past maintainer of a client (emacs-lsp).
I implemented LSP semantic tokens over the weekend and just discovered a nasty refresh issue (perhaps discussed here: https://lobste.rs/s/j3ed91/ccls_lsp_semantic_tokens).
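For anyone hitting the same thing: the refresh mechanism involved is a server-to-client request added in LSP 3.16, gated on a client capability. A rough sketch of the two pieces:

```typescript
// Server -> client request asking the client to drop cached semantic tokens
// and re-request them (it carries no params):
const refreshRequest = {
  jsonrpc: "2.0",
  id: 7,
  method: "workspace/semanticTokens/refresh",
};

// The server may only send it if the client advertised support for it:
const clientCapabilities = {
  workspace: {
    semanticTokens: { refreshSupport: true },
  },
};
```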
Doesn’t the JSON-RPC used by LSP have the problem that it’s basically JSON-RPC 1.0 while claiming to be JSON-RPC 2.0?