You don’t need incremental parsing even. Parsing from scratch is fast enough most of the time.
IntelliJ doesn’t generally do incremental parsing, and yet it has full access to syntax tree all the time.
An incremental parser is helpful, but it is by no means table stakes.
What you need, though, is a resilient parser: a parser that can deal with incomplete syntax. This is table stakes, and not typically available.
Other than that, I agree with the thesis: almost everything you can get with a structural editor you can get faster and cheaper by just writing a parser which doesn’t bail on the first error, and you get all the text-editing goodies like multiple cursors for free.
See https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html for an explanation of resilient parsing that works in production.
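To make “doesn’t bail on the first error” concrete, here is a toy Python sketch (my own illustration, not the code from the linked tutorial): on a malformed statement the parser records an Error node and resynchronizes on the next ‘;’ instead of giving up.

    # Toy resilient parser: never aborts, emits Error nodes and keeps going.
    def parse_stmts(tokens):
        tree, i = [], 0
        while i < len(tokens):
            # well-formed statement: let NAME = EXPR ;
            if tokens[i] == 'let' and i + 4 < len(tokens) \
                    and tokens[i + 2] == '=' and tokens[i + 4] == ';':
                tree.append(('Let', tokens[i + 1], tokens[i + 3]))
                i += 5
                continue
            # error recovery: wrap the junk in an Error node, skip to the next ';'
            start = i
            while i < len(tokens) and tokens[i] != ';':
                i += 1
            tree.append(('Error', tokens[start:i]))
            i += 1  # consume the ';' if present
        return tree

    print(parse_stmts(['let', 'x', '=', '1', ';',
                       'let', '=', ';',              # incomplete statement
                       'let', 'y', '=', '2', ';']))
    # [('Let', 'x', '1'), ('Error', ['let', '=']), ('Let', 'y', '2')]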
In some sense, one can do anything in manually written parsers if one has the skill and the patience to do so. I hope I’m not claiming that Wagner’s approach does something that’s impossible in other settings!
That said, Wagner’s approach does have some cool things that can be harder (not impossible, but harder) in a “reparse everything” setting. For example, its approach to error recovery can use “history”: if you make a program syntactically invalid it can use a past parse of that part of the program to try and fix it, and then move on. I don’t think that’s the only thing one needs to do for error recovery – it doesn’t work very well for new portions of a program, for example – but it’s an example of something that requires some sort of memory / caching.
I think there are cases where manually written parsers take less skill and patience
i.e. not everyone has a good time writing a Treesitter grammar – it depends on what you’re trying to parse
https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing#c_bfbtbr (thread about Markdown and shell and the Helix editor, also linked in my top-level comment)
I didn’t try the resilient LL(1) technique, but it seems like it could be a good mix of powerful and expressive vs. easy to write. Obviously it again depends on the language
I am trying to decide if shell is a special case or not, as far as the complexity of the syntax … I think JavaScript is starting to resemble shell, because you have JSX literals in JS, and then you have JS and CSS in HTML, etc. You also have TypeScript
i.e. there are more arbitrary interleavings of composed languages in modern JavaScript.
I think the problem is that the original authors of those tools are NOT necessarily using grammars, so a given metalanguage can’t necessarily handle what they have done with arbitrary code, at least not easily.
I am trying to decide if shell is a special case or not, as far as the complexity of the syntax
I wouldn’t say it’s exactly a special case, because there are a number of other languages whose grammars can not be specified in a way that we can meaningfully statically reason about (give me an LL/LR grammar and I’ll guarantee strong properties about it; give me, say, a PEG and… I can’t). But it is definitely in the “I wish the syntax hadn’t evolved in this way, because it’s difficult to reason about for humans and machines” category!
Yeah I think shell and Ruby have similar syntactic complexity, judging by this post on “weird lexical syntax”
https://justine.lol/lex/
and by the recent posts on the Prism Ruby parser:
https://lobste.rs/s/4jpfyy/prism_2024_history_ruby_parsers
Shell has the same issue that Ruby has – you have to parse it to lex it:
It might be the case that Ruby isn’t possible to lex without parsing.
A reason in shell is that
$(case x in x) echo hi ;; esac)
has 1 left paren, and 2 right parens, and you have to parse to know where the $( ends. You can’t just lex.
And I also think that complicates any kind of incremental parsing, if the lexer has to invoke the parser.
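To make that concrete, here is a toy Python sketch (my own, not Oils code) of why “just counting parens” mis-lexes this: a paren-counting lexer closes the command substitution at the case-arm ‘)’, while the real closer is the final ‘)’, which only a parser that knows it is inside a case can find.

    src = "$(case x in x) echo hi ;; esac)"

    def naive_cmdsub_end(s):
        # lexer-only view: every ')' can close the '$('
        depth = 0
        for i, ch in enumerate(s):
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth == 0:
                    return i
        return -1

    print(naive_cmdsub_end(src))  # 13: the ')' after the case pattern -- wrong
    print(len(src) - 1)           # 30: the ')' that actually closes the '$('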
So then to me a very fruitful research question is to find a parsing metalanguage that can express shell and Ruby (and JS and C++, etc.)
I think Treesitter comes very close, but doesn’t hit the mark. Again, to the authors’ credit, they tested it on some of the most difficult languages, but perhaps not the Ruby/shell level of difficulty.
I noticed that Wagner’s incremental lexing can accept arbitrary lexers via runtime instrumentation, rather than static analysis. (Notably, Treesitter doesn’t implement Wagner incremental lexing – though you have to use its peculiar interface!)
On the other hand, Wagner incremental parsing is grammar-based. But IMO this restriction makes it infeasible for shell and Ruby. I suppose it’s an open question if some approximation is “good enough”, but I don’t think so, and I don’t see it in practice.
Originally I thought I would use PEGs for https://www.oilshell.org/, but PEGs don’t traditionally separate lexing and parsing, and there is a non-trivial interface to the lexer in Oils. For example, there is a stack of “hints” in the lexer to find the closing ) in the case above, and there are a few other mechanisms too.
But our parsing style [1] is not unlike PEGs, and it is expressive. It turned out to be capable of expressing 4 interleaved sublanguages with different lexical syntax, as well as 6 or so mini-languages. (We also interleave with a generated parser!)
I haven’t looked in detail at work on incrementally parsing PEGs [2], but if making PEGs incremental is possible, then I think making the Oils model incremental is probably also possible.
That is, recast it as something more like a set rather than an algorithm, like PEGs. I think in practice having the lexer/parser separation is useful, and it probably doesn’t complicate the theory too much (?)
As a concrete example, PEG has “lookahead assertions”. Somewhat late in the Oils development, I added a method to “look ahead with a certain lexer mode” to handle a few language constructs. A lexer mode is simply a mapping of regular languages to tokens.
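A toy sketch of what a lexer mode and mode-aware lookahead could look like (the mode names and token sets here are invented for illustration; this is not Oils’ actual interface):

    import re

    # each mode maps regular languages to token kinds
    MODES = {
        'ShCommand': [(r'\$\(', 'CmdSubBegin'), (r'[A-Za-z_]+', 'Word'), (r'\s+', 'WS')],
        'DQ':        [(r'\$\(', 'CmdSubBegin'), (r'[^"$]+', 'Lit'), (r'"', 'DQEnd')],
    }

    def next_token(src, pos, mode):
        for pattern, kind in MODES[mode]:
            m = re.match(pattern, src[pos:])
            if m:
                return kind, m.group(0), pos + m.end()
        return 'Unknown', src[pos], pos + 1

    def lookahead_with_mode(src, pos, mode):
        # peek at the next token under a different lexer mode, without consuming it
        kind, _, _ = next_token(src, pos, mode)
        return kind

    print(next_token('echo "$(date)"', 0, 'ShCommand'))  # ('Word', 'echo', 4)
    print(lookahead_with_mode('"$(date)"', 1, 'DQ'))     # 'CmdSubBegin'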
So that kind of thing is a small difference that I think should be possible to formalize, and recast as a set, although certainly there could be other issues that make it harder.
I would also conjecture that if such a metalanguage can handle shell, it can handle Ruby and JavaScript and TypeScript.
C/C++ may need an additional hack for the symbol table -> lexer feedback, and possibly the “most vexing parse”.
[1] How to Parse Shell Like a Programming Language - https://www.oilshell.org/blog/2019/02/07.html
https://github.com/oils-for-unix/oils/wiki/Why-Lexing-and-Parsing-Should-Be-Separate
[2] e.g. https://people.seas.harvard.edu/~chong/abstracts/YedidiaC2021.html
It might be the case that Ruby isn’t possible to lex without parsing.
This has been my experience, having written a from-scratch Ruby parser/interpreter/etc. that gets all the hard stuff right — doing it the traditional lex/yacc way, you need many “lexical tie-ins” (i.e. situations where the parser modifies lexer state).
You can use a just-in-time lexer without falling all the way back to feeding the parser the characters one by one…
This is why the BABLR core is expressed as a GLR parser VM not an LL VM. Effectively you “just write the parser”
AFAICT Rocq (née Coq) is LL1 but parsing requires partial evaluation (extensible syntax with notations let one sentence change the parser before it parses the next one). I’ve only heard dire warnings about trying to implement your own parser.
IntelliJ Java parser is incremental. It is absolute table stakes. Without it you wouldn’t be able to get the snappiness that IntelliJ has even in modest files. Source: wrote lots of code in IntelliJ back then. Including lots of different parsers.
Thanks, I’ll double check this! Let me be more precise about my current beliefs about “generally doesn’t do” for the record, to pin down exactly how wrong I am. This is about the state around 2016-2017, haven’t looked at the code much since then:
Anything that builds on top of Grammar-Kit is not incremental
The Dart plugin in particular doesn’t even do incremental text sync with the analysis server
XML parsing is incremental, as the files are too big
I don’t know about Java&Kotlin parsers specifically, they might be incremental
Every lexer is incremental
On top of the full parser, there’s a tree-diffing layer that makes updating the PSI tree incremental.
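My guess at what that layer does, as a toy Python sketch (not JetBrains’ actual code): walk the old and new trees in parallel, reuse old node objects wherever the subtrees are equal, and report only the nodes that genuinely changed.

    def diff(old, new, changes):
        if old == new:
            return old                      # reuse the old node object wholesale
        if isinstance(old, tuple) and isinstance(new, tuple) \
                and old[0] == new[0] and len(old) == len(new):
            kids = tuple(diff(o, n, changes) for o, n in zip(old[1:], new[1:]))
            return (old[0],) + kids
        changes.append((old, new))          # a genuinely changed node
        return new

    old_tree = ('File', ('Fn', 'main', ('Block', 'a')), ('Fn', 'helper', ('Block', 'c')))
    new_tree = ('File', ('Fn', 'main', ('Block', 'a')), ('Fn', 'helper', ('Block', 'c', 'd')))
    changes = []
    merged = diff(old_tree, new_tree, changes)
    print(changes)                   # only helper's Block differs
    print(merged[1] is old_tree[1])  # True: the whole 'main' subtree is reused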
Hm, I think this is basically right? Java is parsed incrementally, but there’s no direct support for incrementality in Grammar-Kit, and many other languages are not incremental. In particular, Python only grew support for incremental parsing this year:
https://github.com/JetBrains/intellij-community/commit/ba6015d27f49636a9824db4e063fdfc17618b0ef
Which is to say: incrementality is useful, but you probably can get by without it. This is in contrast to resilience, without which the thing simply won’t work most of the time.
That being said, my original wording is misleading! It does imply that Java parsing is not incremental, and that is plain wrong, thanks for chiming in here to correct that!
I don’t have IntelliJ around anymore, but I bet that if you open a 20k-line file and start typing, you could easily tell which parser is incremental and which is not.
So you’re probably right: it is not essential. But if you want to dethrone IJ, then it is :)
Sorry, I don’t know anything about Grammar-Kit. The last time I touched IDEA was probably 2008-2009 :) So you should replace “is” with “was” below.
As for the Java incremental parser: it is important to understand that the IntelliJ incremental parser is not optimal, i.e. it doesn’t guarantee a minimal reparse, but does something that works well in practice. E.g. if I remember correctly, a code block is the unit of reparse, which can still be pretty big but is much smaller than the whole file.
Another case where incremental parsing is essential is when you type ‘{‘: you don’t want the full file structure to break.
An additional aspect of incrementality is to detect when a parse yields the same tree structure and not generate node updates (I’d call this an incremental parse tree).
And on a somewhat more general note: to be snappy you need to press ctrl+space and see completion in < 100ms. That includes doing a lot of work, re-parsing and re-resolving lots of files, which means that in reality a parsing update should be on the order of milliseconds. We had Eclipse engineers, including EG, coming to our booths and asking why we were so fast. They never really got that it starts from the lowest level of code manipulation.
PS all credit for the speed goes to incredible kip for building PSI and later max.
Yeah, that’s also a very important point: you don’t need to invent a general incremental parsing algorithm; simple heuristics work, and they don’t actually require modifying the parser itself much.
One related anecdote here is that rust-analyzer included more or less the same block reparsing heuristic from the beginning, but, as far as I am aware, it is still not actually enabled. Not sure if this is because computers in 2018 were faster than in 2008, because it’s a JVM vs Rust thing, or due to the inherent asynchrony of LSP, but parsing never was near the top of the cost-centers for completion; it was always Rust’s global-ish type inference with traits.
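For the record, that heuristic can be sketched in a few lines of Python (a toy; neither IntelliJ’s nor rust-analyzer’s real code): find the old top-level {...} block containing the edit, and if the edited text still has balanced braces, re-parse only that slice; otherwise fall back to a full re-parse.

    def parse_file(text):
        # "parsing" here just records top-level {...} spans
        blocks, depth, start = [], 0, None
        for i, ch in enumerate(text):
            if ch == '{':
                if depth == 0:
                    start = i
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    blocks.append((start, i + 1))
        return blocks

    def reparse_after_edit(old_text, new_text, edit_pos):
        delta = len(new_text) - len(old_text)
        for (s, e) in parse_file(old_text):
            if s <= edit_pos < e:
                candidate = new_text[s:e + delta]
                if candidate.count('{') == candidate.count('}'):
                    return ('reparsed one block', (s, e + delta))
        return ('full reparse', parse_file(new_text))

    old = "fn a() { x } fn b() { y }"
    new = "fn a() { x + 1 } fn b() { y }"
    print(reparse_after_edit(old, new, 10))  # only fn a's block is re-parsed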
I think the main use case for incremental parsing would be as a basis for incrementalizing complex and costly analyses, maybe even compilation.
Imagine analyses (or even compilation) implemented with queries like “find the main function”, “get constructors of type X” etc., and a framework that records the queries done to answer one query and caches results.
(I think the idea is similar to Rust’s Salsa and/or Adapton, but I don’t have too much experience in any of them, so I’m not sure.)
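A minimal sketch of such a framework, in the spirit of Salsa/Adapton but with made-up names and none of the real machinery: queries record which inputs they read, results are cached, and changing an input drops the results that depended on it.

    class Db:
        def __init__(self):
            self.inputs, self.cache, self.deps, self.stack = {}, {}, {}, []

        def set_input(self, key, value):
            self.inputs[key] = value
            for q, reads in list(self.deps.items()):
                if key in reads:              # this query read the changed input
                    self.cache.pop(q, None)
                    self.deps.pop(q, None)

        def get_input(self, key):
            if self.stack:                    # record the dependency
                self.deps[self.stack[-1]].add(key)
            return self.inputs[key]

        def query(self, name, compute):
            if name not in self.cache:
                self.deps[name] = set()
                self.stack.append(name)
                self.cache[name] = compute(self)
                self.stack.pop()
            return self.cache[name]

    db = Db()
    db.set_input('main.rs', 'fn main() {}')
    print(db.query('ast', lambda d: ('File', d.get_input('main.rs'))))
    db.set_input('main.rs', 'fn main() { 1 }')   # invalidates the cached 'ast'
    print(db.query('ast', lambda d: ('File', d.get_input('main.rs'))))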
Incremental parsing (as described by Wagner’s thesis, mentioned in the blog post) gives you the changed nodes in the AST, which you can use to invalidate results of queries that use those nodes, and start recomputing from there.
Without incremental parsing you have to compare the program AST before and after an edit to find which nodes are updated and invalidate query results based on that. I suspect it may be more efficient with incremental parsing.
Not sure! The thing is, if you add a blank line at the start of the file, you need to recompute all error messages (as they include line numbers) but keep the errors themselves (as nothing about the semantics of the file changes).
The way this works is that you have a sort-of recompilation firewall query, which takes the raw syntax tree and splits it into volatile parts (text offsets, specific concrete syntax choices, etc.) and stable parts (the set of names defined in a file). Most of the costly recompilation is driven by changes in that small, stable part. And, as the firewall query is arbitrary code, it is kinda hard to incrementalize it automatically, but it is also cheap itself, so diffing its results is fast enough.
IOW, you can’t directly feed incremental tree updates into semantic analysis.
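Concretely, the firewall query can be as simple as “the set of top-level names” (a toy sketch, not rust-analyzer’s actual query): offsets change after the blank line is inserted, the stable summary does not, so downstream analysis is not re-run.

    def firewall(syntax_tree):
        # keep only the stable facts; drop volatile ones like offsets
        return frozenset(name for kind, name, _ in syntax_tree if kind == 'fn')

    before = [('fn', 'main', 0), ('fn', 'helper', 40)]
    after  = [('fn', 'main', 1), ('fn', 'helper', 41)]   # blank line added on top

    if firewall(before) == firewall(after):
        print('offsets changed, names did not: skip re-running semantic analysis')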
you need to recompute all error messages (as they include line numbers)
Why not just compute the error messages on the fly using the current positions? Sum trees make it log(n) to figure out line and column information on the fly without needing to store it
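For example (a sketch using a flat, sorted newline index; an actual sum tree / Fenwick tree over line lengths would additionally make updates logarithmic, which is the point being made above):

    import bisect

    def line_index(text):
        return [i for i, ch in enumerate(text) if ch == '\n']

    def line_col(newlines, offset):
        line = bisect.bisect_right(newlines, offset - 1)
        col = offset - (newlines[line - 1] + 1 if line else 0)
        return line, col

    text = "fn main() {\n    let x = 1;\n}\n"
    nl = line_index(text)
    print(line_col(nl, text.index('let')))  # (1, 4): second line, column 4, 0-based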
Another interesting point related to this topic might be what Unison does. To my knowledge, it stores basically “AST-nodes” in its own, normalized format. This gives the language automatic deduplication of identical functions (even when they are not textually identical, only semantically, e.g. whitespace difference).
Treesitter’s original application is smart syntax highlighting, and Treesitter is incremental … I’m pretty sure that will be too slow without incremental parsing.
e.g. compare it with Vim – it updates on every keystroke, does the weaker kind of syntax highlighting, and it is line-based.
If you want syntax highlighting to be as responsive as Vim’s, and if Treesitter were NOT incremental, I don’t think that would work well. I think a full parse is too slow on every keystroke
I think IntelliJ must have more Vim-like syntax highlighting, not something like Treesitter? Not that this is bad, I think there can be a better compromise than Treesitter.
Although I kinda want to investigate some kind of “coarse parsing” first. So instead of using a line as a window like Vim does, maybe you can use some delimiters like {}. Although then that also has the resiliency issues.
Not a good point of comparison as vim (to the best of my knowledge) doesn’t do any “parsing” so much as it does “brute force regex all things” with different regex rules for each syntax element. I don’t think it does it just once, either - pretty sure it tries repeatedly as the “syntax window” shifts. You’d be laughed out of the ballpark if you shipped a compiler (or whatever) that “parsed” your language in the same fashion vim does.
(This applies to neovim as well, but neovim has a still-in-the-works “native” TreeSitter integration to change that.)
I’m comparing them because both vim and Treesitter do syntax highlighting, and I believe Treesitter is incremental because doing full parses would be too slow for that use case.
I also think it should be possible to get the best of both worlds – accuracy and speed.
I think IntelliJ must have more Vim-like syntax highlighting, not something like Treesitter?
Since I’m on a trip down memory lane: the IntelliJ highlighter is multi-layered. On the lowest level the highlighter uses lexer information, but it keeps adding more highlights as more information becomes available. That includes syntax trees and even resolved references (semantic highlighting).
I wrote a syntax highlighter using libclang around 15 years ago and it did three things:
Use precompiled headers so it didn’t need to reparse the megabytes of headers in a typical C-family compilation unit (not necessary for other languages).
Tokenize immediately, parse later.
Keep highlighting from previous parses until it was replaced.
It was not incremental and the syntax highlighting often lagged a second or so behind (on a 1.2 GHz P3 Celeron), but it was pretty usable.
Treesitter’s original application is smart syntax highlighting, and Treesitter is incremental … I’m pretty sure that will be too slow without incremental parsing.
Tree-sitter is not the right solution if you only need syntax highlighting. Syntax highlighting can be done with a lexer/tokenizer, because you don’t need the tree structure for it, and lexing will be much faster to run from scratch and also easier to make incremental if it’s still not fast enough.
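For instance, a purely lexical highlighter really is just one regex pass over the text (toy token set for illustration, nothing language-accurate), which is cheap enough to redo from scratch on every keystroke:

    import re

    TOKEN_RE = re.compile(
        r'(?P<kw>\b(?:fn|let|return)\b)|(?P<num>\d+)|'
        r'(?P<str>"[^"]*")|(?P<comment>//[^\n]*)')

    def highlight(line):
        # returns (style, text) spans; everything unmatched keeps the default style
        return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(line)]

    print(highlight('let x = 42; // answer'))
    # [('kw', 'let'), ('num', '42'), ('comment', '// answer')]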
Unless you’re using a language like C++ where you can’t tell the difference between a variable declaration and function declaration without type information.
you need LSP semantic highlighting to correctly highlight the Most Vexing Parse; tree-sitter doesn’t work (IIRC it always highlights it as a function declaration)
If you need type information, you also can’t do it properly with a parser.
I am kind of going on a tangent here, because the original post is about structured editing, not syntax highlighting
But the original goal of Treesitter was to use grammatical structure for syntax highlighting, not just lexical structure, like say Vim or SublimeText. This was clearly stated in one of the talks - https://tree-sitter.github.io/tree-sitter/#talks-on-tree-sitter
Tree-sitter enables reference highlighting, where various identifiers can be highlighted differently depending on whether they’re referencing a variable, function, parameter, constant, etc. GitHub uses this for some languages, and also extends it to cross-file reference resolution with stack graphs. Nvim-treesitter does have in-file reference highlighting but it is very slow so I usually disable it once the file grows beyond a certain size.
I’m pretty sure that will be too slow without incremental parsing.
That’s exactly the thesis of matklad’s post: it’s often easily fast enough without incremental parsing. I have had similar experiences with parsing in response to keystrokes in an editor, and you can see a lot of LSP implementations (which provide “semantic” syntax highlighting based on the full parse tree) also work this way. This includes IntelliJ; as matklad put it, “it has full access to syntax tree all the time.” The expensive thing here is usually not parsing itself but all the analysis done on the result.
It’s possible tree-sitter itself might be too slow without incremental parsing, but tree-sitter uses a relatively expensive parsing algorithm (for good reason, don’t take this as a dunk on tree-sitter in any way) so that wouldn’t really be indicative of the general “parse from scratch” approach. (At the same time, outside of syntax highlighting, I have done some benchmarking of non-incremental tree-sitter against things like clangd, and it’s still much faster than those “traditional” frontends…)
For tree-sitter, two things are important:
it parses all languages, not a single language. If you are building a universal solution, you might as well incur more engineering complexity as it amortizes better
IIUC, tree-sitter was originally meant to be used alongside a scripting language, so you’d have one JS object per syntax node. Avoiding churning through GC objects is worthwhile, and incrementality gives you that. IntelliJ also avoids re-allocating the entire PSI tree, but it does that through tree diffing, not through incremental parsing.
Well I sort of changed the subject from structured editing to syntax highlighting
But I would say you should have a 10-20 ms budget per keystroke, for syntax highlighting. (As a data point, 16.6 ms is 60 frames per second). Ideally it would be (much) less than 1 ms, but that’s a maximum. Because the CPU has to do other things too.
That’s definitely enough to parse small files, but large files are a problem.
Many editors and IDEs choke on large files, while Vim generally doesn’t. One solution is to just bail out for large files, since 99% are small. You can “degrade” to a text editor in that case.
I’m not sure of the exact threshold, but I think it’s pretty easy to find 100K line files that most tools choke on, e.g. this kind of file https://github.com/microsoft/TypeScript/blob/main/src/compiler/checker.ts
Important clarification from mikea below: for Java, IntelliJ is incremental. But it’s not incremental “by default”; languages have to opt into incrementality, and many don’t, or do it rather late in the lifetime of a particular language plugin.
That’s the crux of caching: you don’t need it if the base system can handle the load. I’d go even further and claim that if the language and compiler are co-designed for efficiency, you might not even need incremental compilation the majority of the time.
It feels like many new tools are using Treesitter, which does have incremental parsing, as mentioned in this post
I personally don’t think Treesitter’s metalanguage is powerful enough – specifically the lexing is not good for shell.
I was also going to say that for HTML/JS/CSS, and Markdown-in-say-Rust-in-markdown, though maybe the “injections” are good enough for that?
But I do feel like Treesitter has a big pile of lexing hacks, and it offloads a lot to the user
e.g. see this recent feedback from a Helix editor dev - https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing#c_zxxu35
As far as I understand, the author of Treesitter succeeded in writing a C/C++ plugin and a JS/TypeScript plugin. And this is a very impressive test of the metalanguage.
But he didn’t get very far with the bash plugin :-/ Because bash has a lot of “extra-grammatical” stuff in it
I looked a little at this recent work https://soft-dev.org/pubs/html/diekmann_tratt__default_disambiguation/
I didn’t look in detail, but my initial impression is to be concerned that any grammar-based approach is not powerful enough for shell either, and maybe HTML/JS/CSS. At the very least, I think you need some “escape” mechanisms into arbitrary code, and that might not be enough either.
e.g. there was a 2018 paper on statically parsing POSIX shell, and they had to extend Menhir’s metalanguage (OCaml) for it:
https://scholar.google.com/scholar?oi=bibs&hl=en&cites=15754961728999604986
And POSIX shell is easier to parse than bash.
As a small data point that structured editors are using Treesitter/incremental parsing – in a recent thread on text-focused programming environments - https://lobste.rs/s/elcjqa/missing_text_focused_programming
Matklad pointed to
https://ki-editor.github.io/ki-editor/docs/pitch
which says
“How can I select the current expression/function/argument/statement?”
And this works as long as your language is blessed by the Tree-sitter grammarians.
I am not sure Treesitter is appropriate for what many people are building on top of it, but it seems to be the trend
I understand that because it exists, and works in many cases in production. But IMO it is not expressive enough, and there are some other criticisms that I linked here:
https://news.ycombinator.com/item?id=39783471
At the very least, I think you need some “escape” mechanisms into arbitrary code, and that might not be enough either.
Tree-sitter has this escape mechanism in the form of an “external scanner” in C which can store arbitrary state and can use whatever method it wants to classify the next token. The input to this external scanner is a list of possible valid tokens. You can get quite complicated behavior in the interplay between the parser driving the scanner and the scanner responding to the parser! I use this heavily in the TLA+ tree-sitter grammar to parse context-sensitive constructs like vertically-aligned conjunction & disjunction lists and also proof steps where the actual number in the step determines what proof group a step is part of.
From your post you linked at the end:
The top comment is from the author of difftastic (the subject here), saying that treesitter Nim plugin can’t be merged, because it’s 60 MB of generated C source code. There’s a scalability problem supporting multiple languages.
Yes, this is certainly a problem with LR(1) grammars - they are huge! The TLA+ tree-sitter grammar currently clocks in at 35 MB of generated C code, which compiles down to about 3 MB. This constitutes the main constraint on any further features I want to add to it - often they will blow up the size of the generated code drastically.
[It] segfaults constantly … More than any NPM module I’ve ever used before. Any syntax that doesn’t precisely match the grammar is liable to take down your entire thread.
This is almost certainly an issue with the hand-written external scanner. My scanner had a few segfaults in it which I wrote about here: https://ahelwer.ca/post/2023-02-04-cpp-bugs/
The latest tree-sitter release included a command to fuzz your external scanner which is nice.
BABLR fixes the issue tree sitter has with lack of expressiveness, especially in the tokenizer layer
I’ll believe it when it supports C/C++, JavaScript/TypeScript, or shell :)
I’m looking forward to that! I need JS for myself, and OSH and YSH are also very high on my list of grammars to write.
It has taken a long time because so far I’ve almost always chosen to iterate on the architecture in order to gain capabilities on the tooling side before I risk the kind of lock-in that supporting popular languages could bring.
You left me out! Structural editors are about to take over everything we do day to day, and I’m making it happen.
My thesis: incremental parsing sucks.
Incremental (oops) error-tolerant parsing sucks because it creates tools which do not understand a user’s intent. Once intent is lost the results of further edits in the IDE are highly unpredictable and often damaging.
This happens because once the input is unparsable an error recovering parser can only guess what might have been meant. This is the peak of “garbage in, garbage out”, a rule that people mysteriously forget in their excitement about error-tolerant parsers.
So on one hand you have error-recovering parsing, an unsolvable problem that requires you to correctly guess what someone was thinking without knowing. A non-starter. No point for serious-minded people to waste any of their time trying to solve a problem that is impossible.
On the other hand you have the problem of making a nice UX for cross-structural edits. This is not only possible, but there’s huge amounts known, written, and implemented about how to make tools with significant creative, expressive power.
Can you give us some links, please?
https://github.com/yairchu/awesome-structure-editors
Forest (paper) and Cursorless are two of my personal faves that have a primary interaction focus on keeping the edit flows working in a way that doesn’t involve textual box selection
Hazel is still preoccupied with text box selection, which is bizarre because they’ve invented the concept of an embedding gap: the single extremely rare thing in IDEs that you really need above all to escape the box selection paradigm (which even Forest and Cursorless lack despite their focus in that direction).
An example I use often is Javascript’s if( ) syntax. It’s extremely common for people to insert this fragment of syntax with the intent of filling out the condition and the consequent. But there are two problems. First, the statement is completely unparseable, since JS requires an expression between the if parens.
Once you fix the basic syntax problem there, the second problem happens: whatever statement happens to be below if(condition) suddenly becomes the body of the if statement, making it conditional. There is no parser error at this moment, so error recovery can’t help! The code is fully valid; it is just out of sync with what the user was intending to do, which we will say in this case was to fill in the body with a new statement.
Hazel gives us the tools to deal with this by having gaps, which they represent as a mid dot ( · )
If you use this tool you can see how we can reverse the patterns and capture intent by inserting
if( · ) ·
Now both problems are fixed! The document isn’t broken according to the rules of the language anymore (it’s just incomplete), and there’s no chance that the statement below will slide up into this node when we don’t want it to, because we’ve created an appropriate placeholder. If it is the user’s intent to place the next statement as the body, it will be a triviality for them to express this by moving that statement into the body gap.
Regex is another one I deal with all the time. Try typing // as the empty regex and watch as even the best error-recovering parser creates a comment and wrecks your program beyond sense. But using a gap like /·/ would capture my intent in a way that wouldn’t unavoidably force the IDE to make a huge mess in the course of completely normal work.
Basically these gaps completely zero out the distance between the conceptual model used for textual programming and the block-with-hole model used by the best existing visual tools like https://scratch.mit.edu/, which are highly semantic but currently inefficient at representing real code because of their failure to embrace syntax as a useful way of compactly viewing programs.
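A toy sketch of the difference the placeholder makes (invented node names, not Hazel’s actual representation): a structured insert of an if with explicit Hole children cannot capture the statement that follows it.

    HOLE = ('Hole',)

    def insert_if(stmts, index):
        # the new `if` gets Hole placeholders for its condition and body,
        # so the statement already at `index` is NOT swallowed as the body
        return stmts[:index] + [('If', HOLE, ('Block', [HOLE]))] + stmts[index:]

    program = [('Call', 'setup'), ('Call', 'teardown')]
    print(insert_if(program, 1))
    # [('Call', 'setup'), ('If', ('Hole',), ('Block', [('Hole',)])), ('Call', 'teardown')]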
Does anyone here actually use a structured/projectional editor as their everyday text editor? With the exception of Lisp programmers using paredit, I don’t think I’ve ever seen structured editing in the wild.
I’d argue that structured editing is extremely common, it’s just that we usually perform it using APIs as opposed to UIs. A linter, transpiler or codemod system is a kind of structure editor in that you interact with a semantic model of the program instead of the textual embedding in order to make changes
Well, sure, I guess, but I’m interested in knowing whether anyone uses a structured editor as their everyday text editor. No-one writes code with a linter or a transpiler: those are tools that you use to process code that you’ve already written with a text editor.
I always enjoyed using Paredit when writing Lisp but I don’t know of any other structured editor that’s widely used. I suppose Vim’s text objects are similar but not really the same. I saw the Ki Editor mentioned in @andyc’s comment and thought that looked cool: I wonder if anyone here has any experience using something like that.
Hey @c– I’m the author of Ki Editor, and Ki has been my main driver for almost a month, after I completely bootstrapped it from Neovim. It is sort of like a subset of Paredit, but not restricted to Lisp languages, so operations like Swap Sibling Nodes, Delete Current Node, Select Current Node, and Go to First/Last Node work as long as a tree-sitter grammar is defined for the language. For example, I use Ki daily for structurally editing Swift, Rust, Python, and Typescript.
And no, structural editing in Ki is not an afterthought; merely supporting structural selection or operations is not enough, just like a Chinese restaurant serving spaghetti does not make it a Western restaurant. So even Neovim and Helix cannot be counted as structural editors, although they do support tree-sitter operations.
But then, structural editing is really not for everyone; it doesn’t click if you don’t know how syntax trees work.
Editing code in Ki is so efficient that I don’t know how to put it into words: it makes Vim/Kakoune/Helix motions feel like a jumbled mess of keyboard shortcuts.
Thank you for replying here: that’s fantastic. I didn’t realize Ki is so new.
it makes Vim/Kakoune/Helix motions feel like a jumbled mess of keyboard shortcuts.
This is what I was wondering about. Vim’s text objects seemed amazing when I first understood them but they are very much a poor second to something like Paredit. I was wondering if, in practice, it is possible to get the Paredit experience with a non-parens-based syntax. It sounds like it is!
Go talk to them! They have a matrix room: https://matrix.to/#/#ki-editor:matrix.org
I also wrote about structural editing a few weeks ago: https://osa1.net/posts/2024-11-02-structural-editor.html.
I think we need a more structural editing solution, but we cannot go fully structured: that will require either inventing a new language (extremely difficult on its own), or supporting sometimes hundreds of types of expressions and statements in a structural way (not sure if possible conveniently).
What I propose in my blog post is a middle ground where you have structural editing at the top-most layers, starting from packages to statement/expression level, and from there you edit as before. I think it can be done, and should give us a lot of the benefits of structural editing.
I got around this by inventing a new language that isn’t a programming language but is rather a data description language not unlike HTML. Most people forget that structured editing is a solved problem on the internet to the extent that I’m using a live structure editor to compose this post right now.