Tangent/clarification: the regex slowness issue is almost certainly due to using Perl-style exponential time regexes (which are required by Sublime syntax definitions) rather than “regular languages”.
The syntect library mentioned uses fancy-regex (https://github.com/fancy-regex/fancy-regex), which is explicitly different from Rust’s default regex crate, in that it supports regex constructs that require exponential backtracking:
In particular, it uses backtracking to implement “fancy” features such as look-around and backreferences, which are not supported in purely NFA-based implementations (exemplified by RE2, and implemented in Rust in the regex crate).
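To make the difference concrete, here is a minimal sketch (assuming the regex and fancy-regex crates as dependencies; the pattern is made up): the finite-automaton regex crate rejects look-around at compile time, while fancy-regex accepts it and implements it with backtracking.

```rust
// Minimal illustration, assuming the `regex` and `fancy-regex` crates as
// dependencies. The pattern uses look-ahead, which the finite-automaton
// based `regex` crate rejects at compile time, while `fancy-regex` accepts
// it and implements it with backtracking.
fn main() {
    let pattern = r"foo(?=bar)"; // "foo", but only when followed by "bar"

    // The `regex` crate refuses the construct outright.
    assert!(regex::Regex::new(pattern).is_err());

    // `fancy-regex` compiles it; its match methods return Result because
    // the backtracking search itself can fail (e.g. on a backtracking limit).
    let re = fancy_regex::Regex::new(pattern).expect("fancy-regex accepts look-ahead");
    assert!(re.is_match("foobar").unwrap_or(false));
    assert!(!re.is_match("foobaz").unwrap_or(true));
}
```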
I think “regexes” have gotten a bad name so I call them “regular languages” now. That is, RE2, rust/regex, and libc are “regular language” engines while Perl/Python/PCRE/Java/etc. are “regex” engines.
In my experience, lexing with regular languages can beat hand written code. In the case of Oil vs. zsh, it’s beating hand-written C code by an order of magnitude (9-10x).
http://www.oilshell.org/blog/2020/01/parser-benchmarks.html
https://lobste.rs/s/77nu3d/oil_s_parser_is_160x_200x_faster_than_it_was
First, TextMate / Sublime style syntax highlighting is not really all that great. It is quite slow, largely because it grinds through a lot of regular expressions with captures
pedantic: I think the issue is probably more lookahead than captures? You can do captures quickly in a “regular language” engine.
It may be surprising just how much slower regex-based highlighting is than fast parsers. The library that xi uses, syntect, is probably the fastest open source implementation in existence (the one in Sublime is faster but not open source). Even so, it is approximately 2500 times slower for parsing Markdown than pulldown-cmark.
fancy-regex only uses backtracking when “fancy” (principally, look-around and backreferences) features are used. Idk what proportion of syntax regexes actually use said fancy features though.
Also, I’m pretty sure syntect has only just recently switched to using fancy-regex. It was using Oniguruma (Ruby’s regex engine) for a long time. I haven’t been paying especially close attention, but IIRC, fancy-regex didn’t really improve things much. Not sure what’s going on, but it sounds potentially interesting. I definitely wouldn’t leap to the “it’s exponential” explanation immediately.
pedantic: I think the issue is probably more lookahead than captures? You can do captures quickly in a “regular language” engine.
In the RE2 ancestry of regex engines, no, you cannot. Resolving capturing groups is a known sore point because it needs to use either the bounded backtracking or PikeVM regex engines, which are much slower (~order of magnitude) than a DFA in common implementations. One of my major goals over the next few years is to fix this.
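As a rough user-level illustration of that cost (a sketch with the Rust regex crate; the pattern and input are made up): is_match/find stay on the fast matching path, while captures has to resolve group boundaries, so it pays to only request captures when the group offsets are actually needed.

```rust
use regex::Regex;

// Sketch: `find_iter` answers "where are the matches?" without resolving
// capture groups, while `captures` also resolves group boundaries and is
// noticeably slower. Only the second question needs the slower path.
fn count_and_sample(haystack: &str) -> (usize, Option<String>) {
    // Made-up key=value pattern, just to have two groups.
    let re = Regex::new(r"(\w+)=(\w+)").unwrap();

    // Fast path: counting matches needs no capture resolution.
    let count = re.find_iter(haystack).count();

    // Slower path: pull out the groups, but only for the first match.
    let first = re
        .captures(haystack)
        .map(|caps| format!("{} -> {}", &caps[1], &caps[2]));

    (count, first)
}

fn main() {
    let (n, first) = count_and_sample("a=1 b=2 c=3");
    assert_eq!(n, 3);
    assert_eq!(first.as_deref(), Some("a -> 1"));
}
```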
In my experience, lexing with regular languages can beat hand written code. In the case of Oil vs. zsh, it’s beating hand-written C code by an order of magnitude (9-10x).
I think before throwing this around, one needs to consider the viability of using such a technique with syntax highlighting. I’ve never invested significant effort into syntax highlighting, but the two immediate problems that jump out at me are 1) the size of the state machines and 2) whether the lexer framework on its own provides enough “power” for quality syntax highlighting.
I think there is a subtlety here – even if fancy-regex doesn’t do backtracking for regexes that don’t require it, the author of syntect points out here (https://news.ycombinator.com/item?id=23665341) that a speed problem with Sublime’s parsing model is repeatedly reading the same bytes.
So I say there that in a “regular language” engine you can just | the patterns together and avoid this (which I do in Oil). But that composition doesn’t hold, or is more complex, for Perl-style regexes.
About captures, I think the last time we talked about it I pointed to this re2c paper: https://arxiv.org/abs/1907.08837. However I haven’t actually done anything with it, and Oil’s lexer doesn’t use captures. (I am sort of curious to benchmark some Sublime-ish workloads with re2c but I probably won’t have the time).
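For concreteness, here is a toy version of that “just | the patterns together” approach using the Rust regex crate (not Oil’s actual lexer; the token set is invented): one anchored alternation with a named group per token kind, so a single match per token both finds it and says which kind it is.

```rust
use regex::Regex;

// Toy lexer: one combined, anchored alternation. The `\A` anchor means a
// failed alternative never scans ahead into the rest of the input, and the
// named groups identify which branch matched.
fn lex(mut input: &str) -> Vec<(&'static str, String)> {
    let re = Regex::new(
        r"\A(?:(?P<num>[0-9]+)|(?P<ident>[A-Za-z_][A-Za-z0-9_]*)|(?P<op>[-+*/=])|(?P<ws>[ \t]+))",
    )
    .unwrap();

    let mut tokens = Vec::new();
    while !input.is_empty() {
        let caps = match re.captures(input) {
            Some(c) => c,
            None => break, // unrecognized byte; a real lexer would report an error
        };
        for kind in ["num", "ident", "op", "ws"] {
            if let Some(m) = caps.name(kind) {
                if kind != "ws" {
                    tokens.push((kind, m.as_str().to_string()));
                }
                break;
            }
        }
        input = &input[caps.get(0).unwrap().end()..];
    }
    tokens
}

fn main() {
    assert_eq!(
        lex("x = 42"),
        vec![
            ("ident", "x".to_string()),
            ("op", "=".to_string()),
            ("num", "42".to_string()),
        ]
    );
}
```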
Yes you need more power than regular languages to do syntax highlighting. String interpolation syntax, shell here docs, Rust/Lua multiline strings, etc. are a major reason.
In Oil I simply add a tiny layer of “lexer modes” on top to solve that problem: http://www.oilshell.org/blog/2017/12/17.html
It’s utterly trivial but solves all the lexing problems. So my point is that Sublime should not have based its parsing model on Perl-style regexes. It could have used regular languages plus some trivial state machine on top.
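A minimal sketch of the “lexer modes” idea (not Oil’s actual implementation, just the shape of it): a tiny state variable picks which combined regex to run, and certain tokens switch the state. Here, an outer mode plus a double-quoted-string mode that understands $var interpolation.

```rust
use regex::Regex;

#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Outer,
    DoubleQuoted,
}

// The mode selects the regex; the quote token flips the mode. That tiny
// state machine is the entire extra mechanism layered on top of the
// regular-language lexing.
fn lex(src: &str) -> Vec<(Mode, String)> {
    let outer = Regex::new(r#"\A(?:[^"]+|")"#).unwrap();
    let dq = Regex::new(r#"\A(?:\$[A-Za-z_][A-Za-z0-9_]*|[^"$]+|"|\$)"#).unwrap();

    let mut mode = Mode::Outer;
    let mut pos = 0;
    let mut tokens = Vec::new();
    while pos < src.len() {
        let re = match mode {
            Mode::Outer => &outer,
            Mode::DoubleQuoted => &dq,
        };
        let m = re.find(&src[pos..]).expect("every byte is covered by some rule");
        pos += m.end();
        let text = m.as_str().to_string();
        let is_quote = text == "\"";
        tokens.push((mode, text));
        if is_quote {
            mode = if mode == Mode::Outer { Mode::DoubleQuoted } else { Mode::Outer };
        }
    }
    tokens
}

fn main() {
    // "echo ", "\"", "hi ", "$name", "\"" -- five tokens across two modes.
    assert_eq!(lex(r#"echo "hi $name""#).len(), 5);
}
```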
Vim syntax highlighting does something like that as far as I know. It goes line by line and I think there’s an extra “state” variable for each line, outside the regex state.
Right. But that blows up the size of the automaton. IIRC, the author of re2c is specifically on record as saying that these aren’t good techniques for general purpose regex engines like RE2 because of the size requirements. Remember, RE2 doesn’t do full DFA compilation. Now maybe it is good enough for the syntax highlighting use case. I don’t know. Maybe you can afford on-demand full compilation, but there are probably a lot of details to work out.
I don’t have enough knowledge about the syntax highlighting domain to debate the particulars, unfortunately. I would guess that there are other trade-offs, e.g. ease of writing syntax highlighting rules.
I remember reading from a pretty authoritative source (I can’t for the life of me remember where, but it was a deep dive into regexes by one of the creators/main implementers) that the algorithms you need to use to support the PCRE features drastically increase the runtime cost.
Yes. I wrote Rust’s regex engine. PCRE and its ilk are implemented with backtracking, which takes worst case exponential time. That isn’t really the issue here. Whether that’s the root cause of the performance problems requires a deeper analysis, and my instinct says “probably not, or at least, it’s only part of the story.”
People like to jump on time complexity guarantees very quickly, which tends to over simplify the matter. Sometimes it is indeed the right explanation. But in practice, there are many different kinds of performance cliffs in regexes.
Hm all my experience disagrees with this. I would say that computational complexity is essentially the only thing that the user of a regex engine cares about regarding performance (or should care about).
In other words, backtracking is the number one performance problem in practice with regexes and backtracking is a computational complexity issue.
Computational complexity isn’t sufficient analysis for all situations (e.g. https://news.ycombinator.com/item?id=23363165) but it’s never steered me wrong when dealing with regular expressions.
In other words, I don’t really care if it’s NFA or DFA. Both RE2 and rust/regex do submatch extraction without backtracking right? If so, that’s the high level bit that matters, and the rest are just details from a user’s perspective.
Yes I am probably sweeping a lot of your hard work under the rug :) I’m sure there are hundreds of things you know about regexes that I don’t because you’ve implemented a regex engine. But I still disagree with your assessment from a user perspective, based on years of experience in many applications.
Reviewing the benchmark I wrote last time (https://news.ycombinator.com/item?id=23665569), I remember I hit the 8MB hard-coded limit in RE2. But I still think that’s a non-issue for any application I know of. Its usage is proportional to the pattern size as far as I remember.
I’ve used a variety of regex engines in a variety of applications for 15+ years:
lexing programming languages in C, Python, and JavaScript (the lack of the sticky /y bit in ES3 caused regex-based lexers to be O(n^2), which was one reason I gave up programming JS for a while; there’s a sketch of this pitfall after this list)
big data, e.g. running MapReduce-style jobs over Google’s web indices. I used RE2 when it was first developed at Google and experienced its power firsthand.
I borrowed the trigram index technique Cox wrote about in one of my programs shortly after he wrote it (and before he published any of those articles)
years and years of shell and command line stuff (grep, sed, awk)
web server config files (Google had an internal network outage due to a backtracking regex in proxy.pac back in the day, a very common bug.)
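Here is the sticky-flag pitfall from the JavaScript item above, transplanted to Rust as a sketch (the patterns are made up): an unanchored pattern run against the remaining input scans ahead on every failed attempt, which is what made those lexers O(n^2); anchoring each attempt with \A keeps the work proportional to the token.

```rust
use regex::Regex;

fn main() {
    let unanchored = Regex::new(r"[0-9]+").unwrap();
    let anchored = Regex::new(r"\A[0-9]+").unwrap();

    let input = "letters then digits 12345";

    // Unanchored: "is there a number *anywhere* ahead?" -- succeeds only by
    // scanning past all the letters, which is not what a lexer wants to pay
    // for on every token.
    assert!(unanchored.find(input).is_some());

    // Anchored: "is there a number *right here*?" -- fails immediately at
    // position 0 instead of scanning the rest of the input.
    assert!(anchored.find(input).is_none());
    assert!(anchored.find("12345 rest").is_some());
}
```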
Here are several examples:
The syntect example right here!!! If anything, this thread made me more confident in my opinion. The issue is backtracking because alternation isn’t compiled into a DFA or NFA.
The main reason is that the parsing model is applying a whole bunch of unanchored regexes to the unparsed remainder of the line one after another until one matches, then starting again for the next token. This means each token parsed can require dozens of regex matches over the same characters. https://news.ycombinator.com/item?id=23665134
Both the Sixty and ShellCheck parsers moved away from parser combinators to avoid their byte-by-byte backtracking. See the measured performance improvements after the algorithmic change.
The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking. This is essentially the same bug that happened at Google and dozens of other places. It happens to everyone who relies on Perl-style engines to match URLs (and URLs are not big data! Whenever you have a performance problem involving 1024 bytes of data, you have a computational complexity problem :) )
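For reference, the classic shape of such a pattern, as a sketch (using the Rust regex crate, which guarantees linear-time matching, so the same pattern that melts a backtracking engine is decided instantly here):

```rust
use regex::Regex;
use std::time::Instant;

// The textbook catastrophic-backtracking shape: nested repetition followed
// by something that forces a non-match. In a backtracking engine (PCRE,
// Python's `re`, Perl, ...) the time to reject grows exponentially with the
// number of 'a's; a linear-time engine rejects it immediately. (A sketch
// for illustration, not a rigorous benchmark.)
fn main() {
    let re = Regex::new(r"^(a+)+$").unwrap();
    let haystack = format!("{}!", "a".repeat(10_000)); // trailing '!' forces the non-match

    let start = Instant::now();
    assert!(!re.is_match(&haystack));
    println!("non-match decided in {:?}", start.elapsed()); // microseconds, not hours
}
```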
If anyone disagrees, I would like to hear specific examples of:
applications where the regex gave you linear-time performance (no backtracking), but the performance still wasn’t good enough. I don’t know of any. (I guess there is the DPI stuff by Hyperscan, though again I think one of their main wins is to compile thousands of patterns into a single DFA, much like what I used RE2 for with big data; see the multi-pattern sketch after this list)
applications where backtracking was applied to big data and didn’t blow up in your face after years of usage.
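And for the multi-pattern case mentioned in the first item, a small-scale illustration with the Rust regex crate’s RegexSet (the patterns are invented; RE2::Set and Hyperscan address the same problem at much larger pattern counts): many patterns are matched in a single pass, and the result says which of them matched.

```rust
use regex::RegexSet;

// One pass over the haystack answers "which of these patterns matched?".
// Note that a RegexSet reports *which* patterns matched, not where.
fn main() {
    let set = RegexSet::new([
        r"\bERROR\b",
        r"timeout after [0-9]+ms",
        r"user=[A-Za-z0-9_]+",
        r"\bpanic\b",
    ])
    .unwrap();

    let line = "ERROR: timeout after 500ms (user=alice)";
    let matched: Vec<usize> = set.matches(line).into_iter().collect();
    assert_eq!(matched, vec![0, 1, 2]); // patterns 0, 1 and 2 matched; 3 did not
}
```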
Oof. I’m sorry, but I just don’t have the energy to do this back and forth with you again. The last time we did this, I felt like we were talking at cross purposes for a long time, and I’m getting that feeling again.
I’ll just leave it at this: the HN comment you’re linking to isn’t strong evidence of your conclusion IMO. More details are needed.
Hyperscan, though again I think one of their main wins is to compile thousands of patterns into a single DFA
No. Their main wins were the development of several novel SIMD algorithms that have good scaling properties and are otherwise exceptionally clever. They certainly do not generate one giant DFA. They’ve written papers about it.
I’ve also been reluctant to engage this thread, because it reminds me of an LL(1) recursive descent parser before left-factoring the grammar. But I can’t resist going into what my kids call “lecture mode” for a bit.
Andy seems to be hyperfocusing on one aspect of the parsing task, backtracking, while there’s little evidence that this is the main drawback to TextMate style highlighting. He seems to be proposing a highlighting scheme which is generally similar to TextMate but has faster low-level lexing based on actual regular languages. I fully grant that such a thing is possible, but find it an uninteresting point in the space of highlighting engines.
First, the Perl-style regexes are actually quite useful for handling a variety of important real-world lexing tasks. One of my favorite examples is Rust raw strings, which cannot be parsed by a regular language, but are easily expressed by backreferences. If you take these away, you’ll need to add another mechanism to restore the lost power.
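A sketch of that example with the fancy-regex crate (simplified: it ignores byte-string prefixes and the rest of Rust’s literal grammar): the backreference \1 is what ties the number of #s on both ends together, and that is exactly what a pure regular language cannot express.

```rust
use fancy_regex::Regex;

// Raw strings: `r##"..."##` must close with the same number of '#'s it
// opened with. The backreference `\1` expresses that; note that fancy-regex
// match methods return Result because backtracking can fail at runtime.
fn main() {
    let raw_string = Regex::new(r#"r(#*)"(?s:.*?)"\1"#).unwrap();

    let src = r####"let s = r##"she said "hi"# to me"##;"####;
    let m = raw_string.find(src).unwrap().expect("should find the raw string literal");
    assert_eq!(m.as_str(), r####"r##"she said "hi"# to me"##"####);
}
```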
But let’s take a step back. A good highlighting engine should fulfill (roughly) these goals:
Fast
Accurate
Incremental
Easy to write syntax definitions
Support embedded sublanguages
TextMate scores poorly on the first, meh on the second, and pretty good on the last three (the incrementality comes from line-at-a-time processing, which impacts accuracy). Andy has acknowledged that pure regular languages are inadequate, and proposes “lexer modes”, which I understand to be a stack. TextMate also has a stack, with explicit “push” and “pop” operations.
So let’s work from principles. Obviously we want to touch every byte of the input once, in a single pass. A regular language gets us most of the lexical structure, but we also need a stack to capture more of the grammar, and also handle embedded sublanguages. Can we invent a parsing scheme that somehow combines these two?
The answer is of course that there is a rich and deep literature on exactly this topic, we’ve just described an LR parser, and the invention was done by Knuth in 1965. Pure LR parsing is not powerful enough to handle real languages, but there’s been a lot of work on how to make it more powerful, including GLR. Even better, we know how to make such a parser incremental, thanks to Tim Wagner’s PhD and related work. Writing such a parser by hand would be tedious, but thankfully there is also a rich literature of parser generators (of which lex and yacc are classical, but there have been improvements since).
So now we’ve basically described tree-sitter, which is shipping in Atom and Github, and has been reimplemented as Lezer, which is shipping in CodeMirror. There are a few dozen grammars, covering the most popular languages (as opposed to hundreds for TextMate).
So. How would a “regular language” variant of TextMate stack up? I’ll leave that to the reader.
Obviously we want to touch every byte of the input once, in a single pass.
The answer is of course that there is a rich and deep literature on exactly this topic …
We’re in 100% agreement. I’ve watched all the treesitter talks (and I put a TODO on my list to write an R grammar using it, which I unfortunately never got around to).
My point is the same one that RSC made in his articles (which are awesome but I think too dense to sink into a general programming audience):
With respect to regexes, using good theory leads to good programs. https://swtch.com/~rsc/regexp/regexp1.html
And Treesitter is a great example of that. On the other hand, matching 20 Perl-style regexes in sequence at the same character position is a counterexample. That is my point, and I’m not sure why that would be controversial to state!
Lexer modes are another possible model to maintain linear time and get more power, but there are many possible models to drive them, e.g. another state machine or something more complex. Treesitter has some related lexing mechanisms apart from the GLR stuff to deal with the same problems: https://tree-sitter.github.io/tree-sitter/creating-parsers#lexical-analysis
Pretty much all such parser generators have grown some ad hoc lexing mechanisms (Flex lexical states, etc.) to solve the problems that kind of fall in the gap between theory and practice.
What I remember is that I was pointing out a computational complexity issue (scaling with respect to number of “simultaneous” patterns, not length of input) – and I observed it in practice on ripgrep: https://news.ycombinator.com/item?id=23665569
And I wasn’t saying that you should be doing something differently. My point was that Aho-Corasick can fall out for free in a regex engine, and that appears to hold up in practice, with some caveats I noted there (and in the original thread).
No. Their main wins were the development of several novel SIMD algorithms that have good scaling properties and are otherwise exceptionally clever. They certainly do not generate one giant DFA. They’ve written papers about it.
I went back and looked at the paper (https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf). They don’t make a single DFA, but they are absolutely solving the scaling with respect to patterns problem. That is their #1 claim in the abstract. The SIMD stuff is the #2 claim.
In this paper, we present Hyperscan, a high performance regex matching system that exploits regex decomposition as the first principle. Regex decomposition splits a regex pattern into a series of disjoint string and FA components
I am solving it here using the straightforward method of alternation (notice all the constant strings and regexes): https://www.oilshell.org/release/0.8.pre6/source-code.wwz/_devbuild/tmp/osh-lex.re2c.h
So they are doing better by taking advantage of the fact that some strings are constant. They are building on top of the “read every byte once” O(n) property. Getting the computational complexity right is table stakes. (in this case, scaling with respect to patterns, not length of input)
That applies directly to syntect. The author claims it’s slower than SublimeText’s engine because it doesn’t do the alternation efficiently (https://news.ycombinator.com/item?id=23665778):
the reason Sublime’s built-in highlighter is the fastest is they wrote a custom regex engine which basically does the equivalent of or-ing them together but unlike all other regex engines handles captures and information about which one matched properly while doing so. The custom engine doesn’t do backtracking
What I remember is that I was pointing out a computational complexity issue (scaling with respect to number of “simultaneous” patterns, not length of input) – and I observed it in practice on ripgrep:
We’ve been through this soooooo many times. Our last conversation on this was me trying over and over again to explain to you that RE2 and its ilk do not use DFAs. They use a combination of NFA simulation (“PikeVM”), bounded backtracking (“bitstate backtracker”) and a hybrid NFA/DFA (“lazy DFA” and sometimes confusingly just called “DFA”) that computes the DFA at match time. None of those techniques scale with the number of regexes. This is by design and this is why re2c’s match time performance does scale: it builds full DFAs. This is also a niche where Hyperscan excels, because it specifically targets the niche of building large pattern databases. The Hyperscan folks have benchmarked the scaling properties of Hyperscan vs RE2, and Hyperscan totally destroys RE2. This is an amazing engineering accomplishment, but neither the Hyperscan nor RE2 authors (I assume, I wasn’t) were surprised by this result.
re2c occupies a really important niche, and that’s the niche where you can not only afford to build mammoth-sized DFAs, but you can do it in an offline process. Neither of these things is true in the context where something like RE2 operates, and the dramatic effect this has on the implementation of the regex engine cannot be overstated.
Listen, I’ve written the regex engine. I don’t need to be told its computational properties. I know what they are. I wouldn’t normally hold this against anyone, but we’ve had conversations before, and it still feels like you’re trying to lecture me about this shit. Years ago, I set out with the explicit design decision that all matching should be linear in the size of the input when I built the regex crate.
This is why we are talking at cross purposes. This whole thing started out with me expressing a little bit of skepticism at a leap towards blaming computational complexity for the performance problems of syntect. That’s it. Just look at what I said:
Also, I’m pretty sure syntect has only just recently switched to using fancy-regex. It was using Oniguruma (Ruby’s regex engine) for a long time. I haven’t been paying especially close attention, but IIRC, fancy-regex didn’t really improve things much. Not sure what’s going on, but it sounds potentially interesting. I definitely wouldn’t leap to the “it’s exponential” explanation immediately.
There’s a lot of circumspection there. Instead of just saying, “Yeah good point, more investigation needed,” you’ve snowballed this into some crazy conversation that makes it look like I’m somehow denying the importance of computational properties. When in reality, the last several years of my life have been dedicated to a project that affirms precisely the opposite. Please do not mistake my rigid adherence to trade offs—all regex engines have them—for some kind of ignorance of basic facts. I mean, the fact that you’re linking rsc’s articles in this conversation with two people who have actually built regex engines that are direct descendants of RE2 is just… whacky. As if I haven’t read those articles dozens of times. As if anyone seriously involved in building regex engines hasn’t read those articles.
On the other hand, matching 20 Perl-style regexes in sequence at the same character position is a counterexample. That is my point, and I’m not sure why that would be controversial to state!
No, it isn’t. You’ve taken a small amount of skepticism on my part for a very specific problem diagnosis and blown it up into a general statement about computational complexity.
My point was that Aho-Corasick can fall out for free in a regex engine
It does not. We’ve been over this. Aho-Corasick is solving a more constrained problem and therefore has both a simpler and faster construction process. The only way “Aho-Corasick can fall out for free in a regex engine” is true is if your regex engine compiles full DFAs ahead of time. That’s true for re2c, but not true for any production grade general purpose regex engine that I know of.
I went back and looked at the paper. They don’t make a single DFA, but they are absolutely solving the scaling with respect to patterns problem. That is their #1 claim in the abstract. The SIMD stuff is the #2 claim.
Yes, I explicitly mentioned scaling. Hyperscan does not have the scaling properties of a single large DFA. There is a point at which a DFA compiled by re2c will be faster than Hyperscan. I don’t know what that point is or if it’s even practical to reach, but it almost certainly exists.
That applies directly to syntect. The author claims it’s slower than SublimeText’s engine because it doesn’t do the alternation efficiently
Yes, that’s one piece of the puzzle. But capture groups and the extent to which “fancy” features are being used inside the syntax highlighting regex definitions are some of the other pieces.
Listen, I love your work and I love your blog posts. I am positive we could have some really stellar conversations. I bet it would be much better if we were talking in person with a whiteboard, since I’m sure there is a lot being lost in this sort of online discourse. But please, in the future, please please try not to blow things way out of proportion. If you’ve found yourself in a place where I am somehow denying basic computational facts, then please back up and reconsider how you got there. It’s almost certainly not the case.
Yes we are talking at cross purposes. I’m making a very specific claim that should help users with regexes (based on 15+ years of using multiple engines in all sorts of applications), and you’re talking from the implementer’s perspective.
The claim is often ignored and shows up in practice.
I believe you’re making it more complicated than it needs to be for users. Computational complexity is the thing that matters for users of regexes regarding performance. I challenge someone to point out specific applications and use cases where that’s not the case – I gave a whole bunch of examples.
If you’re not naming those use cases, then you’re not getting my point.
Writing the regex engine doesn’t mean you’ve used it in a wide variety of applications. I explicitly said you know hundreds of things about regexes that I don’t – I’m pointing out the high-level bits for users.
And you are contradicting yourself, saying it does not “fall out”, except when it does fall out for re2c. I’m not saying ripgrep should scale like this – I’m pointing out a difference in theory that’s observed in practice.
I think you are trying to play “the expert” because you wrote it. But ironically users often have different insights into software than the author. That is why I write all these tests, benchmarks, and benchmarks against Oil (https://www.oilshell.org/release/0.8.pre6/). Simply writing the software doesn’t give me complete authority on it – it has to be observed from the outside, which tests and benchmarks accomplish.
Moreover, it has to be situated in applications. I specifically look at how users use the shell, e.g. Linux distros and autocompletion scripts. Observing it from the inside isn’t enough. It gives you a limited view of the software, even if you wrote every line of it.
Also, Egg expressions are explicitly designed to help users with this problem. They make syntax and semantics correspond (to the degree possible), and computational complexity is essentially the most important semantic aspect of regular expressions. Linear time and constant space is why I use them, and it improves real software (Oil’s parser vs. bash and zsh).
In other words, I’m not making a big deal out of it just for fun, or because I have a beef with you. It’s part of the Oil project and something I’ve been working on for many years.
I am not a moderator on lobste.rs, but if I were, I would have a word with you. Accusing burntsushi of “playing the expert” is hostile and aggressive, and completely unwarranted. We are gifted that he is willing to share his deep study and understanding with us, both as world-class libraries and in conversation.
Computational complexity is the thing that matters for users of regexes regarding performance.
Not only have I never disputed this, I’ve never said anything that even remotely comes close to disputing it. Like I said, you’re generalizing my comments to give it a lot more scope than what I had actually written. To be clear, I absolutely agree with this generalized advice to end users. That’s why I wrote a regex engine that descends from RE2.
I don’t really know what to say to the rest of your comment. But if your only point is that “understanding computational complexity is really important for end users writing regexes,” then I’m going to say, “yeah, so? What in the heck does that have to do with what I said?”
I don’t know what it is I’ve said that has caused you to dig into this rabbit hole. I really don’t. I just re-read my comments and I just cannot understand how we got here.
And I don’t mean to “play the expert” as in “respect my authority.” I brought up my credentials as if to say, “no, actually, you don’t need to explain the basic computational properties of regexes to me.”
Yeah I think we’re disagreeing on the intuition. And while I respect your unique viewpoint as the implementer of an engine (and acknowledge I learned many things from the discussion), I disagree with a lot of the intuition you laid out.
At least 3 times in this discussion when I go back to the sources, my intuition has panned out. I may have said some things that are incorrect around the edges, but they turned out to be more correct than the nitpicking.
You and others seem to be eager to nitpick when I offer general advice. The advice is aimed at casual users of regexes who may get the false impression that they are slow. My point is that regexes are applicable to more situations than most people think (including syntax highlighters, used appropriately).
Examples:
I suspected backtracking was the issue. A few people rejected it. When I asked for clarification (and I was open to new information) it was stated that the reason Sublime’s engine is faster than the syntect one is that it doesn’t backtrack outside the engine. That’s a different form of backtracking, so sure you can jump on it, but I already acknowledged the difference. I think we both acknowledge the same possibilities but our intuition is different, and you seem to think yours is more correct.
When I say “fast” capture matching, I meant linear time. Again you’re eager to say “no you can’t do that”. I’m still interested to hear the applications where linear time matching is too slow. I never saw a text editor where anyone complains about editing small files. Even the worst web-based editor can edit a small file. It’s always complaining about editing big files, which indicates a scaling problem. [1]
I was technically wrong about Hyperscan building a DFA, and you’re eager to jump on that. But when I went back to the source the #1 claim is simultaneously matching patterns and constant strings, so again I think my intuition is more correct than the nitpicking. It’s similar to stuff I’ve done before, but I didn’t go look at the details.
I didn’t go back and read the old lobste.rs conversation, but I’m pretty sure you didn’t think the scaling I showed with re2c was actually implemented anywhere. In other words I had to go do it to convince you.
So basically we’re just thinking about it from different perspectives. I think your perspective is valuable but your intuitions don’t line up with mine, and that doesn’t mean I’m wrong.
[1] To give another example of this, I recall Rob Pike complaining 10+ years ago that GNU grep is 10x too slow when LANG!=C. It was a longstanding performance bug in one of the most deployed programs in the world. But basically nobody cares, and the bug persisted because it’s still linear time. I’m still interested in applications where people care about a 10x constant factor. I’m sure some exist but it’s not something I’ve hit in 15 years of using many engines.
Most software is slow, and regexes are fast. I’m thinking about it from a systems perspective, where you get 10x constant slowdowns all over the place, in multiple components.
It’s similar in spirit to what I wrote here: https://old.reddit.com/r/ProgrammingLanguages/comments/gyf1l6/is_implementing_a_bytecode_interpreter_in_an/fta56fh/
Using Python is an order of magnitude constant factor slower. It matters in some systems, and doesn’t matter in others. The nitpicking there, which is missing the point, is reminiscent of this thread!
In both cases, I’m offering people rules of thumb for performance, based on experience. You can always nitpick a rule of thumb. It doesn’t mean it’s useful to do so, or even directionally correct. I’m happy to acknowledge inaccuracies as I have in every case.
When I say “fast” capture matching, I meant linear time. Again you’re eager to say “no you can’t do that”.
Ug. No I didn’t. I didn’t nitpick any rules of thumb. I was commenting on one specific instance.
I am specifically going to avoid talking to you about regexes from now on. Conversing with you has been by far the most unpleasant part of my day. I’m not going to repeat that mistake.
No, I meant that a lot of the algorithms fundamentally add time-cost to the other operations, too. In retrospect it sounds incorrect, I wish I could remember where I read it.
It depends on the implementation. In the case of fancy-regex, it is a true hybrid. If there are no fancy features, then it defers to a fully finite automaton based regex engine. In the case of other regex engines, it depends. For a simple alternation of literals, any regex engine could pretty easily just use Aho-Corasick to avoid the much slower backtracking implementation. As things get more complex, if you aren’t a true hybrid, then yes, you will definitely pay for it.
There is give and take here. I hope to some day have a reasonably comprehensive regex benchmark where I could just point you to examples and make this conversation much simpler. :)
Russ Cox’s series of articles maybe?
I really wish it was more common to write retrospectives when you move on from a project. It would massively simplify the “is this project still going or has everyone moved on” problem we often see in community software.
I remember Xi being discussed here maybe 2 years ago, and the author was even here on Lobsters. I believe it was a YouTube video demonstrating and explaining the technologies used by the project, and one of the claims was “it was an editor for the next 20 years”. And while the author answered a lot of (technical) questions asked elsewhere, he for whatever reason avoided my question of “why the next 20 years” and what would make it interesting from a non-technical perspective. Sure speed is cool, but considering that VSCode is currently gobbling up the market and it’s based on a web browser, I think people look for more than “just” good data structures – after all, if all goes well that’s the part nobody notices.
I touched on this a bit in the post. At the time I was preparing for that talk, my idea was that we would systematically work out protocols and interfaces to coordinate the modules so the result would be an excellent experience. Between the time I pitched the talk and when I gave it, my focus turned to performance issues in the underlying UI toolkit, and my thinking on the viability of those async-heavy interfaces also evolved, pretty much as I described in the post.
i am genuinely wondering if concurrent modification is really, truly needed? for reference feel free to look at the crdt-retrospective that is linked in this article.
So “needed” is really a function of your requirements. If you’re trying to build Google Docs, then obviously yes. If you’re doing simple editing of plain text files, then obviously no.
The interesting case, I think, is language servers (and perhaps other similar services). They’re slow enough that blocking synchronously on them would be a horrible experience. Thus, you’re in a situation where there’s a concurrent edit by the user and an “edit” of sorts (basically adding annotations) by the language server. If you don’t pay attention to concurrency issues, then you’ll have all kinds of issues around annotating incorrect regions in the text. I do believe this is a problem today and that it’s possible to do better.
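A deliberately simplified sketch of the bookkeeping this implies (not xi’s CRDT machinery; the types are invented): language-server spans are computed against an older revision of the text, so before applying them they are transformed through any user edits made in the meantime, or dropped when they can no longer be mapped.

```rust
// Spans are half-open byte ranges against the revision the server saw.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

/// A single user edit: `len` bytes inserted at byte offset `at`.
struct Insert {
    at: usize,
    len: usize,
}

fn transform(span: Span, edit: &Insert) -> Option<Span> {
    if edit.at <= span.start {
        // Edit is entirely before the span: shift it right.
        Some(Span { start: span.start + edit.len, end: span.end + edit.len })
    } else if edit.at >= span.end {
        // Edit is after the span: nothing to do.
        Some(span)
    } else {
        // Edit lands inside the span: the annotation is stale. Dropping it
        // and waiting for the server's next response is one reasonable policy.
        None
    }
}

fn main() {
    // The server flagged bytes 10..15 of the old text; the user then typed
    // 3 bytes at offset 4, so the diagnostic now belongs at 13..18.
    let diag = Span { start: 10, end: 15 };
    assert_eq!(transform(diag, &Insert { at: 4, len: 3 }), Some(Span { start: 13, end: 18 }));

    // An edit inside the flagged region invalidates the annotation.
    assert_eq!(transform(diag, &Insert { at: 12, len: 1 }), None);
}
```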
The question, then, is what kind of concurrency mechanism. What xi explored was the idea of building a completely general mechanism based on CRDT, which has a beautiful mathematical theory behind it. This did not work well for two reasons. First, while CRDT handles some use cases well (syntax highlighting), for others it is a poor fit. It might be possible to figure out how to express auto-indentation (and other similar features like “soft spans”) robustly in the CRDT framework, but we didn’t figure out how to do that.
Second, given what I now know of the OT and CRDT space, it is clear that a general mechanism is basically a dream, not reality. When you have a server of some kind doing automated edits, you want to treat those differently from edits from a human user. In particular, in the case of conflict (and conflict is inevitable), you generally want to discard the automated edits.
Coming back to the specific question of auto-indentation. The original plan was to use TextMate syntax definitions. These can sometimes be slow (this is basically a tradeoff for having access to hundreds of definitions), so the goal was to not interrupt typing, perhaps fixing up indentation with a small delay. But in retrospect, and looking at the overall system design, it seems like a better choice is to use an incremental parser (like tree-sitter or Lezer) that is so fast that it is possible to apply its auto-indentation results synchronously without degrading the interactive experience.