Funnily, the typst author’s masters thesis (which I assume was set in typst) has an image instead of text at page 12. I wonder if they hid a problem that required this workaround.
That was just Acrobat being weird when converting to PDF/A.
I like it, but going to need some progress on accessibility before I seriously adopt it.
That was an enlightening thread to read through. As you aren’t taking Typst seriously, would you mind if you ask what you are currently using? I’d assumed that you were going to say LaTeX, but the comments made it clear that LaTeX is just as bad, if not worse.
As mentioned elsewhere in the thread, it is possible to tag LaTeX documents.
But Typst has the opportunity to go beyond merely possible with enough effort. :)
🤦 I missed that part. That’s good to know about and something I’ll ensure is included in any future documents I produce.
I also share your hope that Typst takes advantage of the opportunity that they have for better accessibility.
We definitely plan to make Typst documents accessible in the future!
That’s really good to hear!
It might be possible to work around the lack of native support for this feature for now: Typst allows for arbitrary metadata (I have used it for creating notes for a presentation) that can later be queried into a JSON-like format. Maybe adding some structured metadata to your Typst file, querying the relevant info, compiling the PDF, and adding the metadata to the PDF with some external tool could be made into a shell script/tiny wrapper program.
Just an idea, of course native support would be best in the long run.
Is Typst able to match TeX’s quality for hyphenation and line length balancing yet? Every document I’ve seen so far looks worse than even MS Word in terms of line splitting.
Seems like it can
https://typst.app/docs/tutorial/advanced-styling/
Look at the images in the link. For example this one, it’s making hilariously bad line-breaking decisions.
For example, it decides to break “animo” into “an- imo”. Keeping the word together but shifting it to the line below would barely have an effect on the first line, but would significantly improve readability.
And it’s doing that in every single typst example I’ve seen so far.
I think that’s a decent decision, since moving the “an” to the next line would cramp it and cause the “permagna” to be split. There is enough space in the line after to move a few characters, but I think breaking “an- imo” is better than “permag- na”.
Of course, I’m no expert, and those are just my two cents.
Regardless of the decision to break it up, it should be “a-ni-mo”, not “an-imo”.
Typst uses the same hyphenation patterns TeX does. In the example, it is most likely hyphenating Latin with rules for English. Which isn’t great, but setting the language to Latin for this example also isn’t helpful in a tutorial.
I’m not disagreeing, just wondering what rule should be invoked when hyphenating words (I assume in English, even if the example text is pseudo-Latin). Is it that the second part of the hyphenated word should start with a consonant?
For extra fun, English and the fork spoken on the other side of the pond have completely different hyphenation rules. In English, hyphenation decisions are based on root and stem words, in the US version they are based on syllables.
“Two countries separated by a common language.”
I’m curious about what LaTeX is doing to get better line-breaking decisions, because that isn’t something I noticed before you pointed it out. Is it a fundamental algorithmic choice related to why LaTeX is multi-pass?
TeX hyphenation works as a combination of two things. The line breaking uses a dynamic programming approach that looks at all possible break points (word boundaries, hyphenation points) and assigns a badness value for breaking lines at any combination of these and minimises it (the dynamic programming approach throws away the vast majority of the possible search space here). Break points each contribute to badness (breaking between words is fine, breaking at a hyphenation point is worse, I think breaking at the end of a sentence is better but it’s 20 years since I last tried to reimplement TeX’s layout model). Hyphenation points are one of the inputs here.
The way that it identifies the hyphenation points is particularly neat (and ML researchers recently rediscovered this family of algorithms). They build short Markov chains from a large corpus of correctly-hyphenated text that give you the probability of a hyphenation point being in a particular place. They then encode exceptions. I think, for US English, the exception list was around 70 words. You can also manually add exceptions for new words. The really nice thing here is that it’s language agnostic. As long as you have a corpus of valid words, you can generate a very dense data structure that lets you hyphenate any known word correctly and hyphenate unknown words with high probability.
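To make the pattern idea concrete, here is a toy Rust sketch of Liang-style hyphenation patterns, the scheme TeX’s pattern files use (and which, per the comments above, Typst reuses). The parsing and lookup are real enough to run, but the pattern set you would feed it is invented for illustration; actual language files contain thousands of machine-generated patterns plus the exception list.

```rust
/// One pattern: "hy3ph" becomes letters = ['h','y','p','h'] and
/// vals = [0, 0, 3, 0, 0] (one value per inter-letter gap, ends included).
struct Pattern {
    letters: Vec<char>,
    vals: Vec<u8>,
}

fn parse_pattern(p: &str) -> Pattern {
    let mut letters = Vec::new();
    let mut vals = vec![0u8];
    for c in p.chars() {
        if let Some(d) = c.to_digit(10) {
            *vals.last_mut().unwrap() = d as u8;
        } else {
            letters.push(c);
            vals.push(0);
        }
    }
    Pattern { letters, vals }
}

/// Indices after which a hyphen may be inserted ("hy-phen" -> index 2).
fn hyphenation_points(word: &str, patterns: &[Pattern]) -> Vec<usize> {
    // '.' marks the word boundaries, as in TeX's pattern files.
    let w: Vec<char> = format!(".{}.", word.to_lowercase()).chars().collect();
    let mut gaps = vec![0u8; w.len() + 1]; // gaps[i] = value of the gap before w[i]
    for start in 0..w.len() {
        for pat in patterns {
            if w[start..].starts_with(&pat.letters) {
                for (k, &v) in pat.vals.iter().enumerate() {
                    gaps[start + k] = gaps[start + k].max(v); // max over all matches wins
                }
            }
        }
    }
    let len = word.chars().count();
    // Odd value => break allowed; keep at least 2 letters before and 3 after
    // the hyphen (TeX's default \lefthyphenmin / \righthyphenmin).
    (2..=len.saturating_sub(3))
        .filter(|&j| gaps[j + 1] % 2 == 1)
        .collect()
}
```

The odd/even trick is the clever bit: patterns can both encourage and veto break points, and the maximum value at each gap decides.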
All those cryptic warnings about badness 10000 finally mean something.
“underfull hbox badness 10000” haunts my nightmares.
Yup, there’s a configurable limit for this. If, after running the dynamic programming algorithm, the minimum badness that it’s found for a paragraph (or any box) is above the configured threshold, it reports a warning. You can also add \sloppy to allow it to accept a higher badness to avoid writing over the margin. If you look at how this is defined, it’s mostly just tweaking the threshold badness values.
I think TeX also tries to avoid rivers, right?
Yup, there are a bunch of things that contribute to badness. The algorithm is pretty general.
It’s also very simple. Many years ago, I had a student implement it for code formatting. You could add penalties for breaking in the middle of a parenthetical clause, for breaking before or after a binary operator, and so on. It produced much better output than clang-format.
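For anyone curious about the shape of that dynamic program, here is a deliberately tiny Rust sketch. It only knows about word boundaries (no hyphenation points, no distinct penalties) and uses a made-up cost (cube of the leftover space per line), but it shows the key difference from a greedy breaker: the cost is minimised over the whole paragraph, not line by line.

```rust
// Toy paragraph breaker with a made-up cost model, not TeX's badness formula.
fn break_paragraph(word_widths: &[f64], space: f64, line_width: f64) -> Vec<usize> {
    let n = word_widths.len();
    // Cost of putting words i..j on one line.
    let line_cost = |i: usize, j: usize| -> f64 {
        let w: f64 = word_widths[i..j].iter().sum::<f64>() + space * (j - i - 1) as f64;
        if w > line_width {
            f64::INFINITY // overfull: never acceptable in this toy model
        } else if j == n {
            0.0 // last line: slack does not matter
        } else {
            (line_width - w).powi(3) // "badness" grows quickly with slack
        }
    };
    // best[i] = minimal total cost of typesetting words i.. to the end,
    // brk[i] = where the line starting at word i should end.
    let mut best = vec![f64::INFINITY; n + 1];
    let mut brk = vec![n; n + 1];
    best[n] = 0.0;
    for i in (0..n).rev() {
        for j in i + 1..=n {
            let c = line_cost(i, j);
            if c.is_finite() && c + best[j] < best[i] {
                best[i] = c + best[j];
                brk[i] = j;
            }
        }
    }
    // Recover the indices at which new lines start.
    let mut breaks = Vec::new();
    let mut i = 0;
    while i < n {
        i = brk[i];
        if i < n {
            breaks.push(i);
        }
    }
    breaks
}
```

Adding hyphenation points or extra penalties (binary operators, parenthetical clauses, …) just means adding more candidate break positions and more terms to the cost.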
Huh, it’s surprising to me that you still need an exception list. Can you fix your corpus instead so it has a bunch of examples for the exceptions?
Some words, if added to the corpus, would still get hyphenated wrongly, but their influence on the corpus would actually decrease hyphenation accuracy for all other words as well.
This mostly applies to loan words as they tend to follow different hyphenation rules than the rest of the corpus.
The corpus contains the exceptions (that’s how you know that they’re there). The compressed representation is a fixed size, independent of the size of the corpus and so will always have some exceptions (unless the source language is incredibly regular in its hyphenation rules). A lot of outliers also work because they manage to hit the highest-probability breaking points and are wrong only below the threshold value.
That’s exactly the reason why it has to be multi-pass, why it’s so slow and part of why TeX was created in the first place.
TeX ranks each possible line break and hyphenation position and tries to get the best score across an entire paragraph or even across an entire document if page breaks are involved, in contrast to MS Word which tries to get the best score for any two adjacent lines or Typst which just breaks and hyphenates whenever the line length is exceeded.
It’s worth noting that ‘slow’ means ‘it takes tens of milliseconds to typeset a whole page on a modern computer’. Most of the slowness of LaTeX comes from interpreting complex packages, which are written in a language that is barely an abstraction over a Turing machine. SILE implements the same typesetting logic in Lua and is much faster. It also implements the dynamic programming approach for paragraph placement. This was described in the TeX papers but not implemented because a large book would need as much as a megabyte of RAM to hold all of the state and that was infeasible.
SILE: https://sile-typesetter.org/what-is-sile/ if anyone was wondering.
This reminds me, I never understood why typst got so much attention while SILE seems ignored. Wouldn’t SILE be an equally good replacement for the OP?
Simon has not done a great job at building a community, unfortunately. I’m not sure why - he’s done a lot to change things for other people’s requirements but that hasn’t led to much of a SILE community. In part, he didn’t write much documentation on the internals until very recently, which made it hard to embed in other things (I’d love to implement an NSTypesetter subclass delegating to SILE. The relevant hooks were there, but not documented). This has improved a bit.
Without a community, it suffers from the ecosystem problem. It looks like it’s recently grown an equivalent of TeX’s math mode and BibTeX support, but there’s no equivalent of pgfplots, TikZ, and so on.
I don’t know that much about SILE, but Typst seems to be tackling a different issue that TeX has - awful convoluted syntax.
SILE somewhat gets around this, to be fair - it allows for XML input, which is fairly versatile! But SILE seems more oriented toward typesetting already finished works, while Typst seems to be aiming for the whole stack, even if it has less versatile typesetting.
Different focuses, I guess, though I know Typst wants to improve its typesetting quality.
I’m not familiar with either SILE or Typst, but maybe the input format is better in Typst for OP?
It is not true that Typst just hyphenates whenever the line length is exceeded. When justification is enabled, it uses the same algorithms as TeX both for hyphenation and for line breaking. It’s true that hyphenation isn’t yet super great, but not because of the fundamental algorithm. It’s more minor things like selecting the best hyphenation cost, and then there are some other minor things like river prevention that aren’t implemented at the moment. I agree that the hyphenation in the linked example isn’t that great. I think part of the problem is that the text language is set to English, but the text is in Latin.
I have a question about the point on highlighting in particular. With a subscription-based model, don’t you lose out on the ability to partially highlight a file based on the visible ranges in the editor? Unless you re-subscribe to the highlighting whenever the visible ranges change, the language server wouldn’t know which slice(s) of the document to highlight after a change comes in.
This was much more interesting and convincing to me than the other article, thank you.
Anyone else interested in how LaTeX handles sentence spacing might like to check out this link. LaTeX varies the stretch (“glue”) permitted for a space based on the preceding character and has some simple (often wrong) heuristics to decide whether a period ends a sentence or not. Personally, I like the idea that two spaces or a line break after a period indicates a sentence-ending period. That would be easy for writers to use without requiring the kinda ugly macros that LaTeX has for explicitly defining what the space after a period should be.
Another option would be to use smarter heuristics, but of course they would not be infallible.
The more recent typst system appears to use a simpler algorithm with constant stretch after “words” which, now that it has been pointed out to me, I perceive as worse. Thanks typographers.
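For what it’s worth, the heuristic being described can be sketched in a few lines of Rust. This is only the rough shape of TeX’s space-factor rule (the classification and the stretch values here are illustrative, not TeX’s actual numbers): a period is treated as sentence-ending unless it directly follows a capital letter, which is assumed, often wrongly, to be an abbreviation or an initial.

```rust
#[derive(Debug, PartialEq)]
enum SpaceKind {
    Word,     // ordinary inter-word space
    Sentence, // gets extra stretch when justifying
}

fn classify_space(text_before_space: &str) -> SpaceKind {
    let mut chars = text_before_space.chars().rev();
    match (chars.next(), chars.next()) {
        // A period right after a capital ("Prof. Knuth", "D. E. Knuth") is
        // assumed, often wrongly, to be an abbreviation, not a sentence end.
        (Some('.'), Some(prev)) if prev.is_uppercase() => SpaceKind::Word,
        (Some('.' | '!' | '?'), _) => SpaceKind::Sentence,
        _ => SpaceKind::Word,
    }
}

/// How much a space of this kind may stretch (values are made up).
fn stretch_for(kind: &SpaceKind, base_space: f64) -> f64 {
    match kind {
        SpaceKind::Word => base_space * 0.5,
        SpaceKind::Sentence => base_space * 1.5,
    }
}
```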
Shame that browsers justify en-spaces and em-spaces incorrectly. Looking at the CSS standard, this is a bug because non-collapsible whitespace characters should “hang”: ref
This is the #1 reason I’m not using Typst…
I love everything about it, but (for me) the thing that makes TeX documents look so good, and that distinguishes them from those made in something like Word or Google Docs, is the great spacing - not just between words and sentences, but also between paragraphs.
You can see this in action if you use something like TeXmacs, since it’ll auto-reflow text as you type and uses the same spacing algorithms as TeX. You’ll notice that it justifies spacing, not just between words, but between paragraphs. This leads to all pages having essentially the same height of text, no hanging headers, and minimal page breaks inside paragraphs. IMO it just looks objectively better, and once I noticed it I couldn’t get myself to use Typst, because for me the output just looks worse…
It’s a shame, because I really like everything else about Typst. Maybe one day they’ll implement it, but given how I’ve seen zero mention of this in their issue tracker, it seems I’m the only one who actually cares about this. Oh well.
We’re aware of the subpar spacing compared to TeX. Layout engine improvements including but not limited to spacing are planned, but we’re a small team and some other things have higher priority right now. Feel free to open an issue about this!
This is great news! I really do love everything else about Typst, sorry if I came off a bit harsh. Looking forward to it! Might open an issue so it can be used for tracking then :)
Nice.
Another way of attacking this problem might be to use hashes of the inputs, and propagate those down through computations lazily. You could design it somewhat like git, meaning (a) we use a long enough hash that we ignore the possibility of hash collision and (b) retain “stale” computations, at least for a while, so that when a user presses undo (or changes between git branches) the computations are already stored and ready to go.
Of course the compute graph needs to be “sparse” enough that tracking all these hashes isn’t way more work than performing the computations (i.e. the functions need to be complex enough).
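A rough Rust sketch of that idea, with placeholder types: results are keyed by a hash of their inputs and stale entries are never evicted eagerly, so an undo or branch switch finds its old results still in the map. DefaultHasher stands in for whatever long, stable content hash (à la git) you would actually use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

struct HashCache<V> {
    // Never evicted eagerly: "stale" results linger so that undo or a
    // branch switch can find them again.
    entries: HashMap<u64, V>,
}

impl<V: Clone> HashCache<V> {
    fn new() -> Self {
        HashCache { entries: HashMap::new() }
    }

    /// Return the cached result for these inputs, computing it on a miss.
    fn get_or_compute<I: Hash>(&mut self, inputs: &I, compute: impl FnOnce() -> V) -> V {
        let key = hash_of(inputs);
        self.entries.entry(key).or_insert_with(compute).clone()
    }
}
```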
I actually never grokked how you are meant to elide the e.g. position information and still produce great error messages downstream. Does anyone have any tips on this? Including a “link” backwards (from the IR to the AST, from the AST to the source file or CST, etc) would not allow for “cutoff”. How do you efficiently query “where did this thing come from” while maximising the “cutoff shields”?
(Aside: given the title of the post I first wondered if it was going to talk about (b), but the [durable, standard, volatile] version vector does seem like a nice domain-specific optimization!).
The behavior you want here is, when a user adds a space, for the error messages to be recomputed, but all the intermediate analysis steps to be re-used. That is, you want to store somewhat abstract “this expression has a type error” internally, and convert it to “expression on line 92 has a type error” near the edges of the system.
The way to do this is by splitting the information early, and joining it back later. For line numbers, it could conceptually work like this:
parse query returns the AST with positions, so it needs to be recomputed every time
parse_positionless calls parse to get the AST with positions, copies it, and sets all positions to 0. This enables early cutoff. Note that you can still identify AST nodes by saying “nth AST node in the file”
parse_positions calls parse to get the AST with positions, and then extracts just the positions by traversing the AST and collecting positions in a list, such that the nth element is the position of the nth node.
together, these two queries effectively split parse into two components, one of which is stable and one of which is not.
the bulk of the compiler calls only positionless parse, and operates with AST indexes internally
error rendering code uses positions to convert AST indexes back into line numbers.
For a real world example, take a look at Body, BodySourceMap, body_with_source_map_query and body_query here:
https://github.com/rust-lang/rust-analyzer/blob/b64e5b3919b24bc784f36248e6e1f921ee7bb71b/crates/hir-def/src/body.rs#L117
One alternative approach is to store relative offsets in the AST. E.g., stuff inside a function stores offsets relative to the function itself. This way, you avoid the splitting, and this approach is more easily adoptable by an initially non-incremental code base.
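And a corresponding sketch of the relative-offset alternative, again with made-up types: each node stores its offset relative to its parent, so an edit inside one function changes the absolute positions of later functions without touching any stored value outside the edited region.

```rust
struct RelNode {
    kind: String,
    rel_offset: u32, // offset from the start of the parent node
    children: Vec<RelNode>,
}

/// Absolute positions are recomputed only when needed (e.g. for a
/// diagnostic), by summing the offsets along the path from the root.
fn absolute_offset(path_from_root: &[&RelNode]) -> u32 {
    path_from_root.iter().map(|n| n.rel_offset).sum()
}
```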
EDIT: also, tracing the path of a single diagnostic through rust-analyzer could help:
https://github.com/search?q=repo%3Arust-lang%2Frust-analyzer%20UnreachableLabel&type=code
Thank you, that’s quite helpful.
Don’t the AST indices still change across the whole file if a new node appears in the middle of the file? If yes, wouldn’t that hurt the cache hit ratio?
This depends on how you assign IDs, there are various tricks to make them more stable!
In rust-analyzer, we actually don’t use AST ids. Rather, we create stable IDs when we create semantic symbols. So a “function” in a file gets an ID, but that’s not an ID of the corresponding AST node. This ID changes only if the set of functions in the file changes.
If you have a tree-shaped thing, you can assign IDs in BFS, rather than DFS, order. E.g., the ID of each function is smaller than the ID of every expression. As edits usually happen deep in the tree, top-level IDs stay the same (that is, editing the body of the first function in the file doesn’t shift the IDs of subsequent functions).
You often want relative IDs: the nth expression in function X.
The most stable form of ID is a triple (ParentId, Name, Disambiguator). E.g., a function is identified by its parent module, its name, and (if the module contains several identically-named functions) its position in the list of name-colliding functions. This representation has a problem: ParentId is logically also a triple! And the grandparent ID as well! To break this cycle, you want an ID interner that turns a triple into a u32, with the constraint that identical triples give the same u32.
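A minimal Rust sketch of such an interner, with illustrative types (not rust-analyzer’s actual ones): identical triples always map to the same u32, and the parent field of the key is itself an already-interned ID, which is what breaks the “a parent is also a triple” recursion.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash)]
struct ItemKey {
    parent: u32,        // the parent's already-interned ID (0 = crate root)
    name: String,
    disambiguator: u32, // position among identically-named siblings
}

#[derive(Default)]
struct Interner {
    map: HashMap<ItemKey, u32>,
    keys: Vec<ItemKey>,
}

impl Interner {
    /// Identical triples always yield the same u32; new triples get the next
    /// free one.
    fn intern(&mut self, key: ItemKey) -> u32 {
        if let Some(&id) = self.map.get(&key) {
            return id;
        }
        let id = self.keys.len() as u32 + 1; // 0 is reserved for the root
        self.keys.push(key.clone());
        self.map.insert(key, id);
        id
    }

    fn lookup(&self, id: u32) -> &ItemKey {
        &self.keys[(id - 1) as usize]
    }
}
```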
Okay, I see, it was just simplified for explanation. Then, it is not that different from what we do in Typst after all (see my other comment).
, with the constraint that identical triples give the same u32.Okay, I see, it was just simplified for explanation. Then, it is not that different from what we do in Typst after all (see my other comment).
I think it’s quite different actually! At least, I learned something new from your comment! The way I understand it, the crux of your approach is that we assign IDs “somehow” the first time around, and after that, we explicitly look at the before/after picture, and try to manually come up with a mapping which preserves most IDs. And we also use a cute trick, allocating IDs with “gaps”, so that “binary search” works.
This I think is actually fundamentally different from salsa, as it doesn’t really expose the “old” version to you, so you have to assign IDs “from scratch” every time, and just make sure that they align. So in the end you tend to use “paths” as IDs (that is, “first if in function foo in module bar in module baz in crate quux”). But I think salsa’s interning here captures that aspect of making an “arbitrary” decision and then sticking to it (but there’s nothing comparable to evenly spreading the IDs; salsa just auto-increments).
This is a bit tangential, but one thing I wonder: You say that you use paths as keys and turn them into IDs through interning. Since these IDs are used all over the place, I assume that they are long-lived and end up in the query cache. Do you ever clear the interner or do you just continue “leaking memory”?
I ask because I’ve implemented a similar thing for Typst recently and ended up going for the “just leak memory” strategy (IDs are u16, so we can’t leak all that much). Basically, a Span in Typst is 64 bits: 48 bits are the stable span number and 16 bits are a FileId, which is basically a (Option<PackageSpec>, PathBuf) pair identifying a file in a package. Because spans can end up in the cache, I can only clear the interner if also clearing the cache. Is something similar going on in rust-analyzer/salsa or how do you deal with this?
We just leak things. Moreover, salsa doesn’t really have a garbage collector (it had one at one point, but it was slow, so we disabled it). Works surprisingly fine! This is actually expected: most of the code is in dependencies, and even if you type non-stop in a single editing session, you cannot create more code than there already is. But of course this needs a better solution long term…
To give one alternative approach, here’s how we do it in Typst:
A Source file is a representation of a parsed file.
Each syntax node in a source file has a span ID.
Span IDs are ordered in the file such that:
A node’s ID is always larger than any ID in the subtree of a previous sibling.
A node’s ID is always smaller than the IDs of all of its children and following siblings.
Span IDs are spread out as evenly as possible across the ID space (0..2^48).
Thanks to this ordering, we can locate a node and determine its byte offset quickly when given a span ID (to render the error message).
Source files are long-lived: When an edit comes in, we do incremental reparsing and only update the span IDs of the node that we replaced. If there are lots of edits in one area and we run out of IDs (because the IDs aren’t evenly distributed anymore), we renumber a larger and larger area around it until it works.
With this setup, span numbers only really change around the edit, even if in the middle of the file. Currently, we depend on incremental parsing to work well. But this can also work without incremental parsing: By diffing the span numbers in the old syntax tree and new one (after an edit), we can assign them as parallel as possible to change the numbers of as few nodes as possible. This is not yet implemented in Typst because the approach based on incremental parsing works well enough, but it probably will be at some point to make it even more stable.
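A sketch of what such an even spread can look like, with made-up types (this is not Typst’s implementation): IDs are assigned in pre-order, each subtree gets a proportional slice of the parent’s range, and the leftover gaps are what later edits consume before a renumbering is needed.

```rust
struct SyntaxNode {
    span: u64,
    children: Vec<SyntaxNode>,
}

fn subtree_size(node: &SyntaxNode) -> u64 {
    1 + node.children.iter().map(subtree_size).sum::<u64>()
}

/// Assign IDs from the half-open range lo..hi to `node` and its subtree,
/// in pre-order (parent < descendants < following siblings), spreading
/// them evenly so that later edits can take IDs from the gaps.
fn assign_spans(node: &mut SyntaxNode, lo: u64, hi: u64) {
    let total = subtree_size(node);
    let step = (hi - lo) / total; // evenly sized slice per node in the subtree
    node.span = lo;               // the parent gets the smallest ID
    let mut cursor = lo + step;   // the gap between lo and cursor stays unused
    for child in &mut node.children {
        let slice = step * subtree_size(child);
        assign_spans(child, cursor, cursor + slice);
        cursor += slice;
    }
    // If step becomes 0 (too many nodes crammed into too small a range),
    // a real implementation would renumber a larger enclosing range instead.
}
```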
I should note that Typst uses comemo instead of salsa, but this should work just as well with salsa.
Presumably, you still store it; just don’t consider changes to it semantically significant.
Very cool! I’m so sad that I don’t have a reason to use LaTeX or Typst right now. I would really like to try Typst out for a serious project.
How does the package manager resolve version conflicts if two packages each require a different version of a third package?
Then it simply loads both versions of the third package. A package can be used in arbitrarily many versions by the same project.
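A sketch of the data model that implies, with made-up types: loaded packages are keyed by name and version together, so two dependents asking for different versions simply resolve to different entries.

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash)]
struct PackageSpec {
    name: String,
    version: (u32, u32, u32),
}

struct Package; // parsed sources, exported symbols, ...

struct PackageStore {
    // Keyed by name *and* version: two dependents that ask for different
    // versions of the same package get different entries.
    loaded: HashMap<PackageSpec, Package>,
}

impl PackageStore {
    fn get(&mut self, spec: &PackageSpec) -> &Package {
        self.loaded
            .entry(spec.clone())
            .or_insert_with(|| load_from_disk(spec))
    }
}

fn load_from_disk(_spec: &PackageSpec) -> Package {
    Package
}
```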
Am I understanding correctly that the whole GUI is closed source, only the CLI is open source? I’m not criticizing, I’m asking :)
Hi, Typst dev here. You understand correctly. The GUI is closed source and the compiler is open source. Some of the web app’s IDE-like features (e.g. autocomplete) are part of the open source library though. Right now, everything is completely free, but at some point in the future the web app will have paid features.
Why are all of their communications options closed source? GitHub + Discord? smh
Nice tutorial, it’s great to see all these ideas summarized! I implemented pretty much exactly this parser and syntax tree design for Typst, after studying rust-analyzer’s implementation. It works great.
This is an exciting project, and one that I think has an immense amount of potential. I read through sections 3, 4 and 5, and will take a look at the rest at some point, but overall the language and compiler seem feature-full given how new this is! It already supports a good portion of what many people use LaTeX for (text, alignment, equations and code).
I haven’t played around with it yet, but the language looks well designed and you can definitely see the Rust influence. I think the biggest hurdle to replacing LaTeX is going to be library support, and the emphasis on making a real language vs. macros will hopefully help jumpstart things. I haven’t seen what the errors look like, but they can’t be worse than LaTeX’s.
I hope this project takes off; a user-friendly alternative to LaTeX with the same power is needed. It will definitely take a long time to get to feature parity with LaTeX, but this seems like a fantastic start!
Hear, hear! My only concern is this wording in the about page:
“We will publish Typst’s compiler source code as soon as our beta phase starts. From then on, it will form the Open Core of Typst, and we will develop and maintain it in cooperation with the community.”
I sure hope they plan to publish something
at least as usable as LaTeX (i.e. all the parts needed to compile a paper to common output formats are there)
under a license that allows inclusion in mainstream distributions
People who write papers, articles and books care a lot about not losing their work (e.g. if the startup goes under). I don’t write papers any more but I’m sure I wouldn’t have considered typst before those conditions were met.
Thanks for taking this on!
That is the plan! The open source part will cover the whole compiler and CLI and the license will probably be a permissive one (e.g. MIT/Apache-2). We’re only keeping the web app proprietary, but don’t want to lock anybody into it, so you will always be able to download your projects and compile them locally.
How simple is the compiler to re-implement? What’s the ETD (estimated development time)?
I wonder if the thesis itself is typeset in the language?
It does look like standard (La)TeX, but that could be a requirement from the institution.
Hey, I’m the author of the thesis. As already pointed out in another comment, it is completely typeset with Typst. Typst also doesn’t use (La)TeX in the backend; I just used a LaTeX-like font so as to fit the typical thesis style. Luckily there was no such requirement from the institution.
Well, you fooled me. :)
I hate that font, but it certainly makes it look authentically LaTeX.
I only had a chance to skim the thesis, so sorry if I missed something, but it looks as if this reproduces the LaTeX mistake of combining markup and formatting instructions. Do you have conventions for separating them? When I write LaTeX (I’ve written four books in LaTeX, among other things), I have to be very careful to not actually write LaTeX, but instead to write LaTeX-syntax semantic markup and then a separate preamble that defines how to translate this into formatted output. If I don’t do that, then exporting to something like HTML is very hard. I didn’t do this for my first book and the ePub version was a complete mess. I did for later ones and was able to write a different tool that just converted most of my semantic markup into class= attributes on HTML elements or, in a few cases, used them as input for some processing (e.g. headings got added to a ToC, references to code files got parsed with libclang and pretty-printed).
This is one of my favourite things about SILE: your input is in some text markup language to describe the text and a programming language (Lua) to describe transforms on the text.
You can combine markup and formatting instructions, but you can also write semantic markup in Typst. Typst’s styling system works with the document’s structure: A show rule for an element defines the transformation from structure into output (or other structure that is transformed recursively). Since the show rule can execute arbitrary code, you have lots of flexibility here. You can even apply other rules to the content that shall be transformed, so e.g. your transformation for figures could redefine how emphasized text is transformed within them.
At the moment, you can only use show rules with built-in structural elements, so user-defined functions don’t work with it, but this is something we will support in the future. And since PDF is the only supported format for now, in the end your show rules will transform the document to a fixed layout. However, more export formats (primarily HTML) are on our radar and we could then provide different sets of primitives for these other formats. This way, you could have a second set of show rules that define how to export the structure to XML.
Thanks. I’ll look forward to seeing what it looks like with HTML output. It’s a very different problem, but a language that can generate both TeX-quality PDFs and clean semantic HTML would be very attractive.
Page 8:
Got me fooled too, tho. So this is a testament to the output quality already. (It doesn’t do ligatures there tho.)
I was wondering the same, and then wondered why there aren’t more of these sorts of systems that just output TeX as a backend language. I suppose part of the goal is not just the language, but to reinvent the ickier parts of the TeX ecosystem.