Is Typst able to match TeX’s quality for hyphenation and line length balancing yet? Every document I’ve seen so far looks worse than even MS Word in terms of line splitting.
Seems like it can
https://typst.app/docs/tutorial/advanced-styling/
Look at the images in the link. For example this one, it’s making hilariously bad line-breaking decisions.
For example, it decides to break “animo” into “an- imo”. Keeping the word together but shifting it to the line below would barely have an effect on the first line, but would significantly improve readability.
And it’s doing that in every single typst example I’ve seen so far.
I think that’s a decent decision, since moving the “an” to the next line would cramp it and cause the “permagna” to be split. There is enough space in the line after to move a few characters, but I think breaking “an- imo” is better than “permag- na”.
Of course, I’m no expert, and those are just my two cents.
Regardless of the decision to break it up, it should be “a-ni-mo”, not “an-imo”.
Typst uses the same hyphenation patterns TeX does. In the example, it is most likely hyphenating Latin with rules for English. Which isn’t great, but setting the language to Latin for this example also isn’t helpful in a tutorial.
I’m not disagreeing, just wondering what rule should be invoked when hyphenating words (I assume in English, even if the example text is pseudo-Latin). Is it that the second part of the hyphenated word should start with a consonant?
For extra fun, English and the fork spoken on the other side of the pond have completely different hyphenation rules. In English, hyphenation decisions are based on root and stem words; in the US version, they are based on syllables.
“Two countries separated by a common language.”
I’m curious about what LaTeX is doing to get better line-breaking decisions, because that isn’t something I noticed before you pointed it out. Is it a fundamental algorithmic choice related to why LaTeX is multi-pass?
TeX hyphenation works as a combination of two things. The line breaking uses a dynamic programming approach that looks at all possible break points (word boundaries, hyphenation points), assigns a badness value for breaking lines at any combination of these, and minimises it (the dynamic programming approach throws away the vast majority of the possible search space here). Break points each contribute to badness (breaking between words is fine, breaking at a hyphenation point is worse; I think breaking at the end of a sentence is better, but it’s 20 years since I last tried to reimplement TeX’s layout model). Hyphenation points are one of the inputs here.
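The dynamic programming idea described above can be sketched in a few lines. This is a minimal illustration of the principle (minimising summed badness over the whole paragraph), not TeX’s actual code; the badness formula and function names are simplified assumptions, and hyphenation points would just add more candidate break positions with their own penalties:

```python
def badness(words, i, j, width, is_last):
    """Badness of setting words[i:j] as a single line of the given width."""
    length = sum(len(w) for w in words[i:j]) + (j - i - 1)  # letters + spaces
    if length > width:
        return float("inf")       # overfull lines are never acceptable
    if is_last:
        return 0                  # the last line may be arbitrarily short
    return (width - length) ** 3  # cube the slack, roughly like TeX

def break_paragraph(words, width):
    """Choose the break points that minimise total badness for the paragraph."""
    n = len(words)
    best = [0.0] + [float("inf")] * n   # best[j]: min cost to typeset words[:j]
    prev = [0] * (n + 1)                # prev[j]: where the last line starts
    for j in range(1, n + 1):
        for i in range(j):              # consider every possible last line
            cost = best[i] + badness(words, i, j, width, is_last=(j == n))
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk the prev[] chain backwards to recover the chosen lines.
    lines, j = [], n
    while j > 0:
        lines.append(" ".join(words[prev[j]:j]))
        j = prev[j]
    return lines[::-1]
```

Because the cost is summed over all lines, a slightly worse early break can be chosen if it makes later lines much better, which a greedy breaker can never do.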
The way that it identifies the hyphenation points is particularly neat (and ML researchers recently rediscovered this family of algorithms). They build short Markov chains from a large corpus of correctly-hyphenated text that give you the probability of a hyphenation point being in a particular place. They then encode exceptions. I think, for US English, the exception list was around 70 words. You can also manually add exceptions for new words. The really nice thing here is that it’s language agnostic. As long as you have a corpus of valid words, you can generate a very dense data structure that lets you hyphenate any known word correctly and hyphenate unknown words with high probability.
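The pattern scheme (Liang’s algorithm, from his thesis on TeX hyphenation) can be sketched as follows. The patterns below are purely illustrative toy patterns, not the real English set; in the real scheme they are learned from a hyphenated corpus. Digits sit between letters, the maximum digit across all matching patterns wins at each position, and odd digits mark allowed break points:

```python
import re

# Toy patterns; '.' marks a word boundary. The real sets contain thousands.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tion", "4te."]

def parse(pattern):
    """Split 'hy3ph' into letters 'hyph' and per-gap digits [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            pos += 1
    return letters, digits

def hyphenate(word):
    """Insert '-' at every interior position whose winning digit is odd."""
    text = "." + word.lower() + "."
    scores = [0] * (len(text) + 1)
    for pat in PATTERNS:
        letters, digits = parse(pat)
        for start in range(len(text) - len(letters) + 1):
            if text[start:start + len(letters)] == letters:
                for k, d in enumerate(digits):
                    scores[start + k] = max(scores[start + k], d)
    out = []
    for i, ch in enumerate(word):
        if i > 0 and scores[i + 1] % 2 == 1:  # odd score: break allowed here
            out.append("-")
        out.append(ch)
    return "".join(out)
```

The data structure is dense exactly as described: a fixed pattern set covers the whole language, and words the patterns were never trained on still get plausible break points.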
All those cryptic warnings about badness 10000 finally mean something.

“underfull hbox badness 10000” haunts my nightmares
Yup, there’s a configurable limit for this. If, after running the dynamic programming algorithm, the minimum badness that it’s found for a paragraph (or any box) is above the configured threshold, it reports a warning. You can also add \sloppy to allow it to accept a higher badness to avoid writing over the margin. If you look at how this is defined, it’s mostly just tweaking the threshold badness values.

I think TeX also tries to avoid rivers, right?
Yup, there are a bunch of things that contribute to badness. The algorithm is pretty general.
It’s also very simple. Many years ago, I had a student implement it for code formatting. You could add penalties for breaking in the middle of a parenthetical clause, for breaking before or after a binary operator, and so on. It produced much better output than clang-format.
Huh, it’s surprising to me that you still need an exception list. Can you fix your corpus instead so it has a bunch of examples for the exceptions?
Some words, if added to the corpus, would still get hyphenated wrongly, but their influence on the corpus would actually decrease hyphenation accuracy for all other words as well.
This mostly applies to loan words as they tend to follow different hyphenation rules than the rest of the corpus.
The corpus contains the exceptions (that’s how you know that they’re there). The compressed representation is a fixed size, independent of the size of the corpus and so will always have some exceptions (unless the source language is incredibly regular in its hyphenation rules). A lot of outliers also work because they manage to hit the highest-probability breaking points and are wrong only below the threshold value.
That’s exactly the reason why it has to be multi-pass, why it’s so slow and part of why TeX was created in the first place.
TeX ranks each possible line break and hyphenation position and tries to get the best score across an entire paragraph, or even across an entire document if page breaks are involved. By contrast, MS Word tries to get the best score for any two adjacent lines, and Typst just breaks and hyphenates whenever the line length is exceeded.
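The “break whenever the line is full” strategy attributed here to simpler engines (a later reply disputes that this is what Typst actually does) is just a greedy first-fit loop. A minimal sketch, with an illustrative function name:

```python
def break_greedy(words, width):
    """Fill each line until the next word would overflow, then break.
    Earlier decisions are never reconsidered, unlike the paragraph-wide
    dynamic programming approach."""
    lines, current = [], ""
    for w in words:
        if not current:
            current = w
        elif len(current) + 1 + len(w) <= width:
            current += " " + w
        else:
            lines.append(current)
            current = w
    if current:
        lines.append(current)
    return lines
```

Since each break is final, a tight fit on one line can force very loose or heavily hyphenated lines further down, which is exactly the raggedness the paragraph-wide optimisation avoids.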
It’s worth noting that ‘slow’ means ‘it takes tens of milliseconds to typeset a whole page on a modern computer’. Most of the slowness of LaTeX comes from interpreting complex packages, which are written in a language that is barely an abstraction over a Turing machine. SILE implements the same typesetting logic in Lua and is much faster. It also implements the dynamic programming approach for paragraph placement. This was described in the TeX papers but not implemented because a large book would need as much as a megabyte of RAM to hold all of the state and that was infeasible.
SILE: https://sile-typesetter.org/what-is-sile/ if anyone was wondering.
This reminds me, I never understood why typst got so much attention while SILE seems ignored. Wouldn’t SILE be an equally good replacement for the OP?
Simon has not done a great job at building a community, unfortunately. I’m not sure why - he’s done a lot to change things for other people’s requirements, but that hasn’t led to much of a SILE community. In part, he didn’t write much documentation on the internals until very recently, which made it hard to embed in other things (I’d love to implement an NSTypesetter subclass delegating to SILE. The relevant hooks were there, but not documented). This has improved a bit.

Without a community, it suffers from the ecosystem problem. It looks like it’s recently grown an equivalent of TeX’s math mode and BibTeX support, but there’s no equivalent of pgfplots, TikZ, and so on.
I don’t know that much about SILE, but Typst seems to be tackling a different issue that TeX has - awful convoluted syntax.
SILE somewhat gets around this, to be fair - it allows for XML input, which is fairly versatile! But SILE seems more oriented toward typesetting already finished works, while Typst seems to be aiming for the whole stack, even if it has less versatile typesetting.
Different focuses, I guess, though I know Typst wants to improve its typesetting quality.
I’m not familiar with either SILE or Typst, but maybe the input format is better in Typst for OP?
It is not true that Typst just hyphenates whenever the line length is exceeded. When justification is enabled, it uses the same algorithms as TeX for both hyphenation and line breaking. It’s true that hyphenation isn’t yet super great, but not because of the fundamental algorithm. It’s more minor things like selecting the best hyphenation cost, and there are some other minor things like river prevention that aren’t implemented at the moment. I agree that the hyphenation in the linked example isn’t that great. I think part of the problem is that the text language is set to English, but the text is in Latin.
As much as I would’ve liked to put it in the post itself, I didn’t think a meme would fit the aesthetic, but here you all go
https://mastodon.boiler.social/system/media_attachments/files/111/483/605/963/321/207/original/d5380782add62ce1.png
Great minds think alike https://twitter.com/btbytes/status/1726938523788509230 (Nov 21, 2023)
I like it, but going to need some progress on accessibility before I seriously adopt it.
That was an enlightening thread to read through. As you aren’t taking Typst seriously yet, would you mind if I ask what you are currently using? I’d assumed that you were going to say LaTeX, but the comments made it clear that LaTeX is just as bad, if not worse.
As mentioned elsewhere in the thread, it is possible to tag LaTeX documents.
But Typst has the opportunity to go beyond merely possible with enough effort. :)
🤦 I missed that part. That’s good to know about and something I’ll ensure is included in any future documents I produce.
I also share your hope that Typst takes advantage of the opportunity that they have for better accessibility.
We definitely plan to make Typst documents accessible in the future!
That’s really good to hear!
It might be possible to circumvent the lack of native support for this feature for now: Typst allows for arbitrary metadata (I have used it for creating notes for a presentation) that can later be queried into a JSON-like format. Maybe adding some structured metadata to your Typst file, querying the relevant info, compiling the PDF, and adding the metadata to the PDF with some external tool could be made into a shell script or tiny wrapper program.
Just an idea; of course, native support would be best in the long run.
Typst looks really promising. I hope it gains more traction.
I have a two-page paper on a minor point of complex differential geometry I typeset whenever I try out a new typesetting system. It only has two unnumbered theorems, a few displayed formulas, and a couple of citations, so it is a low bar to clear. (Even so, some systems do not clear it. Looking at you, groff.)
Typesetting it in Typst was fairly painless. The math syntax is a lot nicer than that of LaTeX, even though I burned those commands into my muscle memory a long time ago. I didn’t look for external packages or things to deal with the theorem environments and proofs, so I just typeset them by hand a la TeX. That was easy and the upgrade path to something more LaTeX-like was clear.
Once we get into the weeds, I do think that LaTeX’s many, many decades of polish start to show. I used to work at a scientific publisher, so I have opinions about typesetting. They bump against some things here. A minor one is that I don’t like the sizes Typst (automatically) picks for displayed nested parentheses. I also don’t like the sizes LaTeX’s \left( and \right) macros pick either, so this isn’t a major point against Typst.

A larger point against it here is that once I wanted to typeset proofs with the traditional tombstone/Halmos at the end, I ran into horizontal space control issues I couldn’t quite figure out. I wanted to place the tombstone aligned with the right margin of the page. If my proof ended in text mode I could do this by ending the proof like this:

... which is left to the reader. #h(1fr) $qed$
This is pretty nice. However, I couldn’t figure out how to get Typst to do the same when the proof ended on a displayed equation. For example, this didn’t work:

#h(1fr) 1 + 1 = 2. #h(1fr) qed
I resorted to binary searching a suitable point size for the horizontal spaces by hand, but There Has Got To Be A Better Way.
All in all pretty interesting. I’m not sure whether the system itself needs the second 90% of work that LaTeX has already had, or whether I need to read the docs more. It’s a remarkable achievement for Typst to have gotten this far.
Funnily, the Typst author’s master’s thesis (which I assume was set in Typst) has an image instead of text on page 12. I wonder if they hid a problem that required this workaround.
That was just Acrobat being weird when converting to PDF/A.
ah, yes, the prime factor in my technology choices
For those liking both systems, there’s also https://github.com/fenjalien/obsidian-typst which hopefully will get better and better with time.
How were you writing LaTeX in Obsidian?
https://help.obsidian.md/Editing+and+formatting/Advanced+formatting+syntax#Math
Really like Typst; I have used it for building a CV, but I would be interested to know whether an HTML converter is plausible in the future. I could see it being used for static blogs.
Pandoc can convert it to HTML, and to all the other outputs Pandoc supports.
I rarely have to touch LaTeX, as I write my text in org-mode and let it compile to LaTeX using org-export. I wonder what the level of support for that workflow is?