I was curious about how the CommonMark specification deals with this problem. Looks like it is quite tricky, but they did manage to come up with a bunch of rules to eliminate the ambiguity, while still allowing efficient parsing: https://spec.commonmark.org/0.29/#emphasis-and-strong-emphasis
It has a whopping 130 examples to show the corner cases!
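Out of curiosity, here is a quick way to poke at those rules yourself; a minimal sketch assuming the markdown-it-py Python package, which follows the CommonMark spec (the choice of library is incidental):

```python
# Feed a few of the ambiguous emphasis cases from this thread to a
# CommonMark-compliant parser and see how the spec's rules resolve them.
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")

for case in ("**this **text**", "*foo**bar**baz*", "***both strong and em***"):
    print(repr(case), "->", md.render(case).strip())
```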
I’m generally worried about how underspecified Markdown seems to be. Different parsers sometimes do very different things, and I wish there were a generally accepted, completely rigorous standard that is deterministic down to the last character. A language that is this ambiguous doesn’t fill me with a lot of confidence. The most vexing parse is ambiguous at the syntactic level but not at the semantic one, an important difference between C++ and Markdown. I don’t understand why it’s always the markup languages that cause this mess; nobody would put up with having no proper C++ standard because “different people have different needs”. How does that square with interoperability? If you need to serve specific needs, you can create a rigorous language that has well-defined entry points for plugins and extensions. There’s no need to ever go down the ambiguity route.

Worth noting https://commonmark.org/. This specification is implemented by Reddit, GitHub, GitLab, Discourse, etc.

It was designed as a lightweight format for blogging and commenting. There are plenty of alternative markup languages that are better specified.
In general I think that Markdown is a good format for blog posts, since it is quite simple to read and write, and in that case there is only one parser for that content. On the other hand, I completely do not understand the trend of using Markdown (or CommonMark) for documentation. Markdown is just a poor choice there, as:
- It is underspecified in its original form; CommonMark only partly solves this.
- It has no extensibility, which forces people to hack around it to get the features they need.
- It is mostly HTML-centric, so it lacks features that are useful for non-HTML documentation, like manpages, PDFs, etc. (see the example below).
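To make that last point concrete: per the CommonMark spec, raw HTML in the source is passed straight through to the output, which only makes sense when the target format is HTML. A minimal illustration, again assuming the markdown-it-py package (any CommonMark-compliant parser would show the same thing):

```python
# Raw HTML is passed through untouched, so the source is really tied to an
# HTML output target; a manpage or PDF backend has nothing sensible to do here.
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
print(md.render("A note with <b>raw HTML</b> and a <br> in the middle."))
```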
The next time someone quotes “Programs are meant to be read by humans and only incidentally for computers to execute” at me, I might refer them to Markdown.
Markdown’s author (John Gruber) is quite clear that he does not care about whether it has a formal grammar. When edge cases have been found, he might try to fix them (in his implementation), but will just as likely say that as the human author of markdown, you should avoid those parts of the language. He’s like the sorcerer’s apprentice, if the apprentice looked around at all the animated brooms splashing water and said “seems fine to me”. Perhaps he’s even right about that, it’s just a way of viewing the language that conflicts with what many programmers expect.
Markdown is not a programming language that translates human-readable text into machine-interpreted code; it’s a markup language that translates human-readable text into another form of human-readable text (rendered HTML).
Gruber isn’t the apprentice in this case, he’s the sorcerer. The apprentices are the people who insist that Markdown should be extended from its original role as a lightweight blog and comment format into the be-all and end-all of documentation formats, covering every corner case imaginable.
IMHO Gruber did absolutely the right thing when he insisted that the CommonMark crew not use the name Markdown. CM is a great project and it’s nice to have a formal spec, but it’s not the original Markdown that Gruber created, and that I’m still using as a Perl plugin to my Blosxom blog.
I don’t really make a distinction between markup and program. It’s a language that’s interpreted by the machine to specify a transformation between data formats. It differs from C/Haskell/Lisp/Perl/APL in many other ways, of course.
I think Gruber’s vision has some logic to it, but it’s really stressed by the wide use of Markdown. It would be easier if every user of Markdown were like Gruber: publishing on their own blog, using software they control. Instead, there are a lot of third-party sites that incorporate user-generated content in Markdown. That pushes you towards an unambiguous standard. It probably would’ve been impossible, but the right thing might have been to say “please don’t use the project like that.”
> I don’t really make a distinction between markup and program. It’s a language that’s interpreted by the machine to specify a transformation between data formats.
I can see your point. In this particular case, I’d argue that the distinction is fuzzy. Whether *text* or **text** shows up as italic or bold doesn’t have any real semantic meaning apart from “emphasis”.
Gruber just wanted to give users the option to use asterisks or underscores to provide emphasis, because that was the main convention in email formatting. Had he been more of a programmer and less of a pragmatist, he might have mandated one or the other to avoid ambiguity. Instead he embraced it.
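For what it’s worth, CommonMark kept both spellings; a tiny check, again assuming markdown-it-py (one notable restriction the spec added is that underscores don’t create emphasis inside words):

```python
# Asterisks and underscores produce the same emphasis tags; intraword
# underscores are the one place the spec treats them differently.
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
for case in ("*em*", "_em_", "**strong**", "__strong__", "intra*word*", "intra_word_"):
    print(case.ljust(12), "->", md.render(case).strip())
```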
Markdown may be a terribly specced programming language/transform, but it’s a great, easy to use[1] tool to transform plain text into valid HTML. It generally Does What You Mean™. That’s why it won out.
[1] apart from stuff like specifying image links, and out-of-band links…
When I came across Markdown for the first time, like a lot of people I thought it was awesome and wanted to use it everywhere; I wanted an official specification, a test suite, and compatible libraries for every language and platform. When I heard about the CommonMarkdown standardisation proposal, I was excited.
When Gruber bluntly refused to participate, I was frustrated. When he requested (if I recall correctly) that the proposal be renamed to not include “Markdown” (it was renamed to CommonMark), I was angry. He published the first implementation, and advertised it! Didn’t he want people to use it? Didn’t he want to see it flourish? Was he just bitter about losing control?
Eventually, though, it occurred to me (although I have no way to verify this) that perhaps Markdown became the thing Gruber had intended Markdown to destroy. I think the real idea behind Markdown was not a particular syntax, but the idea of a light-weight, human-friendly interface tailored to a particular use-case. So, a buggy Perl module that can be easily hacked on and extended is Markdown, but a rigorous CommonMark implementation is not. The original AsciiDoc, which allowed custom markup definitions, has the Markdown nature; Asciidoctor, which just codifies AsciiDoc’s defaults, does not. Chuck Moore’s multitude of Forth variants have the Markdown nature, ANSI Forth does not. Breadboards and wire-wrapping have the Markdown nature, printed-circuit boards and surface-mount components do not.
It’s not that rigorous definition and standardisation are bad; they can be good, great, and crucially important, but they’re a trade-off. To obtain the benefits of standardisation, something else must be left behind.
The bigger question for me was always “How did markdown get so popular to start with?” It has a lot of widely known flaws and people love to complain about corner cases, the different parsers, etc. So why did everyone start using it to begin with?
My take: it got big because it was easy to use, the tool (singular) worked for Movable Type, which was huge at the time, and John Gruber had a big audience in the much smaller world of blogging back then.
I honestly think that was it. When websites a bit later wanted to allow simple formatting via plain text (as opposed to rich text edit boxes or bbcode or similar), they looked around, saw that Markdown was out there and much more known than the alternatives, and it kind of snowballed from there.
I don’t know when Reddit adopted Markdown for commenting and posting, but it must have been a big impetus too.
Re the first point; why not …

It wouldn’t properly match **this **text**.

I was only talking about the first point, which is about the ambiguity between double em and strong.

Oh, I see what you’re saying now. Mea culpa. I think there’s still some ambiguity, though, with how **** should be interpreted, as normal-chars can be empty.
Hmm, I see. It could be either (**)(**) or (****). However, there is no empty bold or italic block, so we could also require at least a single char here and make the empty string part of the text run.

I suddenly get why language designers love fuzzers so much. (Edit: because finding ambiguous cases here is hard on my brain.)
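For comparison, the CommonMark reference behaviour also has no empty emphasis. A quick check, again assuming markdown-it-py (the delimiters are kept inside a sentence because a bare **** on a line of its own would parse as a thematic break):

```python
# With nothing between the delimiters there is no emphasis to build, so the
# asterisks should come through as literal text.
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
print(md.render("an empty run **** and a lone pair ** stay literal"))
```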
And what is the proper way to match **this **text**? I entered it into Babelmark, and it seems like there are substantial differences even between the most well-known Markdown parsers.

Here’s an edge case that breaks this grammar, which renders as Bold and Italic using Lobsters’ Markdown renderer.
You are right, this does break my posted grammar. I think I can fix it with a leveled grammar, though: i.e., strong2 only allows normal text or a single level of em, etc. I’ll post an update if I find one.
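In case it helps, here is a heavily simplified sketch of that leveled idea, written with regexes rather than a proper grammar and not the posted grammar itself: strong may contain plain text or one level of em, em contains only plain text, and every span must be non-empty, which also takes care of the **** case above.

```python
import re

# Toy sketch of a "leveled" emphasis grammar (an illustration, not the posted
# grammar): strong may contain plain text or a single level of em; em may
# contain only plain text; both require at least one character of content.
EM = r"\*(?P<em>[^*]+)\*"                              # *text*, non-empty
STRONG = r"\*\*(?P<strong>(?:[^*]|\*[^*]+\*)+)\*\*"    # ** text or *em* **

def render(src: str) -> str:
    """Very rough HTML-ish rendering, for illustration only."""
    def strong_repl(m: re.Match) -> str:
        inner = re.sub(EM, r"<em>\g<em></em>", m.group("strong"))
        return f"<strong>{inner}</strong>"

    out = re.sub(STRONG, strong_repl, src)
    return re.sub(EM, r"<em>\g<em></em>", out)

print(render("**bold with *nested em* inside** and *plain em*"))
print(render("****"))  # no non-empty span, so the asterisks stay literal
```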
AsciiDoc, although an improvement compared to Markdown, is the same. Very frustrating if you’re trying to build tooling. All the parsers are unique in their own special way, and almost none of them will give you a model back; they’re almost entirely focused on text -> text transforms.
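For the Markdown side at least, some parsers do expose a model rather than only text -> text. A small sketch, assuming markdown-it-py, whose parse step returns a token stream you can walk or transform into non-HTML targets:

```python
# Instead of rendering straight to HTML, inspect the parsed token stream and
# do whatever the tooling needs with it.
from markdown_it import MarkdownIt

md = MarkdownIt("commonmark")
tokens = md.parse("Some **bold** text and a [link](https://example.com).")

for tok in tokens:
    print(tok.type)
    if tok.children:  # inline tokens carry their own child tokens
        for child in tok.children:
            print("   ", child.type, repr(child.content))
```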