1. 22

  2. 10

    There are some weird things in the specification.

    A new open bracket will cancel any preceding unmatched open bracket of its kind.

    This suggests that, for example, *foo and *bar* will get “correctly” processed into *foo and <strong>bar</strong>. As the user, I would rather get a warning and be invited to escape the first star, because this is likely to be a mistake on my part. (The “implicit cancellation” rule is not very Strict).

    The only form of links StrictMark supports is full reference links. The link label must be exactly one symbol long.

    So you cannot write [foo](https://example.com); you have to write [foo][1]. Fine with me. But then “one symbol long”? [foo][1] is allowed but [foo][12] is not; the document recommends switching to letters once you have more than ten references, so [foo][e] is okay but [foo][example] is not.

    I think that this limitation comes from trying to make StrictMark easy to parse with fairly dumb parser technology. Honestly, while I agree that 10K-line hand-written parsers are not the way to go for a widely used document format, I would rather have a good specification paired with tutorials on how unfamiliar programmers can implement decent parsing approaches (for example, recursive descent on a regex-separated token stream), than annoying choices in the language design made to accommodate poor technical choices.
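
    The “regex-separated token stream plus recursive descent” idea fits in a few lines. Here’s a toy grammar of my own (just * and ** emphasis), not StrictMark’s actual one:

    ```python
    import re

    # Split inline text into a token stream: emphasis markers or runs of text.
    TOKEN_RE = re.compile(r"(\*\*|\*|[^*]+)")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    def parse_inline(tokens, pos=0, closer=None):
        """Recursive descent over the token stream; stop when `closer` is seen."""
        out = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == closer:
                return out, pos + 1
            if tok in ("*", "**"):
                inner, pos = parse_inline(tokens, pos + 1, closer=tok)
                out.append(("em" if tok == "*" else "strong", inner))
            else:
                out.append(("text", tok))
                pos += 1
        if closer is not None:
            # A strict dialect can report the error instead of guessing.
            raise SyntaxError(f"unclosed {closer!r}")
        return out, pos

    tree, _ = parse_inline(tokenize("plain *emphasized* text"))
    ```

    Even this toy version is enough to report an unmatched opener as an error rather than silently cancelling it.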

    1. 5

      I totally agree. It would make much more sense to limit labels to a run of digits with no spaces (so [12] and [0001] are acceptable) than to a single symbol.
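
      Digit-run labels wouldn’t be any harder to parse, either; a single regex covers them (hypothetical syntax, just mirroring the [text][label] form):

      ```python
      import re

      # Hypothetical digit-run labels: [text][digits], no spaces allowed.
      REF_LINK_RE = re.compile(r"\[([^\[\]]+)\]\[([0-9]+)\]")

      refs = REF_LINK_RE.findall("see [foo][12] and [bar][0001]")
      # refs == [("foo", "12"), ("bar", "0001")]
      ```

      A label with a space in it, like [baz][1 2], simply fails to match.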

      1. 3

        I agree. To make matters worse, the specification says “one symbol long”. Sadly, “symbol” does not have a strict definition when it comes to text encoding or parsing. The text can be UTF-16 encoded, where one symbol may actually be 2 or more code units. Symbols might be language-dependent: a Czech or Slovak reader might consider “ch” to be one symbol, and a Dutch reader might consider “ij” to be one symbol. UTF-8-everywhere fans might be dismayed to learn that certain symbols are encoded as multiple codepoints by Unicode itself: for example, “ю́” (Cyrillic small letter yu with acute) looks, walks, and sounds like one symbol, but it is encoded as the sequence U+044E CYRILLIC SMALL LETTER YU followed by U+0301 COMBINING ACUTE ACCENT.
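
        This is easy to check in, e.g., Python, whose strings are sequences of codepoints:

        ```python
        import unicodedata

        s = "\u044e\u0301"  # "ю́": U+044E followed by U+0301
        assert len(s) == 2                       # two codepoints...
        assert unicodedata.combining(s[1]) != 0  # ...the second is a combining mark
        # No precomposed form exists, so normalization cannot merge them:
        assert len(unicodedata.normalize("NFC", s)) == 2
        ```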

        I think the closest thing to what the author intended is “grapheme cluster”, roughly, whatever you can highlight as one unit of text using your cursor is your one symbol. Good luck implementing that in a parser though.

        1. 1

          a Dutch reader might consider “ij” to be one symbol

          Certainly in the context of computers, I think very few people would, if any, since it’s always typed as the two separate letters “i” and “j”. Outside of that, things are a bit more complicated and “ij” has a bit of a weird/irregular status, but this isn’t something you really need to worry about in this context.

          There’s a codepoint for it, but that’s just a legacy ligature codepoint, just like U+FB00 for “ff” and U+FB06 for “st”, and a bunch of others. These days ligatures are handled by the font itself, and using the ligature codepoints is discouraged.
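
          You can see they’re legacy compatibility characters from how normalization treats them; for instance, in Python:

          ```python
          import unicodedata

          # NFKC maps the legacy ligature codepoints back to plain letters.
          assert unicodedata.normalize("NFKC", "\ufb00") == "ff"  # LATIN SMALL LIGATURE FF
          assert unicodedata.normalize("NFKC", "\ufb06") == "st"  # LATIN SMALL LIGATURE ST
          assert unicodedata.normalize("NFKC", "\u0133") == "ij"  # LATIN SMALL LIGATURE IJ
          ```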

          The text can be UTF-16 encoded, where one symbol may actually be 2 or more code units

          This has nothing to do with UTF-16, which is functionally identical to UTF-8 except that it encodes codepoints in a different way (2 or 4 bytes each, instead of 1 to 4). I don’t know what you mean by “one symbol is actually 2 or more code units”, as that’s a Unicode feature, not a UTF-16 feature.
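
          For what it’s worth, the code-unit/codepoint distinction is easy to see with a character outside the BMP; here’s a Python sketch (Python’s len counts codepoints, while e.g. Java’s and JavaScript’s string lengths count UTF-16 code units):

          ```python
          s = "\U0001d11e"  # U+1D11E MUSICAL SYMBOL G CLEF, one codepoint
          assert len(s) == 1                           # one codepoint
          assert len(s.encode("utf-16-le")) // 2 == 2  # two UTF-16 code units (a surrogate pair)
          assert len(s.encode("utf-8")) == 4           # four UTF-8 code units (bytes)
          ```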

          UTF-8 everywhere fans might be dismayed to know that certain symbols are encoded as multiple codepoints by unicode itself

          Yes, and this works fine in UTF-8?

          I think the closest thing to what the author intended is “grapheme cluster”, roughly, whatever you can highlight as one unit of text using your cursor is your one symbol. Good luck implementing that in a parser though.

          Most languages should have either native support for this or a library for it, and it’s actually not that hard to implement.

          They did mean “codepoint” though, as that is what is in the grammar:

          PUNCT = "!".."/" | ":".."@" | "[".."`" | "{".."~";
          WS = [ \t\r\n];
          WSP = WS | PUNCT;
          LINK_LABEL = CODEPOINT - WSP - "]";
          

          You probably want to restrict this a bit more; there’s much more “white space” and “punctuation” than just those listed, and control characters, combining characters, format characters, etc. could lead to some very strange rendering artefacts. All of this should really be based on Unicode categories.
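
          Such a category-based filter is only a few lines; here’s a sketch with my own choice of excluded categories (not anything from the spec):

          ```python
          import unicodedata

          def allowed_label_char(ch):
              # Reject separators (Z*), controls/format chars (C*), and
              # combining marks (M*); allow everything else in a label.
              return unicodedata.category(ch)[0] not in ("Z", "C", "M")

          assert allowed_label_char("e")
          assert not allowed_label_char("\u0301")  # COMBINING ACUTE ACCENT (Mn)
          assert not allowed_label_char("\u200b")  # ZERO WIDTH SPACE (Cf)
          assert not allowed_label_char("\u00a0")  # NO-BREAK SPACE (Zs)
          ```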

          1. 1

            My main point is that I can see how a naive implementation might use the built-in length function to check whether something is one “symbol” long, and that it will fail in non-obvious ways on abstract characters that a reader would consider one character.

            Most languages should have either native support for this or a library for it, and it’s actually not that hard to implement.

            Except they don’t. Here’s an example: the following string consists of 16 grapheme clusters (including spaces), but anywhere from 20 to 22 codepoints.

            Приве́т नमस्ते שָׁלוֹם

            I invite you to use any of your tools that you think would handle this correctly and tell me if any do. And this example is without resorting to easy gotchas, like combining emojis “👩‍👩‍👦‍👦”.

            1. 2

              My main point is I can see how a naive implementation might use the built-in length function to check if something is one “symbol” long

              Well in this case that would be correct as the specification says it’s a single codepoint.

              I invite you to use any of your tools that you think would handle this correctly and tell me if any do.

              Searching for “grapheme” should turn up a library. Some languages have native support (specifically, IIRC Swift does, and I thought Rust too, but I’m not sure) and others may include some support in the stdlib. Like I said, this is not super hard to implement; the Unicode specifications always make this kind of stuff seem harder than it actually is because of the way they’re written, but essentially there’s just a .txt file listing the codepoints that are “grapheme break characters”, and the logic isn’t that hard.
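
              A deliberately simplified version of that logic (attach combining marks to the preceding base character; the real UAX #29 rules also handle Hangul jamo, ZWJ emoji sequences, etc.) already gets the parent comment’s example right:

              ```python
              import unicodedata

              def count_graphemes(s):
                  # Simplified rule: a codepoint in a mark category (Mn/Mc/Me)
                  # joins the previous cluster; everything else starts a new one.
                  n = 0
                  for ch in s:
                      if n and unicodedata.category(ch).startswith("M"):
                          continue
                      n += 1
                  return n

              # "Приве́т नमस्ते שָׁלוֹם" built from explicit escapes, 22 codepoints total:
              s = ("Приве\u0301т "
                   "\u0928\u092e\u0938\u094d\u0924\u0947 "
                   "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd")
              assert len(s) == 22
              assert count_graphemes(s) == 16
              ```

              It falls over on the emoji gotchas, but it shows the shape of the implementation.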

      2. 7

        I’m always a fan of stricter languages, and the effort on this project is much appreciated. Some of these restrictions strike me as odd considering the aims of the project, though. For instance, why must all markings occupy at least the first 4 characters of the line in order to work? Why can’t # This be the syntax for a heading?

        1. 4

          This seems IMHO counter to the spirit and purpose of Markdown. It’s supposed to be an informal, easy-to-remember syntax, not a formal markup language; I think the saying goes “you already know how to write it!”

          Creating a strict grammar means syntax errors, right? Or at least cases where nothing happens because your syntax wasn’t totally correct. Removing duplicate features means some people’s existing knowledge or intuition breaks because this implementation removed their preferred syntax.

          I totally get that writing a Markdown parser must be annoying because the syntax is vague and illogical. But that’s because it’s designed for the convenience of puny humans, who are vague and illogical. (Especially the ones who aren’t coders!)

          1. 3

            Interesting! I’m also working on a stricter Markdown alternative, but it’s more of a toy (link). Highly agree about the link format. One of my biggest gripes with Markdown is how difficult it is to read with huge web addresses in the middle of the text.

            1. 2

              Are you aware that reference-style links exist? I don’t know how widely supported they are, though.

              1. 1

                Yeah, I know, and I almost always use them. The problem is that most other people don’t!

            2. 2

              There is an XKCD for this…