1. 39
    1. 8

      Regarding Table of Contents generation, what do you think about hxtoc(1) which is part of the HTML/XML utilities by the w3c?

      Also, I recently had a similarly joyful experience discovering CommonMark, but instead of using the parser you mention, I’ve taken up lowdown as my tool of choice. I guess this is something it has in common with most C implementations of markdown, but especially when compared to pandoc, it is fast: it took me a fraction of a second to generate a website, instead of a dozen seconds or more. So I wanted to ask what other implementations you’ve looked into, for example discount, another popular one.

      1. 5

        Hm, I’ve actually never heard of hxtoc, lowdown, or discount!

        I haven’t been using Markdown regularly for very long. I started using it more when I started the Oil blog in late 2016. Before that, I wrote a few longer documents in plain HTML, and some in Markdown.

        I had heard of pandoc, but never used it. I guess cmark was a natural fit for me because I was using markdown.pl in a bunch of shell scripts. So cmark pretty much drops right in. I know a lot of people use framework-ish static site generators, which include markdown. But I really only need markdown, since all the other functionality on my site is written with custom scripts.

        So I didn’t really do much research! I just felt that markdown.pl was “old and smelly” and I didn’t want to be a hypocrite :-) A pile of shell scripts is pretty unhygienic and potentially buggy, but that is what I aim to fix with Oil :)

        That said, a lot of the tools you mention do look like they follow the Unix philosophy, which is nice. I would like to hear more about those types of tools, so feel free to post them to lobste.rs :) Maybe I don’t hear about them because I’m not a BSD user?

        1. 4

          I had heard of pandoc, but never used it.

          It’s a nice tool, not only for working with Markdown but for tons of other formats too. But Markdown is kind of its focus… If you look at its manual, you’ll find that it can be very finely tuned to match one’s preferences (such as enabling or disabling raw HTML, syntax highlighting, math support, BibLaTeX citations, special list and table formats, etc.). It even has output options that make it resemble other implementations like PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown and also markdown.pl! Furthermore, it’s written by John MacFarlane, who is one of the guys behind CommonMark itself. In fact, if you look at the cmark contributors, he seems to be the most active maintainer.

          I usually use pandoc to generate .epub files or to quickly generate a PDF document (version 2.0 supports multiple backends, besides LaTeX, such as troff/pdfroff and a few html2pdf engines). But as I’ve mentioned, it’s a bit slow, so I tend to not use it for simpler texts, like when I have to generate a static website.

          I know a lot of people use framework-ish static site generators, which include markdown.

          Yeah, personally I use zodiac, which uses AWK and a few shell script wrappers. You get to choose the converter, which takes some format in and puts HTML out. It’s not ideal, but other than writing my own framework, it’s quite ok.

          Maybe I don’t hear about them because I’m not a BSD user?

          Nor am I, at least not most of the time. I learned about those HTML/XML utilities because someone mentioned them here on lobste.rs, and I was surprised to see how powerful they are, and yet how seemingly nobody knows about them. hxselect queries specific elements in a CSS fashion, hxclean is an automatic HTML corrector, and hxpipe/hxunpipe convert (and reconvert) HTML/XML to a format that can be more easily parsed by AWK/perl scripts – certainly not useless or niche tools.

          But I do have to admit that a BSD user influenced me to adopt lowdown, and since it fits my use-case, I stick with it. Nevertheless, I might take a look at cmark, since it seems interesting.

      2. 2

        Unfortunately, it looks like lowdown is a fork of hoedown which is a fork of sundown which was originally based on the markdown.pl implementation (with some optional extensions), and is most likely not CommonMark compliant. Pandoc is nice because it can convert between different formats, but it also has quite a few inconsistencies.

        One of the biggest reasons I like CommonMark is because it aims to be an extremely solid, consistent standard that makes markdown more sane. It would be nice to see more websites move towards CommonMark, but that’s probably a long shot.

        If you get a chance, definitely check out babelmark, which lets you test different markdown inputs against a bunch of different parsers. There are a bunch of example divergences on the babelmark FAQ. The sheer variety of outputs for some simple inputs is precisely why CommonMark is useful as a standard.

        1. 3

          Lowdown isn’t CommonMark conformant, although it has some bits in place. The spec for CommonMark is huge.

          If you’re a C hacker, it’s easy to dig into the source to add conformance bit by bit. See the parser in document.c and look for LOWDOWN_COMMONMARK to see where bits are already in place. The original sundown/hoedown parser has been considerably simplified in lowdown, so it’s much easier to get involved. I’d be thrilled to have somebody contribute more there!

          In the immediate future, my biggest interest is in getting an LCS implementation into the lowdown-diff algorithm. Right now it’s pretty ad hoc.

          (Edit: I’m the author of lowdown.)

        2. 2

          One of the biggest reasons I like CommonMark is because it aims to be an extremely solid, consistent standard that makes markdown more sane. It would be nice to see more websites move towards CommonMark, but that’s probably a long shot.

          I guess I can agree with you when it comes to websites like Stackoverflow, Github and Lobsters having Markdown formatting for comments and other text inputs, but I really don’t see why it should be a priority for your own static blog generator, even if the tool isn’t 100% CommonMark compliant. I mean, it’s nice, no doubt. But as long as you don’t intentionally use uncertain constructs and don’t over-format your texts to make them more complicated than they have to be, I guess that most markdown implementations are fine in this regard. Speed, on the other hand, is a different question.

          1. 1

            Are you saying that CommonMark should be used for comments on websites, but not for your own blog?

            I would say the opposite. For short comments, the ambiguity in Markdown doesn’t seem to be a huge problem, and I am somewhat comfortable with just editing “until it works”. I don’t use very many constructs anyway – links, bold, bullet points, code, and block code are about it.

            But blogs are longer documents, and I think they have more lasting value than most Reddit comments. So although it probably wasn’t strictly necessary to switch to cmark, I like having my blog in a format with multiple implementations and a spec.

            1. 3

              At least in my opinion, it’s useful everywhere, but more so for comments, because it removes differences in implementations. Oftentimes the people using a static site generator are developers and can at least understand differences between implementations.

              That being said, I lost count of how many bugs at Bitbucket were filed against the markdown parser because the library used resolves differences by following what markdown.pl does. I still remember differences in bbcode parsing between different forums - moving to a better standard format like markdown has been a step in the right direction… I think CommonMark is the next step in the right direction.

            2. 1

              The point has already been brought up, but I just want to stress it again. You will probably have a feeling for how your markup parser works anyway, and you will write accordingly. If your parser is CommonMark compliant, that’s nice, but it really isn’t the crucial point.

              On the other hand, especially if one likes to write longer comments and uses a bit more than the basic markdown constructs on websites, having a standard to rely on does seem to offer an advantage, since you don’t necessarily know what parser is running in the background. And if you don’t really use markdown, it doesn’t harm you after all.

    2. 7

      This post is not satire. The internet has driven me so far toward cynicism that I expected a satirical piece about how terrible it all is. Instead, it was a pretty nice overview. I’m commenting here so that anyone who primarily reads through email, and might not see every comment, sees this one and doesn’t make the same mistake I did.

    3. 3

      Another technique I’ve wanted to explore, but haven’t yet, is property-based testing. As far as I understand, it’s related to and complementary to fuzzing.

      One of the best resources I’ve found for property-based testing is this. Its target audience is just-barely-not-beginners, and it really helps get over the hump of “What properties do I even write?”
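
      To make “What properties do I even write?” a bit more concrete, here is a minimal sketch of a round-trip property, hand-rolled with Python’s standard library rather than a dedicated property-testing library. The alphabet and the names random_text and check_roundtrip are made up for illustration; the property checked is that unescaping an HTML-escaped string gives back the original.

      ```python
      import html
      import random

      # Hypothetical alphabet: characters that are likely to stress HTML escaping.
      ALPHABET = "ab &<>\"';#x27"


      def random_text(rng, max_len=20):
          """Generate one random input string drawn from ALPHABET."""
          length = rng.randint(0, max_len)
          return "".join(rng.choice(ALPHABET) for _ in range(length))


      def check_roundtrip(trials=1000, seed=0):
          """Check two properties of html.escape on many random inputs."""
          rng = random.Random(seed)  # seeded so that a failure is reproducible
          for _ in range(trials):
              text = random_text(rng)
              escaped = html.escape(text)
              # Property 1: no raw markup characters survive escaping.
              assert "<" not in escaped and ">" not in escaped and '"' not in escaped
              # Property 2: the round trip is lossless.
              assert html.unescape(escaped) == text, repr(text)
          return trials


      if __name__ == "__main__":
          print("checked", check_roundtrip(), "random inputs")
      ```

      A real property-testing library would also shrink a failing input to a minimal counterexample, which this sketch does not attempt.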

    4. 3

      Nitpick:

      <p>"Oil"</p> and <p>&quot;Oil&quot;</p>. The former might be valid HTML, but the latter is better. (The former is also not valid XML.)

      The former is perfectly valid[1] XML; there’s nothing wrong with " outside tags.

      [1]: More correctly, it’s well-formed. “Valid” only has meaning against a specified DTD schema, which is absent here. But if we assume it’s HTML, then it’s also a valid HTML fragment.
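
      A quick way to check the well-formedness claim is to hand both spellings to an XML parser. The sketch below uses Python’s standard xml.etree.ElementTree; the variable names are only illustrative.

      ```python
      import xml.etree.ElementTree as ET

      # Both fragments are well-formed XML: a bare " is allowed in text content.
      literal = ET.fromstring('<p>"Oil"</p>')
      escaped = ET.fromstring('<p>&quot;Oil&quot;</p>')

      # After parsing, the two spellings carry identical text.
      print(literal.text == escaped.text)

      # Inside a double-quoted attribute value, however, the quote must be escaped.
      try:
          ET.fromstring('<p title=""Oil""/>')
          attribute_ok = True
      except ET.ParseError:
          attribute_ok = False
      print(attribute_ok)
      ```

      So escaping the quote only becomes mandatory in attribute values that are delimited by the same quote character.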

      1. 3

        Yes, you’re right, I made a correction:

        http://www.oilshell.org/blog/2018/02/14.html#toc_2

    5. 2

      I wanted to parse <h1>, <h2>, … headers in the HTML output in order to generate a table of contents, like the one at the top of this post.

      Since I couldn’t find your script, here’s the one I’m using on my homepage. This perl script expects headings in the style <h2 id=foo>.

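      #!/usr/bin/env perl
      # Build a nested table of contents from <h2 id=...> and <h3 id=...> headings.
      use strict;
      use warnings;
      
      print "<ul>\n";
      my $depth = 2;	# heading level of the previous TOC entry
      while(<>) {
      	if (/<h([2-3]) +id=["']?(.+?)["']?>(.+?)<\/h\1>/) {
      		# Open a nested list when going deeper, close it when coming back up.
      		print " "x($1-2)."<ul style=\"padding:0 0 0 1em\">\n" if ($1 > $depth);
      		print " "x($1-2)."</ul>\n" if ($1 < $depth);
      		print " "x($1-2)."<li><a href=\"#$2\">$3</a>\n";
      		$depth = $1;
      	} elsif (/<h([2-3])>/) {
      		# A heading without an id cannot be linked; warn with the input line number.
      		print STDERR "! No id attribute on line $.: $_";
      	}
      }
      # Close a nested list left open when the last heading was an <h3>.
      print "</ul>\n" if ($depth > 2);
      print "</ul>\n";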
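
      For readers who prefer Python, the same approach can be sketched with the standard library’s html.parser instead of a regex. This is only an illustrative sketch, not the script from the post: the names HeadingCollector and build_toc are made up here, and it assumes, like the Perl version, that only <h2> and <h3> headings carrying an id attribute should be listed.

      ```python
      from html import escape
      from html.parser import HTMLParser


      class HeadingCollector(HTMLParser):
          """Collect (level, id, text) for every <h2>/<h3> that carries an id."""

          def __init__(self):
              super().__init__()
              self.headings = []    # finished (level, id, text) tuples
              self._current = None  # [level, id, text parts] while inside a heading

          def handle_starttag(self, tag, attrs):
              if tag in ("h2", "h3"):
                  anchor = dict(attrs).get("id")
                  if anchor:
                      self._current = [int(tag[1]), anchor, []]

          def handle_data(self, data):
              # Text inside inline tags such as <code> is kept; the tags are dropped.
              if self._current is not None:
                  self._current[2].append(data)

          def handle_endtag(self, tag):
              if self._current is not None and tag == "h%d" % self._current[0]:
                  level, anchor, parts = self._current
                  self.headings.append((level, anchor, "".join(parts).strip()))
                  self._current = None


      def build_toc(html_text):
          """Return a nested <ul> table of contents as a string."""
          collector = HeadingCollector()
          collector.feed(html_text)
          collector.close()
          lines = ["<ul>"]
          depth = 2
          for level, anchor, text in collector.headings:
              if level > depth:
                  # </li> is optional in HTML, so this list nests in the open <li>.
                  lines.append("<ul>")
              elif level < depth:
                  lines.append("</ul>")
              lines.append('<li><a href="#%s">%s</a>' % (escape(anchor), escape(text)))
              depth = level
          if depth > 2:
              lines.append("</ul>")  # close a nested list left open at the end
          lines.append("</ul>")
          return "\n".join(lines)


      if __name__ == "__main__":
          print(build_toc('<h2 id="intro">Intro</h2><h3 id="why">Why</h3>'))
      ```

      Using a parser rather than a regex means the id attribute no longer has to come first or sit on the same line as the closing tag.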
      
    6. -7

      “CommonMark compliant” … lol yeah, forget it.

      it’s a fake “standard.” They didn’t think of Markdown, invent it, or do anything to help its advance.

      No one needs to “comply” with “CommonMark.” The CommonMark project has been trying to take ownership of Markdown for years. It’s ridiculous and annoying.

      1. 8

        They didn’t think of Markdown, invent it, or do anything to help its advance.

        You say “it”, when one of the reasons CommonMark exists, is that there is no single “it”, there are divergent implementations/extensions. The original implementation is not a specification, and it ‘has bugs’/is ambiguous.

        I am not too well-informed; who is better suited to “do anything to help its advance”? And what did CM do wrong?

        1. 5

          I presume what leeflannery is referring to is that John Gruber is the BDFL of Markdown and the only authority. Only he’s more like the Absent Dictator for Life, which resulted in a proliferation of implementations that sometimes conflicted with each other and with Markdown, and led Jeff Atwood (of StackOverflow) to establish CM.

          Atwood’s story is here - https://blog.codinghorror.com/standard-markdown-is-now-common-markdown/

          And here’s the Github issue with some more detail - https://github.com/commonmark/CommonMark/issues/19

      2. 6

        it’s a fake “standard.” They didn’t think of Markdown, invent it, or do anything to help its advance.

        Every standard is fake, if you want to be pedantic about it. No C compiler has to comply with the ANSI standard, no web server has to implement the HTTP spec. Nor are these standards set in stone (good ones at least), and they develop with trends, new needs and realization of previous shortcomings.

        What the CommonMark project wants to achieve isn’t to make up some unrelated markup language, or to feel special, but to “propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations”. And as I’ve already mentioned in this thread, these people aren’t nobodies; it was initiated by some of the more major figures in the “markdown scene”. Sure, they don’t “own” markdown (whatever that is supposed to mean), but they are proposing a common ground to strongly define the syntax and the semantics of a markdown parser, and have already published revisions updating their specification.

        If it’s a good standard, people will adopt it when doing something related to markdown, otherwise they won’t. This doesn’t look like something “stupid”, if you were to ask me, but rather an incentive to create a well-defined, commonsensical, sane specification, to improve the current state of markdown – and if one doesn’t like it, there’s absolutely no need to worry about it or pay any attention whatsoever to the project.