1. 45
  1.  

  2. 14

    I can’t tell what problem this is supposed to be solving. The problem statement is that troff markup is good and minimal. But then instead of just writing a five line script to turn troff style markup into HTML, it talks about how you can use the quirks of HTML to write more minimal syntax. Okay, I guess? But it still sucks compared to troff and you can’t use it for anything but hobby sites without driving yourself crazy. As it is Markdown has been implemented a bazillion times and if you just want to use Markdown, you can find a tool to use it.

    1. 7

      I was afraid it would go that way. It would be mostly a toy since troff is obscure and less expressive than html, it would indeed have no upside compared to markdown.

      However, this approach is intriguing because it gives you the full power of html (and the elements like details/summary, definition lists, proper tables, citations, etc.) in a reasonably writable way. To do the same with markdown you need to inject a lot of html elements in it. I think this is a valid approach and I found the article surprisingly pleasant.

      1. 4

        What’s wrong with including HTML in Markdown? I do it all the time when I want tables etc.

        Edit: Markdown shines when including links in running text. The affordance to include a link in the form of [text][slug], where ‘slug’ is defined somewhere else in the document, is something that makes organizing a mass of links really easy.
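
        For readers who haven't used reference-style links, here is a minimal sketch of how a [text][slug] pair might be resolved against its definition. The sample text, the regular expressions, and the resolve_reference_links helper are all assumptions made up for illustration; real Markdown implementations handle many more cases (titles, escapes, nested brackets).

```python
import re

# Hypothetical sample document: two reference-style links plus their definitions.
SOURCE = """\
See the [HTML spec][spec] and the [validator][nu] for details.

[spec]: https://html.spec.whatwg.org/
[nu]: https://validator.w3.org/nu/
"""

# A definition line looks like "[slug]: url"; a reference looks like "[text][slug]".
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$", re.MULTILINE)
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def resolve_reference_links(text):
    """Replace [text][slug] with an <a> element, using the "[slug]: url" lines."""
    targets = {slug.lower(): url for slug, url in DEFINITION.findall(text)}
    body = DEFINITION.sub("", text).strip()

    def to_anchor(match):
        label, slug = match.group(1), match.group(2).lower()
        url = targets.get(slug)
        # Leave unknown slugs untouched rather than guessing a target.
        return f'<a href="{url}">{label}</a>' if url else match.group(0)

    return REFERENCE.sub(to_anchor, body)


RESULT = resolve_reference_links(SOURCE)
```

        The running text stays readable because the URLs live in one place at the bottom, which is the organizing benefit described above.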

        1. 1

          I prefer markdown tables to html tables. I’ve used html for svg with png fallback, and I’m looking at it for pseudocode with italics.

          1. 1

            To be honest I’ve never bothered to learn MD tables, and whenever I have anything more complex than a couple of rows I use a program to generate the HTML…

        2. 1

          Troff can do more than straight markdown, has less markup than HTML, and is more consistent than mixing markdown and HTML. So those are upsides.

        3. 6

          I’m not sure it’s fair to call these “quirks of HTML”. They are explicitly part of the standard, and are just as correct as any other valid HTML.

          1. 3

            The WHATWG spec makes a clear distinction between HTML one is supposed to write and HTML that should be handled by parsers, so it does pass judgment on certain constructs.

        4. 14

          Thanks to Aaron for posting this. Such a great reminder.

          Anyone interested in this subject, check out a series of three very tiny books called “UPGRADE YOUR HTML” by Jens Oliver Meiert.

          They give great step-by-step examples for eliminating optional tags and attributes, reducing HTML to its cleanest simplest valid form. The author is a super-expert in this specific subject, working with Google and W3C on this. His bio here: https://meiert.com/en/biography/

          From LeanPub: https://leanpub.com/b/upgrade-your-html-123

          From Amazon: https://www.amazon.com/gp/product/B08NP4GXY2/

          1. 2

            Oh, wow! Thank you for sharing these books. They look like they’d be right up my alley.

          2. 10

            If more people understood HTML, they would stop using Markdown incorrectly, such as using blockquotes for callouts and admonitions instead of for quoting a body of work, or figures + figcaptions, definition lists instead of unordered lists with heading elements. Ironic that developers tease people for using WYSIWYG editors improperly (like <b> for headings and excessive <br> tags and &nbsp;) and then misuse their lightweight markup syntaxes.

            1. 23

              If I could give young developers one piece of advice, it would be this:

              Reading the manual or the standard or the RFC or whatever the definitive documentation is for the tools that you use is a super-power.

              1. 4

                As soon as markdown provides me with callout and admonition formats, (or even a reasonable standard way of defining them myself) I’ll be happy to fix that.

                1. 3

                  Pretty sure markdown is a superset of HTML, but I don’t see how HTML would solve that problem? Is there an <admonition> tag?

                  1. 2

                    We had something similar way back, but sadly the “blink” element has since been deprecated…

                    1. 2

                      Ah, fond memories of <marquee> :’D

                    2. 2

                      Even if it did, escaping to HTML isn’t my idea of a reasonable way – if I wanted HTML I’d be using it, not markdown. It’s a usable kludge/escape hatch when you need it, but html in markdown isn’t a solution.

                      1. 6

                        The original Markdown allowed HTML because the point of the original Markdown was to eliminate the tedious bits of HTML when authoring blog posts, not to replace HTML entirely.

                        1. 3

                          I know why it’s there… my point was that if the goal is to eliminate said tedium, and there’s a markdown format code that generates the output I want without it, why would I go back to html for some purported semantic purity.

                          Aside from that, the current reality of markdown has strayed considerably from that original point, and now many uses of it no longer support html at all or significantly restrict it.

                          1. 1

                            To each their own I guess. I have my own markup language I use for blogging [1] and it has support for generating tables (some example tables). But it doesn’t support all the semantics of HTML tables, and when I need those (like this example) the ability to drop down into actual HTML is nice. Did I go back to HTML for some purported semantic purity? I don’t think so, but you may think differently. And would I go to the trouble to try to support all of HTML’s table semantics in my markup language? No way. For me, it’s rare enough that I stray from the general table format (of the first example) that supporting other formats would be a waste of time—the occasional “drop down to HTML markup” doesn’t bother me that much.

                            [1] I don’t store the post entries in this format, but in the final HTML output, mainly because I want the ability to change the markup language if some aspect annoys me. I’ve already made multiple changes. I’m also not forcing others to use this format, since it’s really geared towards my own usage.

                            1. 2

                              Did I go back to HTML for some purported semantic purity? I don’t think so, but you may think differently

                              Sorry, didn’t mean to imply that’s what you were doing… rather just referring to my original point upthread: I see no issue at all using markdown blockquotes when they give me the formatting I want even though an html aside might be “more correct” but at the expense of having to write all the html and css to achieve the same thing.

                              1. 1

                                I love that you support <abbr> where most other tools don’t

                                1. 2

                                  Thank you. I still don’t know how to handle the following case [1]: “While the IRA may take actions against US interests that would effect Alice’s (a member of the IRA) IRA, can an automated process work out which expansion of IRA [2] should be used for each instance?” But so far, that’s been an extreme outlier case and rarely comes up.

                                  [1] From http://boston.conman.org/2003/11/19.2

                                  [2] For the record: Irish Republican Army, International Reading Association, Individual Retirement Account.

                      2. 1

                        Psst, you can use AsciiDoc most places you’d use Markdown and get that ability plus extra benefits.

                        1. 1

                          Yeah, AsciiDoc and ReST were both better options than markdown, imho… but I eventually gave in to the overwhelming adoption markdown got.

                          1. 2

                            I mean, in your own projects you can undermine the hegemony with said better options.

                      3. 2

                        such as using blockquotes for callouts and admonitions

                        Is this a correct use of blockquote? If not, why? And what’s the proper alternative?

                        1. 4

                          “aside” might be an appropriate element to use, if “callout” means something like what I’d call a pull-quote. If the text flows with its surroundings, maybe just a class on a regular paragraph would be sufficient.

                        2. 1

                          Huh! I might look into using figures and figcaption in my markdown; I’m already using html anyway for svg with png fallback. Thanks!

                        3. 7

                          I remember learning how lean HTML can be from the diveintohtml5 book. One thing missing from this article is the famous <meta charset="utf-8"> which even works on IE5.

                          1. 3

                            That’s a good point! It’s important to specify UTF-8 if that’s what you’re using (and it’s usually what I use). It’s also compliant to indicate it with a byte-order-mark at the beginning of the document or with an appropriate content-type header.

                            The character encoding need not be specified if you’re happy with the default, but I expect nowadays most authors will want to use UTF-8.

                            1. 2

                              Why do you need a byte-order mark with UTF-8?

                              1. 3

                                “Byte-order mark” is a legacy name, dating back to the days when there were only two possible Unicode encodings (UTF-16BE and UTF-16LE). However, it’s a codepoint that’s represented differently in every standard Unicode encoding, does not represent a valid Unicode character, and its various Unicode encodings are extremely unlikely to appear in any non-Unicode encoding. That means it’s very useful as an encoding declaration, even for encodings like UTF-8 with only one legal byte ordering.
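
                                To make the "represented differently in every encoding" point concrete, here is a small sketch using only the Python standard library. The sniff_encoding helper and the "unknown" default are hypothetical names for illustration, and UTF-32 is left out to keep the table short.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the code point used as the byte-order mark

# The same code point serializes to a different byte sequence in each encoding.
SIGNATURES = {
    "utf-8": BOM.encode("utf-8"),
    "utf-16-be": BOM.encode("utf-16-be"),
    "utf-16-le": BOM.encode("utf-16-le"),
}

# Sanity check against the constant the standard library ships.
assert SIGNATURES["utf-8"] == codecs.BOM_UTF8


def sniff_encoding(data, default="unknown"):
    """Guess an encoding from a leading BOM (hypothetical helper, not a full sniffer)."""
    for name, signature in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return default
```

                                A document that starts with the bytes EF BB BF is therefore announcing UTF-8, even though UTF-8 has no byte order to mark.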

                          2. 6

                            This is fine for simple things that won’t be scraped but if you’re building something that might be scraped, please, from someone who spent years writing scrapers and crawlers, write standards-compliant, validating HTML5. It’s easier to introduce syntax errors and other problems when doing shorthand stuff. If you want to be lazy, consider a preprocessor like HAML or Jade that can emit good HTML that satisfies human eyes, browsers, and scrapers alike.

                            1. 19

                              This is standards-compliant, valid HTML. That’s part of what’s so great about it :)

                              Check it out: https://validator.w3.org/nu/?doc=http%3A%2F%2Flofi.limo%2Fblog%2Fwrite-html-right

                              1. 9

                                Validators and most browsers will certainly handle this correctly, since it is, after all, valid. But I wouldn’t be surprised if most other HTML parsers will (incorrectly) not handle it.

                                Oh, did you know there’s an even better shoelace knot than tying the bow as a square knot? It’s more secure, and much harder to mess up.

                                1. 3

                                  Oh no! I just got the hang of tying them this way… but thank you!

                                  I think my coworkers thought I was joking, but I’ve often commented that shoes and socks that don’t let you down are more important than we realize for leading a happy life.

                                  1. 6

                                    Sam Vimes would agree 100%.

                              2. 14

                                I’ve written my fair share of scrapers as well, and my feeling is as a scraper author it’s your responsibility to handle the content you’re ingesting. Generally when scraping, your gain >>> their gain (if their gain is even positive, which it often is not), so asking them to do extra work to make your scraping effort easier feels unfair.

                                Also, a standards-compliant parsing library should handle this fine. For example, bs4 does:

                                >>> from bs4 import BeautifulSoup
                                >>> BeautifulSoup(...) # snippet from article with unclosed <p>, etc.
                                <!DOCTYPE html>
                                <html><head><title>Building a Streaming Music Service with Phoenix and Elixir</title>
                                </head><body><h1>Building a Streaming Music Service with Phoenix and Elixir</h1>
                                <p>
                                I thought it would be nice to make a streaming music service focused on
                                bringing lo-fi artists and listeners together.
                                Early on, I built a series of prototypes to explore with a small group of
                                listeners and artists.
                                Since this is a technical article, I'll jump right into the requirements we
                                arrived at, though I'd love to also write an article on the strategies
                                and principles that guided our exploration.
                                
                                </p><h2>Requirements</h2>
                                <p>
                                We liked a loose retro-computing aesthetic with a looping background that
                                changed from time to time.
                                We preferred having every listener hear the same song and see the same
                                background at the same time.
                                And we liked the idea of sprinkling some "bumpers" or other DJ announcements
                                between the songs.</p></body></html>
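
                                If bs4 isn't available, the rule at work can still be sketched with the standard library. The following is a toy, not a spec-compliant HTML5 parser: html.parser only tokenizes, so the implied-end-tag logic here is hand-written, CLOSES_P and CONTAINER_ENDS are assumed subsets of the lists in the HTML standard, and attributes, comments, and the doctype are dropped.

```python
from html.parser import HTMLParser

# Start tags that implicitly close an open <p>. An assumed subset of the
# list in the HTML standard, enough for the snippet below.
CLOSES_P = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "table", "div"}
# End tags of containers that also close a <p> left open inside them (assumed subset).
CONTAINER_ENDS = {"body", "html", "div", "section", "article"}


class ImpliedParagraphCloser(HTMLParser):
    """Re-emit markup with the omitted </p> end tags written out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.p_open = False

    def _close_p(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P:
            self._close_p()
        if tag == "p":
            self.p_open = True
        self.out.append(f"<{tag}>")  # attributes dropped to keep the sketch short

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_p()
            return
        if tag in CONTAINER_ENDS:
            self._close_p()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flushes any buffered text first
        self._close_p()  # end of input also closes a dangling <p>


def explicit_paragraphs(markup):
    parser = ImpliedParagraphCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


LEAN = "<h1>Title</h1><p>First paragraph.<h2>Requirements</h2><p>Second paragraph."
EXPLICIT = explicit_paragraphs(LEAN)
```

                                Here EXPLICIT comes out as <h1>Title</h1><p>First paragraph.</p><h2>Requirements</h2><p>Second paragraph.</p>, the same paragraph boundaries bs4 found above.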
                                
                                1. 1

                                  This is a great demonstration of the evolution of communal tooling over time. BeautifulSoup was unavailable to me in the environment I was using at the time (2009-2013; we did spectacularly questionable amazing things with mostly just XSL 1.0). I speculate that BS was not quite so robust back then, either. For really complex stuff, we could sub out to Selenium but it was very expensive to our crawling timelines to do so or we could switch to a JVM stack with an extended project scope and cost. I had the privilege of asking some government website authors things like, “Could you fix this one broken tag?” with a 15-30 day wait and it was done, saving the taxpayer some money.

                                2. 3

                                  Yeah, I wanted my site to be really easy to scrape. Even Bing (powers DuckDuckGo, Ecosia, Yahoo, and most other alternative engines) gets tripped up when you eliminate optional tags.

                                  I ended up going the other way by writing well-formed polyglot XHTML5/HTML5 markup and validating all my pages with both xmllint and the Nu HTML Checker before each push.
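
                                  As a rough illustration of what the XML half of that check buys you, here is a sketch that uses Python's built-in XML parser as a stand-in for xmllint. The two sample documents and the is_well_formed_xml helper are invented for this example; this only tests well-formedness, not HTML validity, which is what the Nu HTML Checker covers.

```python
import xml.etree.ElementTree as ET

# Polyglot style: every element is explicitly opened and closed.
POLYGLOT = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body>"
    "</html>"
)

# Lean style: valid HTML, but the optional tags are omitted.
LEAN = "<title>Demo</title><p>First paragraph.<p>Second paragraph."


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True
```

                                  The polyglot document passes and the lean one does not, which is why a plain XML toolchain can consume the former but needs a real HTML parser for the latter.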

                                3. 6

                                  troff authors usually started each sentence on a new line — a practice that made it easier to wrangle text on ancient paper terminals with ed

                                  Interesting perspective. I think this practice is underappreciated in prose: I find that when I write prose line by line, like source code, or like a poem, it’s both easier to read and wrangle as I wrangle my thoughts.

                                  It also produces less conflicting diffs for collaboration purposes. You would think this would make sense on Wikipedia, but too often, someone that doesn’t get it comes along and “cleans” it up.
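
                                  A quick way to see the diff effect is to reword one sentence and compare how much text the diff touches in each layout. This is only a sketch: the paragraph is made up, and changed_characters is a hypothetical helper that counts characters on the added and removed lines of a unified diff.

```python
import difflib

# Hypothetical paragraph, written one sentence per line.
BEFORE = [
    "Troff authors usually started each sentence on a new line.",
    "This made text easier to edit with line-oriented tools.",
    "It also keeps diffs small.",
]
# The same paragraph after rewording only the second sentence.
AFTER = [
    BEFORE[0],
    "This made text easier to wrangle with line-oriented tools.",
    BEFORE[2],
]


def changed_characters(old, new):
    """Total length of the lines a unified diff marks as added or removed."""
    total = 0
    for line in difflib.unified_diff(old, new, lineterm=""):
        if line.startswith(("+++", "---", "@@")):
            continue  # file and hunk headers, not content
        if line.startswith(("+", "-")):
            total += len(line) - 1  # drop the +/- marker itself
    return total


PER_SENTENCE = changed_characters(BEFORE, AFTER)
# Joined into one long line, the same edit touches the whole paragraph.
ONE_LINE = changed_characters([" ".join(BEFORE)], [" ".join(AFTER)])
```

                                  PER_SENTENCE only covers the reworded sentence, while ONE_LINE covers the entire paragraph twice, so two people editing different sentences of the single-line version are far more likely to collide.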

                                  1. 5

                                    https://sembr.org/ is the link I usually send to people who ask me “is this essay supposed to be a poem? Why is it broken up like that?”

                                    1. 2

                                      Thanks, TIL.

                                      After having seen Aaron’s light HTML, I was actually surprised when I saw HTML missing on the list of «light markup languages that support semantic line breaks» – it does support semantic line breaks, so it’s not that… Then, I saw the “light”: HTML can be a light markup language, you just need to see it first.

                                      1. 1

                                        TIL, this is great. I’ve advocated for one-line-per-sentence for years but hadn’t a great name for it. Semantic Line Breaks is awesome.

                                      2. 2

                                        It also produces less conflicting diffs for collaboration purposes

                                        This is the main reason that I write like this. The other is that it’s much easier to move sentences around if it’s just cut line, paste line, rather than cut-range-in-line, paste-at-point-in-line.

                                        1. 2

                                          In one of my talks, about a Pandoc-centric document workflow, I implore writers to use one-line-per-sentence because it makes review in GitHub, GitLab, Gerrit, etc. much easier.

                                        2. 6

                                          This kind of tag-soup rubs me the wrong way. The only reason it’s valid is to grandfather the messed-up HTML that got written in the early days when people learned HTML on the street, and browsers would try to parse almost anything. This is literally the way I learned to write HTML in 1994 before I learned about the actual DOM structure.

                                          The result is sort of like a hypothetical lenient C compiler that will let you leave out close braces and parens and just do its best to guess what you meant. It saves a few keystrokes but it becomes very easy to make accidental mistakes that are hard to catch.

                                          And HTML like this in the wild is why scraping is so hard to do — you can’t just use an ordinary parser unless the page is XHTML; instead you need something that knows about all the shortcuts and will insert the necessary tags to make it valid. (libTidy is the usual tool for this.)

                                          1. 2

                                            I’ve heard this rationale for tag omission before, but I’m not sure it’s quite right. I was just a punk kid at the time all this went down, so maybe someone who was on the mailing lists at the time will stop in to correct me… but…

                                            I think it was always deliberate and meant to make documents easy to author. I found in the HTML+ spec from 1993 that the end tag for “p” elements was already optional at that time. I didn’t find mention in TBL’s “About HTML” of end-tags for “p” elements at all, optional or otherwise!

                                            https://www.w3.org/MarkUp/HTMLPlus/htmlplus_11.html http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html

                                            1. 3

                                              I think it was always deliberate and meant to make documents easy to author.

                                              It’s worth noting that there weren’t any authoring tools early on. Of course they’re going to make it easy to write by hand, because everyone is doing that.

                                              I have to wonder if perhaps the W3C push for XHTML was about 10 years too soon. In the early 00s, there was a lot of tooling for writing HTML. But the lightweight markup languages hadn’t quite taken off yet. I think markdown was created in 2005, and I first heard about it in like 2007 or 2008. You still had a sizeable contingent of folk writing HTML by hand at the time.

                                            2. 2

                                              Luckily this isn’t tag soup that needs a quirks mode tidying, but a valid structure with well-specified parsing rules. If you implement the html5 parsing algorithm as per spec, it will handle this fine. If you don’t, your parser is broken

                                            3. 6

                                              This is great! I’m already fully in agreement with the philosophy for my site http://catern.com/ but I didn’t know some of the tricks about when you could validly omit tags like <html> and <head>! I’m going to start using that right away!

                                              And, it’s obvious in retrospect, but writing tables one-row-per-paragraph makes perfect sense. Previously I’ve been using org-mode radio tables (complicated Emacs stuff) to write my tables, but maybe now I can do away with that!

                                              1. 1

                                                Nice site! I’m delighted I could show you something new :)

                                              2. 4

                                                In the 1990s and early 2000s the focus was on making HTML stricter so it was easier to parse. Then somewhere along the line it seems to have switched and headed for being hard to parse but taking fewer bytes. Was it just Google trying to cut down on their costs or was there some justification that I missed?

                                                Edit: this seems to explain the justification. Which makes sense now that I see it.

                                                1. 3

                                                  My hypothesis is that, in an alternative reality where almost all HTML was generated by machines, we’d have far fewer quirks to deal with in parsing and rendering. We wouldn’t have had to standardize this kind of mis-nesting and terse writing after the fact, and as a result we’d have more browser and parser diversity.

                                                  Writing a correct html parser nowadays is much harder than it should be. Html parsing alone is unlikely to matter for the diversity in web browsers, and writing html by hand is a rare art nowadays as well, but even then, this blogpost leaves me with a sour taste in my mouth.

                                                  1. 3

                                                    I have always been a little stingy with HTML indents. I started by picking where I indented intuitively but eventually realized that I omitted indents that don’t confer any useful information. For instance, indenting the entire contents of the html or body tags just wastes a tab’s worth of important screen space for nothing. If I am nesting blocks, each level gets indented for sure. The two sentences inside a paragraph tag? No indent. While some of the suggestions in this article are nails-on-chalkboard for me, I will say that the indenting being shamed in Fig. 2 just curves my spine.

                                                    Putting each sentence on its own line makes my grep tingle a little. That is interesting.

                                                    1. 3

                                                      you may be interested in ‘semantic line breaks’ https://sembr.org/ and ‘semantic linefeeds’ https://rhodesmill.org/brandon/2012/one-sentence-per-line/

                                                    2. 3

                                                      I’m a bit on the edge about this. I did learn a lot even if honestly I don’t write much HTML by hand these days.

                                                      It does look clean and challenges habits yet it relies a lot on knowing “obscure” details of HTML. It’s all implicit.

                                                      Nonetheless why not after all, it shows elegant HTML written with good knowledge of it.

                                                      1. 4

                                                        Thank you for your kind words! Maybe we can get the word out and make these details less obscure ;)

                                                      2. 1

                                                        Amazing tbh, I can’t wait to use this for my own site.