1. 42
  1.  

  2. 22

    The part about how XHTML failed to take root and how web devs were stupid/lazy for not wanting to produce syntactically valid XML feels a bit too hindsighty.

    It wasn’t until HTML5 that browsers actually agreed on how to deal with malformed HTML. The early HTML standard also played fast and loose, and constructs like

         <b>foo <i>bar</b> baz</i>
    

    were commonplace, as was the use of <p> as a paragraph break rather than a paragraph wrapper.
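To make the nesting problem concrete, here is a minimal sketch using only Python's standard library. The helper names `is_well_formed` and `tag_events` are made up for illustration. A strict XML parser rejects the overlapping tags outright, which is roughly what XHTML served as XML demanded, while a lenient HTML tokenizer simply reports the tags in the order it sees them and leaves the repair policy to whoever consumes them:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<b>foo <i>bar</b> baz</i>"


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Lenient tokenizer: records tags as seen and never checks nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.events.append("</%s>" % tag)


def tag_events(markup):
    """Feed markup to the lenient tokenizer and return the tag sequence."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events
```

With these helpers, `is_well_formed(SNIPPET)` is false while `tag_events(SNIPPET)` happily returns the four tags in source order. What a browser should build from that sequence is exactly the part that went unspecified for so long.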

    It’s easy now to say everything should’ve been an AST from the start, and throw shade at web people, but it was a much harder sell when there was way more legacy code and content around that simply wasn’t well formed.

    OP makes the point that custom escapers need to match what browsers do, but that assumes browsers even agree. In the heyday of IE6-8, it was totally common to use subtle differences in CSS parsing and interpretation to make e.g. a selector that targeted one browser version that would be inert on others.

    Yeah it was icky, but there was a reason string concatenation was the common practice. You had little other choice. And it wasn’t the fault of yolo’ing PHP devs, but Microsoft, one of the biggest software houses around.

    On top of this, XHTML made the fundamental mistake of trying to dictate a standard without offering any compelling reason to switch to it. All it added was busywork. HTML5 instead made the saner choice of not using a full blown XML wrapper, and was better for it, while offering actual improvements people wanted.

    1. 15

      While I agree that for most people the benefits of XHTML were not exactly exciting, I wouldn’t say they were nonexistent. Being able to process HTML with a standard XML parser, rather than having to find things that can parse HTML specifically, is a benefit. Being able to apply standard XML transformation tools is a benefit. Though I would argue the most significant benefit is namespacing, which is a feature of XHTML I use even today.

      The lack of namespacing in HTML has resulted in a lot of crude hacks, like the data- prefix, which I feel is pretty unfortunate. HTML5 ultimately adds SVG support to HTML simply by the fiat of saying that <svg> is SVG, but there’s not really any clean way to integrate new schemas. I think this actually significantly undermines many of the uses of HTML/XML; namespacing allows you to freely annotate documents however you like while keeping them still readable by existing browsers.

      Also, while ultimately there was this negative reaction to XHTML, honestly that came relatively late. I remember the heyday of alistapart.com and CSS Zen Garden where, frankly, there was a very positive ethos to web design. People cared about the idea of the semantic web, of using the semantics of the various HTML tags correctly, of not using tables for layout, of using CSS properly, and using XHTML. People would proudly put those W3C ‘valid XHTML’ badges on their site (remember those?). It wasn’t that uncommon for people to talk about how they were annoyed they couldn’t serve their pages as application/xhtml+xml yet because IE didn’t support it. So it seems to me there was actually a fair bit of enthusiasm and good faith engagement at first. To my recollection, the ‘rejection’ of XHTML really came with the HTML5 movement, which above all else seemed quite focused on the web as an application platform and less as the semantic hypertext platform envisaged by the W3C.

      1. 19

        XHTML was a mess, and I think you have rose-tinted glasses/nostalgia looking back on it. I was there, active in web standards communities in the early to mid 2000s, and I recall it very differently.

        Evan Goer’s “XHTML 100” experiment is still online and is a stark reminder of what things really were like back then. For those unfamiliar: in 2003, he picked a good-sized sample of websites run by “alpha geek” web designers/developers, people who advocated for and presumably understood the state of the art at the time. And he ran three simple tests on their sites: (1) does the home page validate as XHTML; (2) do three secondary pages validate as XHTML; (3) does the site serve the XHTML content-type to user agents which accept it? 74% of sites – and remember, these are the personal sites of “alpha geek” folks who are supposed to know what they’re doing! – failed the first test. Only one, out of 119 in the starting sample, passed all three tests.

        And that’s kind of where XHTML was for much of its early days; something that people were weirdly enthusiastic about and pushing heavily, despite largely not doing it correctly (and getting away with it because browsers were still rendering their sites with forgiving HTML parsers).

        Then people like Mark Pilgrim started pointing out deeper problems with XHTML – there are whole labyrinthine tangles of multi-spec interactions you can get into when you serve XHTML-as-XML-over-HTTP, like a sequence of bytes that is a well-formed XHTML document if served as application/xhtml+xml but not if served as text/xml, or issues of validating versus non-validating parsers, or issues of which things are CDATA in HTML 4 but PCDATA in XHTML, or the different DOM APIs depending on whether a document parsed as HTML or XHTML… and all for what? For what benefit? The go-to demo was always MathML, but people found ways to do math without bringing on the pain of going full XML.

        The thing that killed XHTML, finally, was W3C’s architecture astronautics around updates; XHTML 1.1 already went in that direction, and the ultimate failure of XHTML 2 just kind of sealed it. They had become, effectively, completely disconnected from both browser vendors and web authors, and were off in their own little bubble doing stuff that they thought was elegant, and everybody else got tired of saying “no, what we need is something that’s practical”, and went off and formed WHATWG and the rest is history.

        1. 5
          • Having to deal with long, convoluted namespaces and other strict XML aspects was terrible. And there were multiple doctypes as i recall, strict vs transitional, which was a bad idea. And the idea that you can just embed foreign content into a page is by itself a bit absurd. It’s pointless unless it’s supported by every client in use.

          • Semantic HTML was, sorry to say it, mostly a cult. And the list-apart people were part of that. They came up with absurd css hacks just to avoid having to add a wrapper div or span out of a misguided sense of purity. In HTML5 it’s really no better: you should look up what the guidance for e.g. the article-tag is. It basically guarantees nobody is ever going to do anything useful by looking for that tag.

          • XML tooling is awful. Nobody sane uses XSLT, and everyone switched their APIs to JSON the minute it became commonly supported.

        2. 7

          It wasn’t until HTML5 that browsers actually agreed on how to deal with malformed HTML.

          If you construct a DOM (or create valid XML another way) on the server, browser mistakes while processing invalid documents do not matter.
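As a rough illustration of building the tree on the server, here is a small sketch with Python's `xml.etree.ElementTree`. The function name `render_comment` and the element layout are assumptions chosen for the example, and the output is XML serialization rather than HTML5 syntax. Untrusted values are stored as node data, so the serializer escapes them and the result is well-formed by construction:

```python
import xml.etree.ElementTree as ET


def render_comment(author, body):
    """Build a small tree and serialize it; untrusted text is stored as data."""
    article = ET.Element("article")
    heading = ET.SubElement(article, "h2")
    heading.text = author
    # Attribute values are data too, so quotes in them are escaped on output.
    paragraph = ET.SubElement(article, "p", {"title": author})
    paragraph.text = body
    return ET.tostring(article, encoding="unicode")
```

Because no markup is ever assembled by hand, a body such as `<script>alert(1)</script>` comes out as escaped text and cannot unbalance the surrounding tags.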

          The problem was in the organic growth of the web technologies – XHTML arrived too late, when everyone was already used to producing garbage („it will be displayed somehow anyways“) and to consuming garbage („there are a lot of invalid documents on the web and I want to see them so I need a browser that displays random tag soups“). On the other hand, the organic growth was also one of the big opportunities of the web and the web took big advantage of it.

          P.S. What if we allow invalid ELF or other executables and invent some heuristics to run them approximately to how they were intended to run? We can find functions/calls with similar names or skip missing ones and it will run somehow. It would lead to existence of lot of garbage executable files, people will get used to it and will not bother to create valid binaries. And then it would be really difficult to fix this state.

          1. 5

            P.S. What if we allow invalid ELF or other executables and invent some heuristics to run them approximately to how they were intended to run? We can find functions/calls with similar names or skip missing ones and it will run somehow. It would lead to existence of lot of garbage executable files, people will get used to it and will not bother to create valid binaries. And then it would be really difficult to fix this state.

            That’s not super far away from a reasonable description of some of the compatibility hacks lurking in Windows. Those are API-level things as opposed to executable format things, but that kind of chicanery certainly occurred. The slope wasn’t as slippery as you suggest, but it certainly left a mess for quite some time.

            1. 1

              It would lead to existence of lot of garbage executable files, people will get used to it and will not bother to create valid binaries. And then it would be really difficult to fix this state.

              Correct. And XHTML was a misguided way to try and fix that.

            2. 5

              On top of this, XHTML made the fundamental mistake of trying to dictate a standard without offering any compelling reason to switch to it. All it added was busywork.

              XHTML was about XML data. It was about treating web documents as data, and potentially even evaluating them under other document/data semantics using schemas. All of our existing tools to work with XML and XML Schemas were thus enabled through this unification, and presumably we’d have had a world where we’re returning XML from our web APIs instead of JSON, and that XML could be data for someone to process with a consumer app, or could be a document to be marked up and rendered via a user agent. Instead we live in a world where we return HTML or JSON, rather than one unified form of data.

              1. 3

                It wasn’t until HTML5 that browsers actually agreed on how to deal with malformed HTML. The early HTML standard also played fast and loose, and constructs like

                <b>foo <i>bar</b> baz</i>

                That’s not what I recall. Browsers would (usually) render it correctly, but HTML 1.0 was a dialect of SGML, which was quite explicit that this was not permitted.

                1. 3

                  HTML was modeled on (or inspired by) SGML, but it has never been a dialect of it. In much the same way that some English grammarians like to claim that English is derived from Latin and so should imitate Latin habits, the W3C wrote into the HTML 2.0 standard (and perhaps earlier formal ones) that HTML was derived from SGML without it actually being so. Since early W3C standards were essentially documenting existing browser practice instead of defining new practices, this redefinition was as ineffective as you would expect. Browsers paid no attention to the claim that HTML was SGML based and so should support various SGML features or behave in (incompatible with existing practice) SGML-ish ways. HTML5, which was set up to document existing practices, finally put a stake in this by saying explicitly that HTML was a custom format merely inspired by SGML.

                  (You can read a polite version of this in the Parsing HTML documents section of the HTML(5) standard. I have some feelings about this whole area because I was there at the height of the XHTML wars, and not on the XHTML side.)

                  1. 1

                    I first learned HTML back in the 1.0 days when the img tag was not supported by all browsers and the example that you gave was in every HTML tutorial as an example of something that might work but was ill-formed. There were a lot more HTML parsers back when you could write one in a few hundred lines of code and they did not all do the consistent thing with that example.

              2. 17

                I’m a fan of templating languages that pre-compile the templates. This can give you the best of both worlds — you get syntax-checked safe AST-based templating, but the produced compiled template code can be reduced to banal fast string-printer.
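A toy sketch of that split, in Python and under loud assumptions: the `{{ name }}` placeholder syntax and the `compile_template` function are invented for this example and are not how askama, Pushup, or any real engine works. The compile step checks that the template is well-formed once, and the render function it returns is only a string printer that escapes values as it goes. The escaping here covers HTML text and quoted attribute values only, not URLs or scripts.

```python
import html
import re
import xml.etree.ElementTree as ET

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_template(source):
    """Check the template once, then return a plain string-printing function."""
    # Compile-time syntax check: with placeholders blanked out, the template
    # must be well-formed XML. Raises ET.ParseError otherwise.
    ET.fromstring(PLACEHOLDER.sub("", source))

    parts = []  # alternating ("lit", text) and ("var", name) entries
    pos = 0
    for match in PLACEHOLDER.finditer(source):
        parts.append(("lit", source[pos:match.start()]))
        parts.append(("var", match.group(1)))
        pos = match.end()
    parts.append(("lit", source[pos:]))

    def render(**values):
        out = []
        for kind, item in parts:
            if kind == "lit":
                out.append(item)
            else:
                # Every interpolated value is escaped; literals pass through.
                out.append(html.escape(str(values[item]), quote=True))
        return "".join(out)

    return render
```

For example, `compile_template('<p class="{{ cls }}">Hello, {{ name }}!</p>')` succeeds once at startup, while a template with mismatched tags fails before any request is served.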

                1. 4

                  Rust’s askama does that. It’s pretty cool!

                  1. 3

                    I designed Pushup to be exactly this. Pushup pages (.up files) have their own syntax (lightweight gluing Go code and HTML together) that is parsed and compiled to pure Go code. Each Pushup page is turned into a Go struct that implements an interface that is sort of like http.Handler from the stdlib, and the generated code in the relevant method is just a fancy printer of (safely escaped) values to output.

                  2. 6

                    The bit about JSX is a misleading mental model. JSX is not XML, and thinking of it as such would only confuse a developer when troubleshooting. JSX is a macro transform into a function call. The output is neither strings nor direct HTML objects, but a literal call to React.createElement(). The code in { } forms is not “embedded” in XML or some data structure, but is an escape into the literal code it will transform into. Thinking about things in any other way will only lead to confusion, as I’ve personally witnessed and had to help many jr devs work around what they thought would be valid JSX.

                    1. 1

                      Then why did they choose to make it look like XML but behave in quirky ways that don’t match XML or HTML—especially given how verbose and error-prone it is to write and noisy to read?

                      I remember when there was a toggle in the React docs to show the function side without JSX so folks didn’t need to resort to a compilation step to try it out.

                      1. 3

                        why did they choose to make it look like XML

                        XML is more regular than HTML, and thus significantly easier to parse. Also, it’s used for React Native, which has no relationship to HTML.

                        Personally, at least for personal projects, I prefer to avoid JSX and use something more Lispy like ijk instead. I find it much more preferable to work with actual data than some silly syntax sugar for function calls. With it, I can do my own macro style transforms on elements at runtime, which is a powerful approach. (Although admittedly, I would likely just end up using ClojureScript for anything serious.)

                    2. 4

                      Nice article. Note that the recent matchertext draft paper proposes a solution to the problem that plagues both approaches:

                      The only real limitation of the AST-based approach is that some strings in an AST may embed other languages. This is the same circumstance where the previously mentioned context-sensing autoescaping approaches can easily fall down.

                      https://lobste.rs/s/9ttq0x/matchertext_escape_route_from_language

                      Admittedly it has a slim chance of being adopted! But most commenters didn’t seem to get the idea, probably because you had to follow the blog post into a paper to see the details.

                      https://bford.info/pub/lang/matchertext/

                      The matchertext paper specifically addresses:

                      • URL in HTML (which the blog post does too)
                      • URL in URL

                      and a few others.


                      Good link to the Go work at the end:

                      https://rawgit.com/mikesamuel/sanitized-jquery-templates/trunk/safetemplate.html#problem_definition

                      I would be very interested in some further analysis of where Go’s html/template falls down.

                      It defines “context” as

                      A parser state in the combined HTML, CSS, and JavaScript grammar used to determine the stack of sanitization routines that need to be applied to any untrusted data interpolated at that point to preserve the security properties outlined here.

                      So presumably it does not handle autoescaping of URLs in HTML? Since it doesn’t mention that language.

                      I know that work is over 10 years old, and Go is very popular. So I’m surprised with all the Go rants that I haven’t seen people poke at its limitations?

                      Is it just better than everything else and that’s what we should use?

                      Why hasn’t it been adopted in other languages?

                      If it works reliably, then it seems better than the AST approach, which only solves “one level” of the problem as mentioned.

                      The other solution is to try to get rid of escaping entirely by adjusting all the text formats, like matchertext, although that is probably only deployable in the very long term. It’s not a near term solution.

                      1. 4

                        The Go approach is being revived as a HTML/JS spec draft called Trusted Types. Really fascinating stuff (both the original Safe Types from Mike Samuel et al as well as the new TT). Chrome already supports that draft spec and it seems to work for Google web properties. However, they didn’t manage to convince other browsers yet.

                        Meanwhile, I believe a HTML sanitizer might also help a lot. So I’m working on specifying that (prototype implementations exist in Chrome and Firefox).

                        1. 3

                          OK interesting! I’ll have to look up Trusted Types.

                          I remember ~10 years ago when Mike Samuel was going around adding auto-escaping to all of Google’s template engines – there were at least 3 or 4 in common use, and most of them weren’t open source. And then he added it to Go’s templates, which was new at the time.

                          But then I feel like I never heard about it again? Were there just lots of silent happy users?

                          It always interested me, though I guess I don’t know enough about the specifications of HTML, CSS, and JS to really judge it (and I’m not a Go user, so I didn’t get experience with it.) I think my issue is that it seemed to be “brute force of special cases”, though I would want to look at the code to understand it more.

                          I just read over the original paper again, and this part jumped out …

                          The last of the security properties that any auto-sanitization scheme should preserve is the property of least surprise. The authors do not know how to formalize this property.

                          Developer intuition is important. A developer (or code reviewer) familiar with HTML, CSS, and JavaScript; who knows that auto-sanitization is happening should be able to look at a template and correctly infer what happens to dynamic values without having to read a complex specification document.

                          I think my issue is that whenever I have more than 1 or 2 levels of escaping, I try to break it up into dynamic composition, not textual substitution. So I manually avoid complex escaping, but not everyone does.

                          I have been meaning to write a blog post about that, with examples in shell.

                          Still it’s an interesting problem that needs to be solved …

                      2. 3

                        The last bit reminds me of langsec.org

                        1. 2

                          Yeah — I was reminded of langsec.org while writing this, though it wasn’t what gave me the idea. But it’s definitely touching on the same concepts and ideas.

                        2. 3

                          I’ve been using my dom library for html templates for over a decade now and while I’ve never quite formalized it, I’m really happy with it. What I do is make a .html file which is parsed into a dom tree. If it is not well-formed, this throws an exception immediately, meaning I won’t accidentally have mismatched tags or other simple mistakes like that. (While browsers can randomly disagree on how to correct malformed html, if it is well formed, you do have a strong degree of compatibility and have for a long time). Then it goes through and does the contextual replacement.

                          I’ve changed the exact details of how replacements are done over the years. I’ve experimented with <span data-content-from="some_var"></span>, which would get the content added by the server and be kept into the final output; you could do a css rule to target [data-content-from] if you wanted. But I’ve felt that’s a bit wordy so lately I’ve been doing <%= some_var %> - and my dom library processes the whole <%..%> block as a node it can replace with the result of the expression. And yes, this is context-aware, so it can format as html or as json depending on if it is in a script or whatever.

                          There is one problem though: suppose you want <a href="thing/<%=id%>"> That’s not gonna be a dom node so i do fall back to string processing inside the individual processed attributes (and my comment says: “// I don't particularly like this” but it was something i could make work). But at least it is still aware that it is in an attribute and even knows which one it is. (And actually, truth be told, inside script tags it is mostly a string process too, since I didn’t parse the javascript, so it would be possible to break this in the current implementation but still it is better than nothing; at least you can’t change what tag you’re in so a script is still a script and a div is never a script.)

                          Then for loops and such, I again used to just require that be done in the programming language and you return a node, but lately I have played with <for-each over="arr" as="x">...</for-each> so the actual construct remains proper dom nodes. A bit awkward with <if-true cond="a &amp;&amp; b">… yes, <if> isn’t an allowed html extension tag, and the ampersands must be escaped in the source. I’m still not 100% happy with it but it really isn’t bad; it encourages me to refactor out those complex conditionals to helper functions which is often nicer code anyway.

                          Then the other nice thing is combining things, you can replace nodes with snippets from other files, and each file must parse on its own, and the resulting combination must also be well formed, all by definition of the ast node approach. So you might write a site skeleton with <html><main>to be filled in...</main></html>, view that file independently, then write a partial with <main>content</main> and have the template system combine them by substituting the main tag and … boom the combined total works with no possibility for mismatched tags, repeated ids.
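A rough Python approximation of that skeleton-plus-partial step, using `xml.etree.ElementTree` instead of the D library discussed here. The function name `fill_main` and the sample markup are assumptions for illustration. Each source must parse on its own, and the swap happens on nodes, so the combined output cannot end up with mismatched tags:

```python
import xml.etree.ElementTree as ET

SKELETON = "<html><body><header>Site</header><main>to be filled in...</main></body></html>"
PARTIAL = "<main><p>content</p></main>"


def fill_main(skeleton_src, partial_src):
    """Parse both sources independently, then swap in the partial's <main>."""
    skeleton = ET.fromstring(skeleton_src)  # raises ParseError if malformed
    partial = ET.fromstring(partial_src)    # the partial must parse by itself
    if partial.tag != "main":
        raise ValueError("partial must have <main> as its root")
    for parent in skeleton.iter():
        for index, child in enumerate(list(parent)):
            if child.tag == "main":
                parent[index] = partial  # node replacement, not string splicing
                return ET.tostring(skeleton, encoding="unicode")
    raise ValueError("skeleton has no <main> element")
```

This sketch only guarantees well-formedness of the combined tree; checks such as duplicate ids would need an extra pass over the result.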

                          Additionally, you can do things like fill form fields based on the parsed info automatically, make your dom do like form.setValues(some_object) and it finds them and/or adds them with populated values - no need to explicitly write out name="foo" value=<%=data.foo%> and such every time. I like this a lot.

                          But in any case over the specifics, both my old way of dom functions in code and my new way of loading a file both bring me sooooo much joy over string templates. Using ruby erb, I had a bug go to production because some big monster of a thing ended up generating <div><form></div><input type=submit /></form></div> and…. even in the same browser it would SOMETIMES work and sometimes not. I don’t even know why it was so random but it’d work most the time and then just do absolutely nothing when you click the button other times. (Maybe it had to do with how the html streamed off the network, making it more likely to work with local tests? idk) But that same thing with my dom approaches would have been an instant exception, it wouldn’t have survived five seconds of testing, much less a deployment.

                          I’ll grant you can achieve this kind of thing too by just passing the output of your other thing through a validator but… i’ve never seen that done in practice anywhere else and with the dom library, it just always just works.

                          I’m happy to see a blog post that actually agrees with me on the concept for once! But I’m still surprised more people haven’t played with it. Perhaps I’m somewhat unique in that I made a custom server-side dom library with dozens of convenience functions; I’ve tried this approach with other dom libs and i still like it, but I don’t love it without my helper funcs. Still those weren’t that hard to write so still surprised i haven’t seen more people try it.

                          1. 1

                            Neat. I’ve wanted to make a templating language along those lines for a few years now, but I haven’t found time for it. What language are you working in? Nothing public yet?

                            1. 2

                              I do everything in D, here’s the docs for my newer thing: http://arsd-official.dpldocs.info/arsd.webtemplate.html and the dom lib: http://arsd-official.dpldocs.info/arsd.dom.html the files all live in here: https://github.com/adamdruppe/arsd The code for the webtemplate.d is fairly short, since it leans heavily on the dom.d and script.d modules. I kinda wanna do more with it, but I think I only have one public project using that webtemplate library and one work thing and my older techniques with pure dom are not public.

                              The public project is: https://github.com/adamdruppe/ffr-bingo not exactly the most beautiful code lol im always self-conscious about sharing things, but if you look in the templates directory, skeleton.html and home.html are pretty easy to read, nothing too special there, but that’s the point - i wanted it all to look nice and familiar to people who don’t know D or my custom libs.

                          2. 3

                            Is there really anyone around who would argue to the contrary in 2023? I mean sure, there are plenty of people still using string based approaches, but only because they’re easy, or because the codebase already works that way, surely not because they actually think they’re better?

                            1. 3

                              What’s the argument precisely? That all languages should have JSX-like literals?

                              But many don’t currently (Go, Rust, Ruby, Python), so you will almost certainly have to use string-based approaches sometimes.

                              I agree that manual escaping is not really acceptable anymore. Auto-escaping is more desirable, but still has limitations. It’s also not widely deployed.

                              The major limitation of the AST approach is that it only goes one level deep, whereas JS, CSS, and URIs are commonly embedded in HTML. e.g. my comment above: https://lobste.rs/s/j4ajfo/producing_html_using_string_templates#c_zyoyqn
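To show what "more than one level" means in practice, here is a minimal Python sketch of a URL inside an HTML attribute. The function name `search_link`, the example.com address, and the `lang=en` parameter are all hypothetical. The value has to be escaped from the inside out: first for the URL it sits in, then for the HTML attribute the URL sits in.

```python
import html
from urllib.parse import quote


def search_link(query, label):
    """Escape inner-to-outer: URL component first, then the HTML attribute."""
    # Level 1: the query is data inside a URL, so percent-encode it.
    url = "https://example.com/search?q=" + quote(query, safe="") + "&lang=en"
    # Level 2: the whole URL is data inside an HTML attribute, so
    # entity-escape it (this is where the "&" separator becomes "&amp;").
    return '<a href="{}">{}</a>'.format(
        html.escape(url, quote=True), html.escape(label)
    )
```

A tree-based HTML builder handles the second level automatically, but it sees the URL as an opaque string, so the first level is still the caller's job. That is the gap being described.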

                              So honestly the problem is UNSOLVED and there’s not “one answer in 2023” that everyone should use.

                              1. 5

                                What’s the argument precisely? That all languages should have JSX-like literals?

                                I haven’t used JSX; the argument I’m thinking of is more that the ERb/PHP style of “here’s an HTML file; just smush some strings into it” is not as good as using your language’s native data structures to represent the document’s tree structures. For instance, I use this in Fennel to create an HTML list, but it’s the same notation any application would use for data structures:

                                [:li {} [:a {:href "https://www.minetest.net/"} "video"]
                                        [:a {:href "https://love2d.org"} "game"]
                                        [:a {:href "https://tic80.com"} "dev"]]
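For readers who do not use a Lisp, roughly the same idea can be sketched in Python with plain lists and dicts. The `render` function and the `[tag, attrs, *children]` layout are assumptions made up for this example, not an existing library. The document is ordinary data until the final step, and escaping happens in exactly one place:

```python
import html


def render(node):
    """Render a [tag, attrs, *children] list (or a bare string) to HTML."""
    if isinstance(node, str):
        return html.escape(node, quote=False)
    tag, attrs, *children = node
    attr_text = "".join(
        ' {}="{}"'.format(name, html.escape(str(value), quote=True))
        for name, value in attrs.items()
    )
    inner = "".join(render(child) for child in children)
    return "<{0}{1}>{2}</{0}>".format(tag, attr_text, inner)


LINKS = ["li", {},
         ["a", {"href": "https://www.minetest.net/"}, "video"],
         ["a", {"href": "https://love2d.org"}, "game"],
         ["a", {"href": "https://tic80.com"}, "dev"]]
```

Calling `render(LINKS)` walks the structure recursively. This sketch ignores void elements such as `<br>`, which a real renderer would need to special-case.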
                                

                                The major limitation of the AST approach is that it only goes one [language] deep, whereas JS, CSS, and URIs are commonly embedded in HTML.

                                I understand that this is supported by HTML, but I have never understood how this could be considered a desirable thing in nontrivial cases. I see it done a lot, but every time I do it myself, I end up regretting it, and it goes much better when you have separate languages in separate files. (for JS/CSS, not URIs). The fact that this is not supported counts as a big plus in my book.

                                1. 1

                                  That approach doesn’t seem to be popular, for whatever reason. If I had to guess, I’d say it’s < 5% of web apps that use it, while string templates or JSX-like literals are 95%.

                                  For example lobste.rs almost certainly doesn’t use it, since it’s written in Rails. (Probably one of the few web apps that does is Hacker News, which is written in a Lisp dialect!)

                                  So I think plenty of people would argue the contrary in 2023 – like the people writing 95% of the web apps we use :)

                                  I think the reason is pretty mundane: You can do some version of that in Ruby and Python (and Lua), but it’s cumbersome and ugly.

                                  I also wonder if it’s better to have the same syntax for HTML/JSX across N languages, than to use N different syntaxes for the same thing. It certainly makes it easier to copy example HTML / CSS / JS from the web.

                                  1. 2

                                    Most rails apps I’ve worked on use slim (or haml in the old days) which you can argue if it’s pure ast, but it’s not just smushing strings like erb/php

                                    1. 2

                                      OK interesting.

                                      Well lobste.rs itself uses ERB, which appears to be “string templating with HTML auto-escaping”, as far as I can tell. It looks like you call raw() to disable HTML escaping.

                                      https://github.com/lobsters/lobsters/blob/master/app/views/messages/index.html.erb

                                      Now that I think about it more, it’s not clear to me that this solution is worse in terms of escaping than the AST approach.

                                      It understands a single level of escaping, and AFAICT so do most of the AST approaches.

                                      According to the article, examples of the AST approach include:

                                      • SXML as used by Lisp/Scheme and Stan for Python.
                                      • DSLs
                                        • These are DSLs implemented fully independently. Examples include Haml and Slim (both Ruby) and Hamlet (used by Haskell’s Yesod web framework). A large number of variant languages inspired by Haml also exist.

                                      But I think all of them are DSLs for HTML. Ditto with the XML-based solutions used, which also seem rarely used in practice.

                                      Do they understand

                                      • URI syntax inside HTML
                                      • JavaScript syntax inside HTML, including JavaScript string literals
                                      • URIs inside JavaScript inside HTML

                                      ?

                                      What do you think @hlandau ?

                                      1. 1

                                        Back when I did Rails (and this was well over a decade ago) 95% of people used ERb because that’s what Rails used out of the box, presumably because it was easy to learn due to its similarity to PHP. But everyone I know who actually bothered to try one of the alternatives (haml, markaby, whatever) hated ERb, but got stuck using it anyway, because it was too hard to change once the project got going, or they had to make concessions to teammates, or whatever.

                                        That’s why my original question was not “does anyone use the string approach”; I was more trying to get at “does anyone who has tried both approaches still actually think the string one is better”.

                                        1. 2

                                          I mostly use the string approaches after trying structured approaches. I don’t think it’s better in every way, but the structured approaches all have drawbacks IMO.

                                          IME the problem with the HAML-like external DSLs is that the languages aren’t expressive enough for logic.

                                          Or you make them expressive enough, and now you’re programming in a crappy template language rather than a real language.

                                          When using HAML, don’t you ever want to call back to Ruby to do something tricky? (honest question)

                                          The internal DSL approach with Lisp-like structures avoids that problem, because you can interleave data and code. (Actually Oil is trying to do that for shell, basically by making data and code similar and parsing them the same way: https://www.oilshell.org/release/latest/doc/hay.html)

                                          But like I said, there is the mundane reason that doing such things in Python / Ruby isn’t convenient.


                                          So what I’m claiming is that there is still a problem to be solved. I don’t think the answer is “everyone needs to use AST based approaches, and anyone who tries will realize that it’s better”.

                                          Probably the very first web app I wrote in Python was using the AST approach, because it has obvious appeal. But IME it just doesn’t work well in practice. It’s harder to read and write, and it doesn’t give you additional security over a string-based template engine.

                                          And the issue of copying markup from examples is not trivial either. Most programmers don’t remember all the nooks and crannies of HTML. It’s nice to be able to test things out in a file first, and then copy it into a dynamic program.


                                          My impression is that the AST approach works fine for small or toy websites (which is kind of what I write, very simple websites).

                                          But perhaps people who write “real websites” have settled on the string-based approach for good reasons – probably the issue of “volume”. i.e. making a website with 10,000 templates probably changes how you think about the tradeoffs.

                                          I believe string-based approaches are the default in nearly all web frameworks. Certainly in Django and Flask, in addition to Rails. I don’t think the authors of those frameworks have simply never tried AST approaches.

                                          (and I would be interested in how many Rails apps use HAML vs. more string-based approaches.)

                                          1. 1

                                            When using HAML, don’t you ever want to call back to Ruby to do something tricky? (honest question)

                                            Yeah, so my memory here was pretty fuzzy; I forgot that HAML was an external thing. It’s much better than ERb but I don’t advocate its approach for the reasons you outlined above. What I prefer is something that uses the native data structures and methods of the language in question.

                                            But like I said, there is the mundane reason that doing such things in Python / Ruby isn’t convenient.

                                            I disagree; markaby here (AST-based) is quite clear and readable:

                                                h1 "Boats.com has great deals"
                                                ul do
                                                  li "$49 for a canoe"
                                                  li "$39 for a raft"
                                                  li "$29 for a huge boot that floats and can fit 5 people"
                                                end
                                            
                                            1. 1

                                              (late reply)

                                              Markaby looks nice! But it has the awkwardness of not having a place for the HTML attributes.

                                              It invents a special language for “class” and “id”, e.g. li.myclass and li.myid!

                                              https://github.com/markaby/markaby

                                              In Python it would be even more awkward:

                                              with h1("Boats", ):
                                                with ul(class="foo"):
                                                  li("$49", id="bar")
                                                  a("my link", href=url_for(mydict))
                                              

                                              You would have to put the attributes at the end.

                                              (Also I saw that Markaby was invented by “why”, who was early in Ruby’s community … I sort of refuse to believe that the authors of Rails and the like simply never tried it.)


                                              But interestingly a shell-like syntax would work quite well, and give you a lot of flexibility around quoting and interpolation:

                                              h1 "Boats" {
                                                ul class="foo" {
                                                   li id="bar" '$49'
                                                   a href=$[url_for(mydict)] 'my link'
                                                }
                                              }
                                              

                                              I thought about this for Oil, but I am a little wary of it since JSX-style literals seem to be ubiquitous. They have a good “copy-paste” story.

                                              But being able to interleave code and data seamlessly (which we can do with Oil + Hay-style data blocks), is a killer feature.

                                              Hmm …

                                  2. 1

                                    I’m not sure where that approach came from but I first saw it in Seaside and it was fantastic for two reasons:

                                    • it let you build modular components easily, where the outer view created a div and then called the inner view’s draw method.
                                    • it composed very cleanly with caching: if a view didn’t mark itself as needing a redraw then the framework could just reuse the same HTML (data structure or string) as last time.
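A rough Python sketch of those two properties, under loud assumptions: the class and method names (`View`, `render`, `draw`, `needs_redraw`, `Counter`, `Page`) are invented for illustration and are not Seaside's API. The inner view caches its own output and only redraws when it has marked itself dirty; the outer view creates the div and delegates to the inner one.

```python
import html

class View:
    """A component that renders to an HTML string and caches the result."""

    def __init__(self):
        self._cached = None        # last rendered HTML, if any
        self.needs_redraw = True

    def draw(self) -> str:         # subclasses override this
        raise NotImplementedError

    def render(self) -> str:
        # Reuse the previous output unless the view marked itself as dirty.
        if self.needs_redraw or self._cached is None:
            self._cached = self.draw()
            self.needs_redraw = False
        return self._cached

class Counter(View):
    def __init__(self):
        super().__init__()
        self.count = 0
        self.draws = 0             # only here to make the caching observable

    def increment(self):
        self.count += 1
        self.needs_redraw = True

    def draw(self) -> str:
        self.draws += 1
        return "<span>" + html.escape(str(self.count)) + "</span>"

class Page:
    """Outer view: creates the div, then asks the inner view to render itself."""

    def __init__(self, inner: View):
        self.inner = inner

    def draw(self) -> str:
        return "<div>" + self.inner.render() + "</div>"
```

In this sketch the outer `Page` deliberately does not cache, so it can never serve a stale copy of its child; a real framework would need some way to propagate dirtiness upwards.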
                                  3. 1

                                    small update since I can’t edit: contextual auto-escaping isn’t widely deployed, but auto-escaping is

                                2. 3

                                  Perl’s CGI.pm circa 1999 did a lot of this kind of thing.

                                  I messed around with the general idea (sigh) 13 years ago: http://code.zoic.org/nontemplate/

                                  1. 3

                                    If I recall correctly, Haml didn’t actually build an AST - it just translated the syntax back to strings. And if you had a sub-template, you could call it (regardless of whether it was Haml or Erb) and get the results into the Haml via a string. So just having something that looks like it’s an AST doesn’t mean it actually has the security benefits of one.

                                    Note that nowadays most “string” solutions automatically escape their template-injected strings unless you explicitly ask them not to, so the problem isn’t as big as it used to be (although there’s still the problem of contextual placement, as the article points out). Nevertheless, I totally agree that a structural approach where you actually represent the tree you’re trying to write out as a string is the proper solution.
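For anyone who hasn't seen how escape-by-default works in string templating, here is a minimal Python sketch. The names `Safe` and `render` are hypothetical and this is not the implementation of any particular engine; it only illustrates the "escape unless explicitly told not to" rule:

```python
import html

class Safe(str):
    """Marks a string as already-escaped HTML (the explicit opt-out)."""

def render(template: str, **values: object) -> str:
    # Escape every injected value unless the caller wrapped it in Safe.
    escaped = {
        name: value if isinstance(value, Safe) else html.escape(str(value), quote=True)
        for name, value in values.items()
    }
    return template.format(**escaped)

print(render("<p>{body}</p>", body="<b>hi</b> & bye"))
print(render("<p>{body}</p>", body=Safe("<b>hi</b>")))
```

Note that this escapes for a single context only (HTML text and quoted attribute values). It knows nothing about where in the template a value lands, which is exactly the contextual placement problem.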

                                    1. 2

                                      Yeah I agree that the issue is more complex. On closer inspection, the article doesn’t actually justify what the title claims.

                                      i.e. It’s not clear to me that AST approaches are actually more secure than string-based ones in practice.

                                      https://lobste.rs/s/j4ajfo/producing_html_using_string_templates#c_smlpfm

                                      The comparison seems to be between real deployed string approaches, vs. mostly hypothetical AST approaches.

                                      AFAICT, the real AST-based approaches ALSO have the contextual escaping problem. Interested in counterexamples.

                                      1. 3

                                        AFAICT, the real AST-based approaches ALSO have the contextual escaping problem. Interested in counterexamples.

                                        If you treat URLs as proper objects, this is not a problem.

                                        For example, let’s say we want to construct a URI based on a base URI with two params (for the sake of example they’re hardcoded here):

                                        (let ((link-url (update-uri base-url query: '((param-a . "this is") (param-b . "some&special stuff"))))
                                               (link-text "some&other special < stuff"))
                                          `(a (@ (href ,(uri->string link-url))) ,link-text))
                                        

                                        This then encodes to:

                                        <a href="http://www.example.com/some/page?param-a=this%20is&amp;param-b=some%26special%20stuff">some&amp;other special &lt; stuff</a>
                                        

                                        The uri library is responsible for correctly url-encoding query parameters and path components, while the SXML library is responsible merely for correctly HTML/XML-encoding the attribute and text nodes. I think this is as it should be, the XML stuff shouldn’t even be aware of URI-encoding.
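The same division of labour can be sketched in Python with only the standard library. The helper name `link` is an assumption for illustration, and the base URL is taken from the output shown above. `urllib.parse` handles the URI layer and `html.escape` handles the HTML layer, and neither knows about the other:

```python
import html
from urllib.parse import quote, urlencode

def link(base_url: str, params: dict, text: str) -> str:
    # URI layer: percent-encode each query parameter
    # (quote_via=quote gives %20 for spaces rather than +).
    url = base_url + "?" + urlencode(params, quote_via=quote)
    # HTML layer: escape the finished URL as an attribute value,
    # and the link text as a text node.
    return '<a href="' + html.escape(url, quote=True) + '">' + html.escape(text) + "</a>"

print(link(
    "http://www.example.com/some/page",
    {"param-a": "this is", "param-b": "some&special stuff"},
    "some&other special < stuff",
))
```

The difference from the SXML version is that nothing here forces you to go through `link`; with plain strings you can still concatenate an unescaped value by accident.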

                                        NOTE: JavaScript in HTML is a different kettle of fish, unfortunately it doesn’t compose very well - AFAIK, JavaScript cannot be XML-encoded, so for example the following would be the logical thing to do when encoding JS inside XML and hopefully also HTML:

                                        <script type="text/javascript">
                                        if (10 &gt; 2) {
                                          window.alert("News flash: 10 is bigger than 2");
                                        }
                                        </script>
                                        

                                        Unfortunately, this is a syntax error. <script> blocks behave more like CDATA sections than regular tags. This is no doubt for ergonomic reasons, but it also means that any script handling needs to be special-cased. In SXML you could typically do that with a preorder rule that shortcuts the regular encoding. But it’s ugly no matter how you slice it.
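One common way to special-case this, sketched in Python: when embedding data in a <script> block, skip HTML escaping and instead escape at the JavaScript level so that the closing tag can never appear in the output. The function name `script_with_data` is hypothetical, and the sketch assumes `var_name` is a trusted identifier and `data` is plain JSON-serialisable data:

```python
import json

def script_with_data(var_name: str, data: object) -> str:
    # json.dumps produces a literal that JavaScript accepts for plain data.
    literal = json.dumps(data)
    # Inside <script>, entities are not decoded, so HTML-escaping would corrupt the code.
    # Rewrite "<" as a JS unicode escape instead, so "</script>" or "<!--" cannot appear.
    # In JSON output "<" can only occur inside string literals, where \u003c means the same.
    literal = literal.replace("<", "\\u003c")
    return "<script>var " + var_name + " = " + literal + ";</script>"

print(script_with_data("msg", "</script><b>"))
```

This only covers data injected into a script; it does nothing for hand-written code like the `10 > 2` example, which has to be emitted verbatim.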

                                        1. 2

                                          Right, good examples. I’m not saying the problem can’t be solved in principle – just saying that I don’t see a lot of practical and robust solutions for it. That is, what are the solutions in HAML and XHTML approaches for embedded languages and auto escaping (JS, URIs, URIs in URIs), and are they deployed in real apps?

                                          It does seem like the Lisp solutions have the strongest claim, but I think they are mainly used for “simple” sites like Hacker News (which is good IMO – I like simple sites)

                                          But I’d say it’s a little like comparing Python and Standard ML … Standard ML is nice and clean in its domain, but not really a fair comparison, because Python has to deal with a wider range of problems, because it’s used 1000x more

                                          1. 2

                                            That is, what are the solutions in HAML and XHTML approaches for embedded languages and auto escaping (JS, URIs, URIs in URIs), and are they deployed in real apps?

                                            The idea would work just fine in any other language. For example, if you’re using an embedded DSL like haml, you could do exactly the same thing:

                                            - some_url = url_for(action: 'index', params: {param_a: 'this is', param_b: 'some&special stuff'})
                                            %a{href: some_url.to_s} Some text
                                            

                                            Unfortunately, this requires Rails. AFAICT, the standard Ruby URI library doesn’t have a clean way to add parameters from a hash to a query string (with auto-escaping, no less), so the example is not the best.

                                            BTW, if you have automatic string coercion methods you wouldn’t even have to call to_s on the result. I can’t remember if Haml does that.

                                          2. 1

                                            One example of a system for safely building URLs programmatically is Ruby on Rails’s feature of path and URL helpers. However, that only helps with building URLs internal to the app.

                                            The way it works is that the most common ways of defining a route also define methods that can build a URL to that route. For example, if you create a route /comments/:id, then in your view (whether it’s ERB, Haml, or another templating system), you can call the Ruby method comment_path(123) to get /comments/123 or comment_url(123) to get http://example.com/comments/123.
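The general idea (defining a route also gives you a function that builds paths for it) can be sketched in a few lines of Python. This is not how Rails implements it; `path_helper` and the `:name` segment syntax handling below are assumptions for illustration only:

```python
from urllib.parse import quote

def path_helper(pattern: str):
    """Turn a route pattern such as '/comments/:id' into a path-building function."""
    segments = pattern.strip("/").split("/")

    def build(**params: object) -> str:
        parts = []
        for segment in segments:
            if segment.startswith(":"):
                # Dynamic segment: percent-encode the supplied value.
                parts.append(quote(str(params[segment[1:]]), safe=""))
            else:
                parts.append(segment)
        return "/" + "/".join(parts)

    return build

comment_path = path_helper("/comments/:id")
print(comment_path(id=123))
```

Because the builder is derived from the route pattern, the escaping of dynamic segments lives in one place instead of at every call site.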

                                            1. 1

                                              I always really liked the syntax for constructing internal URLs. It’s too bad they didn’t build something equally ergonomic for constructing external URLs (AFAIK).

                                      2. 2

                                        I think a lot of new websites doing server side rendering are using react/jsx on the server side. Definitely not all, but I would disagree with the author that there has been a “lack of adoption” of AST-based templating. Of course PHP and ERB are both older and have a huge number of websites using them - but that doesn’t necessarily mean that adoption of AST-based templating systems has not also occurred.

                                        1. 1

                                          Interesting article. Libraries that offer DOM-building functionality seem to be few and far between. Two good ones that come to mind are Scala’s Scalatags and OCaml’s Tyxml. A huge benefit of these in statically-typed languages is excellent editor support. You have autocompletion and typechecking on all available tags and attributes.