1. 23

  2. 5

    These are simple facts, and it seems a bit awkward to say them out loud, yet it is surprising how many people go for 304 nested divs, dynamically loaded with JavaScript and given custom classes/ids (just to match some Bootstrap CSS), only to display a centered div and be “responsive”.

    This shows that writing HTML/CSS by hand actually has a good result/effort ratio! Generating everything doesn’t seem that useful.

    1. 4

      That’s my hope from this series - you don’t need 5MB of JS to have a good looking site. It can be done with plain old HTML/CSS. Going back to basics can sometimes help. :-)

    2. 3

      I had no idea CSS had variables.

      I actually use a Haskell DSL to define CSS in my static site generator (rib), so I never needed that feature (as I can compose CSS however I want in Haskell).

      1. 2

        Sure, that tutorial will leave a newbie with a working website… but it could be better.

        On lines 15 and 19 you have unclosed tags. ;) People really should learn to always close their tags and get to know the tools for validating well-formedness, because it will save them time wondering why some things don’t look as expected. Knowing when it’s sort of safe to leave a tag unclosed takes more understanding of HTML and HTML parsers than a novice webmaster can or wants to possess at the time they make their first homepage. Maybe to the point of writing <img src="cat.jpg"> </img> because you can’t go wrong with it.

        Next, it should probably start with a semi-formal explanation of what elements, attributes, and selectors are. Less confusion later.

        Last but not least, it’s not the IE6 era anymore! People should learn about browsers’ dev tools. It’s much more fun to edit CSS in the debugger and then copy and save it to a file when you are satisfied than to go through edit-upload-reload cycles.

        1. 10

          Maybe to the point of writing <img src="cat.jpg"> </img> because you can’t go wrong with it.

          Wouldn’t be valid HTML5 though.

          Must have a start tag and must not have an end tag.


          1. 2

            I still wonder what good it does, to make the “standard” (for which there is no DTD) intentionally XML-unsafe and impossible to parse into an AST without knowing which elements are supposed to have no closing tag. It gets especially funny if you consider the script element that must have a closing tag in HTML even if it’s there just for the src attribute.

            XHTML rules regarding tag closing are hard and fast and easy to teach. Luckily, XML’y syntax is understood by all real life parsers.

            1. 3

              Well, HTML isn’t XML, never has been, and isn’t even a descendant of XML. Trying to parse it with an XML parser would be kinda stupid.

              1. 2

                I know. I still think XHTML was a great project. You could have parsers that are not piles of special cases. You could have a machine-readable, complete description of exactly what the allowed tags are. You could use the whole array of XML tools.

                There are so many great things under the “HTML5” umbrella, but giving up simple machine readability and verifiability is not one of them.

                1. 1

                  You can have those right now! Well, as long as you parse your own HTML and not other people’s. And I think many important use cases fall in this category.

                  I just wrote an HTML parser for all of https://www.oilshell.org/, and it’s very short and works well. The html.py file is 344 lines with no dependencies!

                  Honestly this was a bit surprising to me after 25 years of using the web. HTML is not a complicated file format, and it’s not hard to avoid ill-formed files (or fix them with tidy).


                  Note the http://xmlpull.org on that page – it’s a very old alternative to DOM and SAX that’s a lot easier to consume. It’s weird that nobody uses this style! (e.g. the Python stdlib) I want to import it into Oil.

                  The inputs are the HTML that CommonMark generates, hand-written HTML, and HTML generated by Python scripts.

                  So if you know your inputs, it’s not a hard problem.

                  I may write a troll-ish blog post entitled: All You Need to Parse HTML is Regular Expressions. The caveats are:

                  1. You also need a stack. But as long as your programming language has functions (which all do), you have a stack!
                  2. You can’t parse other people’s HTML, as mentioned. But there’s absolutely no harm in trying. HTML has no syntax errors, but my HTML lexer will give you some syntax errors.

                  So the whole Oil site toolchain is now based on this style of HTML processing. It does a lot of stuff including generating the TOC, generating headers and footers, generating the help text, expanding shortcuts, removing comments, etc.

                  It would be worthwhile to define the subset of HTML recognized. But it’s honestly not a big problem. The whole thing fits in like 8 regexes! It’s almost a spec already.

                  edit: It’s true that my lexer doesn’t handle <script> or <style> correctly, but none of my docs use inline scripts or styles. There are probably a few other things like this, but I guess I’m saying that there’s a subset of HTML that’s easy to recognize and very powerful.
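
                  A minimal sketch of the “regexes plus a stack” idea, for illustration only. This is not the actual html.py: the token grammar and the names TOKEN_RE, lex, and check_balanced are assumptions, and the sketch assumes every start tag is either closed or written as <tag/>. Like the lexer described above, it does not handle <script> or <style> bodies.

```python
import re

# Hypothetical token grammar for a small HTML subset (an assumption for this sketch).
# It does not handle <script>/<style> bodies or ">" inside attribute values.
TOKEN_RE = re.compile(
    r"""
      (?P<comment> <!--.*?--> )
    | (?P<doctype> <![^>]*> )
    | (?P<end>     </ (?P<end_name>[A-Za-z][A-Za-z0-9]*) \s* > )
    | (?P<start>   <  (?P<start_name>[A-Za-z][A-Za-z0-9]*) [^<>]*? (?P<selfclose>/?) > )
    | (?P<text>    [^<]+ )
    """,
    re.VERBOSE | re.DOTALL,
)

KINDS = ("comment", "doctype", "end", "start", "text")


def lex(html):
    """Yield (kind, match) pairs; raise SyntaxError on input no token matches."""
    pos = 0
    while pos < len(html):
        m = TOKEN_RE.match(html, pos)
        if m is None:
            raise SyntaxError(f"unexpected {html[pos]!r} at offset {pos}")
        # Exactly one of the top-level alternatives matched.
        kind = next(k for k in KINDS if m.group(k) is not None)
        yield kind, m
        pos = m.end()


def check_balanced(html):
    """Use an explicit stack to check that start and end tags nest properly."""
    stack = []
    for kind, m in lex(html):
        if kind == "start" and not m.group("selfclose"):
            stack.append(m.group("start_name").lower())
        elif kind == "end":
            name = m.group("end_name").lower()
            if not stack or stack.pop() != name:
                raise SyntaxError(f"unexpected </{name}> at offset {m.start()}")
    if stack:
        raise SyntaxError(f"unclosed <{stack[-1]}>")
    return True
```

                  Calling check_balanced('<p>Hi <a href="x">there</a><br/></p>') returns True, while check_balanced('<p><b>x</p>') raises SyntaxError, which is the kind of syntax error a strict lexer can give you even though HTML itself has none.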

                  1. 1

                    Well, regular expressions plus stack is a pushdown automaton, exactly the thing you need for recognizing a context-free language.

                    1. 1

                      Yup! Basically my point is that it’s not hard to parse a large useful subset of HTML “by hand”, or with your own small library. You don’t have to go to XML to get nice tools.

                      I parse it with “spans” (start pos, end pos) rather than materializing objects, which makes it fast.

                      And I also “lazily” parse what’s inside <a foo="bar">. So that span is only parsed for its structure if the program uses it.

                      • With SAX-style APIs, you lose your stack because you get callbacks. The control flow is inverted which makes HTML processing by hand hard. I wrote a few processors against Python’s API and they look really ugly because of this.
                      • DOM style APIs are convenient, but they can be slow, and you’re forced to recognize everything up front. You have to recognize void elements. But all the HTML processors I write don’t care about void elements. For example, if I want to extract <h2></h2> and put it in the TOC, I don’t look at anything else. I just look at the lexical structure and not the tag structure.

                      I can just reuse tidy for the tag validator, or it would be simple to write my own with the list of 14 void elements. But I don’t need to do that to “get work done” with this approach. These details don’t block me because recognizing the lexical structure of HTML is easy, and then matching selected tags is easy if using the stack you already have.

                      XML has this heavyweight “all or nothing” approach. But you can get some of those advantages with HTML. I think they did a good job cleaning it up with HTML 5.
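
                      A rough sketch of the span-based style described above, applied to the <h2> table-of-contents case. The patterns and the names H2_RE, ATTR_RE, h2_spans, parse_attrs, and build_toc are assumptions made up for illustration, not the Oil code. It only looks at the lexical structure around <h2>, so void elements elsewhere in the page never matter, and attributes are parsed only when a caller asks for them.

```python
import re

# Hypothetical patterns: match <h2 ...>...</h2> without building a tree.
H2_RE = re.compile(r"<h2(\s[^<>]*)?>(.*?)</h2\s*>", re.DOTALL | re.IGNORECASE)
ATTR_RE = re.compile(r'([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*"([^"]*)"')


def h2_spans(html):
    """Yield (attr_span, body_span) as (start, end) offsets for each <h2>."""
    for m in H2_RE.finditer(html):
        # span(1) is (-1, -1) when the tag has no attributes.
        yield m.span(1), m.span(2)


def parse_attrs(html, span):
    """Parse double-quoted attributes inside a span, only when asked to."""
    start, end = span
    if start == -1:
        return {}
    return dict(ATTR_RE.findall(html, start, end))


def build_toc(html):
    """Return a list of (id, title) pairs, one per <h2> heading."""
    toc = []
    for attr_span, (body_start, body_end) in h2_spans(html):
        attrs = parse_attrs(html, attr_span)
        toc.append((attrs.get("id"), html[body_start:body_end].strip()))
    return toc
```

                      For a page containing <h2 id="install">Install</h2> and <h2>Usage</h2>, build_toc returns [("install", "Install"), (None, "Usage")], regardless of any unclosed <br> or <img> elsewhere in the document.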

                  2. 1

                    Ignoring the bad parts of XML (the extreme verboseness, the cost of the added extensibility which is unnecessary for something like HTML, …), it would have been kind of nice to have a more machine-readable grammar which could be parsed into an AST without knowing the definition of every tag. It would require a pact between all browser vendors to always show an error when they encounter malformed HTML instead of competing on who can make sense of malformed HTML the best. However, that’s not really the world we live in, and it’s certainly unrealistic to convince browser vendors to break currently-working websites, and a new XHTML-like initiative would certainly be ignored just like the original XHTML largely was.

                    In HTML, <img src="foo"> is the correct way to introduce an image element. HTML parsers will just have to keep around a list of which elements are self-closing. That’s not really a big issue at all IMO.

                    1. 1

                      I agree this issue is annoying, and I thought about making my HTML lexer aware of it (mentioned in a sibling comment here).

                      The list of void elements is pretty small though. It’s this list of 14 tags.

                      Is there anything else that’s syntactically complicated about HTML vs. XML? <script> and <style> are two other special cases because they can have < in them.

                      I’m not sure what the issue is with template, textarea, and title. I’d be interested in knowing the details.


                      There are six different kinds of elements: void elements, the template element, raw text elements, escapable raw text elements, foreign elements, and normal elements.
                      Void elements
                          area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr
                      The template element
                          template
                      Raw text elements
                          script, style
                      Escapable raw text elements
                          textarea, title
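
                      A minimal sketch of a tag validator built on that list of 14 void elements. The names VOID_ELEMENTS, TAG_RE, and validate_tags are hypothetical, and the sketch assumes the input has no comments, no <script>/<style> bodies, and no omitted optional end tags such as </p> or </li>.

```python
import re

# The 14 void elements quoted above from the HTML spec.
VOID_ELEMENTS = frozenset(
    "area base br col embed hr img input link meta param source track wbr".split()
)

# Hypothetical tag pattern: optional "/", a tag name, then anything up to ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def validate_tags(html):
    """Return a list of error strings; an empty list means the tags nest properly."""
    errors = []
    stack = []
    for m in TAG_RE.finditer(html):
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name in VOID_ELEMENTS:
            # Void elements never go on the stack and must not be closed.
            if closing:
                errors.append(
                    f"void element <{name}> must not have an end tag (offset {m.start()})"
                )
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            errors.append(f"unexpected </{name}> at offset {m.start()}")
    errors.extend(f"unclosed <{name}>" for name in reversed(stack))
    return errors
```

                      With this, validate_tags('<p><img src="cat.jpg"><br></p>') returns an empty list, while validate_tags('<img src="cat.jpg"> </img>') reports the stray end tag.
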
                      1. 1

                        What is the extreme verboseness so many people are talking about? It’s no more verbose than HTML. You can close any tag with <tag /> and it will be syntactically valid (whether the schema allows it to be empty is another story irrelevant to parsing).

                        Also, bringing CDATA/PCDATA pragmas back to HTML would save people a lot of special character escaping.

                        1. 3

                          The doctype is incredibly verbose. Consider the XHTML Doctype (and anything based on XML would need to have something similar):

                          <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

                          Versus the HTML doctype:

                          <!DOCTYPE html>

                          All the XML namespacing stuff is incredibly verbose (complete URIs everywhere), and largely unnecessary for HTML.

                          XML can’t contain tags which work like HTML’s style or script. This will always be invalid XML:

                              if (a < b && b > c) { ... }

                          In XML, we’d need to write:

                              if (a &lt; b &amp;&amp; b &gt; c) { ... }

                          The same probably applies to CSS (unless your CSS happens to not contain any characters XML treats differently).

                          (By the way, HTML has cdata for the rare cases you need it: https://html.spec.whatwg.org/multipage/syntax.html#cdata-sections)

                          The closing tag stuff is obviously sometimes slightly more verbose, but I agree that writing <img src="foo" /> isn’t a big issue. Also, making the rules about self-closing tags consistent instead of defined specifically for every element it applies to would cut down on other sources of verbosity, since <script src="foo.js" /> would be legal, as you pointed out earlier.

                          1. 2

                            That doctype a) has nothing to do with XML b) provides a link to a machine-readable description of the language.

                            The CDATA pragma allows you to have the parser ignore special characters inside any tag. In HTML(5), script and style tags are special in that they are allowed to have unescaped <, > etc. in them. In X(HT)ML you can do it for any tag, e.g. have readable code snippets in <pre> and <code>, at the cost of those two extra lines.

                            What people think of as verboseness is not bloat, it’s trade-offs.
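
                            A small sketch of that trade-off using Python’s standard xml.etree.ElementTree parser. The constants RAW, ESCAPED, and CDATA and the helper text_or_error are made up for illustration: the raw snippet is rejected as XML, while the escaped and CDATA-wrapped versions both parse to the same text.

```python
import xml.etree.ElementTree as ET

# Three ways of putting the same code snippet inside a <pre> element.
RAW = "<pre>if (a < b && b > c) { ... }</pre>"
ESCAPED = "<pre>if (a &lt; b &amp;&amp; b &gt; c) { ... }</pre>"
CDATA = "<pre><![CDATA[if (a < b && b > c) { ... }]]></pre>"


def text_or_error(xml_source):
    """Return the element's text, or a short message if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


if __name__ == "__main__":
    for label, source in (("raw", RAW), ("escaped", ESCAPED), ("cdata", CDATA)):
        print(label, "->", text_or_error(source))
```

                            The CDATA version keeps the snippet readable in the source at the cost of the two wrapper markers, which is the trade-off in question.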

                            1. 1

                              and it is worth noting that the html behavior itself can be weird: try var ending = "</script>"; in there and enjoy the syntax error; you have to do like ending = "</" + "script>"; or whatever. So yeah trade off city.

                              Of course, if you are writing your own html(ish) parser it is easy enough to just special case those two tags; once you see the opening, you string scan for the next closing. Which is of course exactly why the "</script>" is problematic; the outer layer just sees that sequence of characters without knowing what the context is supposed to be.
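
                              A rough sketch of that special case, under the assumption that a plain scan for the next closing tag is good enough (the real HTML tokenizer rules for script data are more involved). The names SCRIPT_OPEN_RE, SCRIPT_CLOSE_RE, and script_bodies are hypothetical.

```python
import re

# Hypothetical patterns for the opening and closing script tags.
SCRIPT_OPEN_RE = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
SCRIPT_CLOSE_RE = re.compile(r"</script\s*>", re.IGNORECASE)


def script_bodies(html):
    """Return the raw text of each <script> element, found by plain scanning."""
    bodies = []
    pos = 0
    while True:
        opening = SCRIPT_OPEN_RE.search(html, pos)
        if opening is None:
            return bodies
        # No knowledge of JavaScript here: the first closing tag wins.
        closing = SCRIPT_CLOSE_RE.search(html, opening.end())
        if closing is None:
            raise SyntaxError(f"unclosed <script> at offset {opening.start()}")
        bodies.append(html[opening.end():closing.start()])
        pos = closing.end()
```

                              Feeding it '<script>var ending = "</script>";</script>' yields the truncated body 'var ending = "', because the scan stops at the first closing tag, while the "</" + "script>" workaround comes through intact.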

                2. 1

                  “Tag omission in text/html: A non-normative description of whether, in the text/html syntax, the start and end tags can be omitted. This information is redundant with the normative requirements given in the optional tags section, and is provided in the element definitions only as a convenience.”

                  So it’s only invalid if the “optional tags” section says so. I couldn’t find that on the “images” page.

                  Ah here https://html.spec.whatwg.org/multipage/syntax.html#syntax-tag-omission

                  No mention of IMG in that section at all.

                  My interpretation of the above is providing the end tag is redundant but not illegal.

              2. [Comment from banned user removed]

                1. 1

                  My main site is WordPress with a theme I built (based on another theme). I’ve been meaning to put the theme on Github or something, but I haven’t got around to it yet. I will at some point though.

                  If you’re talking about the light website, I’ll put the source for that up on Github (and/or on the Downloads page) once the series is finished.

                  1. 2

                    It would be cool if Neocities had an option to enable a “Download site source” link in the public profile to allow every visitor to get the site source and play with it. That would be a relatively simple PR. Sadly, Kyle Drake doesn’t seem to have much time to maintain it these days: my PR to neocities-cli and a few “RFC” type issues in the neocities repo have been ignored for months.

                    1. 1

                      Once the process is finished I will be adding a link to download the entire site source. The source for each step I’ve published is already available for download - https://mylight.website/downloads.html