
Recently I read somewhere that people tried to get special characters into Unicode that would be reserved for markup. Reportedly those symbols are deprecated and no longer available. I’m thinking about using some of the below-0x20 UTF-8 character codes for markup instead.

Every once in a while I try to build a language that improves upon Markdown and XML. Usually I don’t tell anyone about these projects, but similar ideas floating around motivated me to write this one up.

Summary of design choices

I’m going to give the reasoning for each of these shortly.

  1. Avoid escape characters entirely by using the 0x01-0x1F character codes.
  2. Represent vertical structures by tagging.
  3. Defer bloated annotations instead of inlining them within text bodies.
  4. Suitable for literate programming or interactive fiction.
  5. Allow editing as plain-text.
  6. Support rich-text editing.
  7. A format for word processing, convertible into HTML/TeX/PDF/PS/EPUB formats.
  8. Form a meta-language that is straightforward to parse.
  9. When HTML-like notation is suitable, use it.
  10. Algebraic, relational-like document structure.

Summary of inspiration

The inspiration is explained as well.

  1. CP/M text editors and word processors.
  2. advsys, a tool for writing text adventure games.
  3. String literals in ecmascript 6.
  4. Logic programming, datalog.
  5. CSS-selectors.

I know you are going to ask what it would look like, so I am going to show some examples and describe the format along the way. This is what you would see if you were using a plain text editor to work with the files, or if you had “reveal codes” enabled in a rich text editor designed for this format.

Text editors would be adjusted so that they map some of the sub-0x20 characters to ordinary characters and recolor them.

The character mapping could be something like:

0x10: '$'    0x11: '¤'   0x12: '['    0x13: ']'
0x14: ':'    0x15: '`'   0x16: '{'    0x17: '}'

These glyphs would stand in for DLE, DC1, DC2, DC3, DC4, NAK, SYN and ETB respectively. I propose that people interested in backwards compatibility go through this list and refine it to ensure the least amount of problems if these files end up catenated into any old-school terminal.
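
A “reveal codes” display layer along these lines can be sketched in a few lines of ecmascript. The glyph table mirrors the one above; `revealCodes` is a hypothetical name for illustration, not part of the proposal.

```javascript
// Sketch: map the proposed sub-0x20 control codes to visible glyphs
// for a "reveal codes" view. The glyphs follow the table above and are
// display-only stand-ins; the underlying file keeps the raw codes.
const revealGlyphs = {
  0x10: "$", 0x11: "¤", 0x12: "[", 0x13: "]",
  0x14: ":", 0x15: "`", 0x16: "{", 0x17: "}",
};

function revealCodes(text) {
  return Array.from(text, (ch) => revealGlyphs[ch.codePointAt(0)] ?? ch).join("");
}

console.log(revealCodes("\x10 + title \x11 Two pangrams \x11"));
// → $ + title ¤ Two pangrams ¤
```

An editor would additionally recolor the substituted glyphs so they can’t be confused with literal text.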

Additionally, the separator control characters below space (FS, GS, RS, US) could be put to the use they were originally intended for. This is not shown in the example and I haven’t decided how they’d be used.

First I’m going to show how it’d work for plain blogging.

$ + title ¤ Two English-language pangrams ¤

The quick brown fox jumps over the [i:lazy] dog.

Pack my box with five dozen liquor jugs.

The 0x10:'$' enters program mode, whereas 0x11:'¤' resumes text mode. Entering either mode always starts a fresh parse.

Program mode would use whichever programming language accompanies the meta-language that this file format is. In the example above, program mode is entered to denote that the next text block is a title.

Program and text mode can be repeated for extra meaning; for example, above, text mode is re-entered to drop out of writing the title.

Text mode is supposed to break text into paragraphs separated by empty lines. The processing program decides what actually happens here, but we would assume it doesn’t break the title into paragraphs, while it breaks the plain text as usual.

Rich-text editors are supposed to disregard what the processing program ends up doing to the text and display it as if it were paragraph-broken. Furthermore, the editor may even treat line breaks as marking semantic boundaries within paragraphs.
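
The paragraph-splitting rule described above is simple enough to sketch. This is my reading of it, with `paragraphs` as a hypothetical helper name:

```javascript
// Sketch: text mode breaking a block into paragraphs on empty lines,
// while keeping single line breaks intact inside each paragraph.
function paragraphs(text) {
  return text
    .split(/\n[ \t]*\n+/)        // empty (or whitespace-only) lines split
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

console.log(paragraphs("The quick brown fox.\n\nPack my box."));
// → [ 'The quick brown fox.', 'Pack my box.' ]
```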

The 0x12:'[' and 0x13:']' group a structure in text mode. The colon 0x14:':' denotes the group’s tag.
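
Because each control code has exactly one meaning and can never occur in ordinary text, tokenizing is trivial. A minimal sketch under the code assignments above (`tokenize` and the token names are mine):

```javascript
// Sketch: tokenizing the control codes. No escaping is needed because
// the codes below 0x20 cannot appear in ordinary text.
const CODE_NAMES = {
  "\x10": "program", "\x11": "text",
  "\x12": "open", "\x13": "close", "\x14": "tag-sep",
};

function tokenize(input) {
  const tokens = [];
  let run = "";
  for (const ch of input) {
    if (ch in CODE_NAMES) {
      if (run) tokens.push({ type: "chars", value: run });
      run = "";
      tokens.push({ type: CODE_NAMES[ch] });
    } else {
      run += ch;
    }
  }
  if (run) tokens.push({ type: "chars", value: run });
  return tokens;
}

// The [i:lazy] group from the pangram example:
console.log(tokenize("\x12i\x14lazy\x13").map((t) => t.type));
// → [ 'open', 'chars', 'tag-sep', 'chars', 'close' ]
```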

Vertical structures would be flattened by describing the structure in front of each text element with css-style selectors.

$ + ul.0
$ + ul.0 li.0 ¤
First item
$ + ul.0 li.1 ¤
Second item
$ + ul.0 li.2 ¤
Third item
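
Reassembling the flat selector paths into the vertical structure is a straightforward fold. The node shape below is an assumption for illustration, not part of the proposal:

```javascript
// Sketch: rebuild a nested structure from flattened selector paths.
// Each path step is tag.index, e.g. "ul.0 li.1" for the second item
// of the first list.
function unflatten(entries) {
  const root = { tag: "root", children: [] };
  for (const [path, text] of entries) {
    let node = root;
    for (const step of path.split(/\s+/)) {
      const [tag, index] = step.split(".");
      const i = Number(index);
      if (!node.children[i]) node.children[i] = { tag, children: [] };
      node = node.children[i];
    }
    node.text = text;
  }
  return root;
}

const tree = unflatten([
  ["ul.0 li.0", "First item"],
  ["ul.0 li.1", "Second item"],
]);
console.log(tree.children[0].children[1].text);
// → Second item
```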

Links and other things that don’t belong in the middle of the text can be annotated in program mode.

Visit [a:the example website].

$ a: href=`https://example.org` ¤

The 0x15:'`' is there for inlining text structures inside programs. This corresponds to ecmascript 6 template literals.

print(with_colors`{actor} is moving from {source} to {destination}`)

This kind of a block could be reinterpreted as:

        ["", " is moving from ", " to ", ""],
        [actor, source, destination))

Or how the programming language author desires to interpret it.
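
This is how ecmascript 6 tagged templates behave: the tag function receives the literal string parts and the interpolated values separately (the spec passes the values as rest arguments rather than a second array). The `with_colors` body below is a stand-in that just joins them back together; a real implementation could do anything with the parts. The proposal’s {holes} correspond to ecmascript’s ${...} substitutions.

```javascript
// Sketch: the tagged-template mechanism referenced above.
function with_colors(parts, ...values) {
  return parts.reduce(
    (out, part, i) => out + part + (i < values.length ? String(values[i]) : ""),
    ""
  );
}

const actor = "Alice", source = "hall", destination = "garden";
console.log(with_colors`${actor} is moving from ${source} to ${destination}`);
// → Alice is moving from hall to garden
```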

The holes marked with 0x16:'{' and 0x17:'}' would also be available within text mode.

I suppose it’d also be useful to denote that a block of text should be interpreted as something else entirely. Colons could be reused for that purpose.

¤ :svg:
<svg height="100" width="100">
  <circle cx="50" cy="50" r="40" stroke="black" stroke-width="3" fill="red" />
</svg>

¤ :math: x [over:1:2]

¤ :lilypond:
\time 2/4
\clef bass
c4 c g g a a g2

I haven’t implemented anything yet. Well.. I’ve tried whether Vim can reinterpret control characters when given a suitable plugin. It can reinterpret some of them as ordinary characters, but not all of them!

Rationale for design choices

The control character codes are reused because they are hardly used on the web and in HTML. I think this is the best choice because it removes all need for escape characters: structured files won’t be embedded in structured files, so the markup codes never need to appear as literal text.

Vertical structures are represented with tagging because I’ve noticed that begin/end blocks spanning multiple lines, e.g. in HTML, are really awkward to work with. Also, if you miss an ending tag, which you eventually will, it results in weird parsing.

Structures embedded with tons of tags also produce a gruesome mess. I may leave open the possibility of inlining an href link, if someone still wants to do it that way. Otherwise, the attributes for tags should be denoted some other way. After all, you could index the tags by their order of occurrence in the text block if there’s an absolute need to add lots of annotations.

I think that literate programming is important. Great documentation is detailed to the point that it ends up describing the source code. Program representations should be clean enough to be suitable for end-user consumption. Also, many kinds of text documents end up nearing the formatting needs of interactive fiction.

The plain-text/rich-text editing aspect of this is the hardest to explain. I think any programmer just understands the value here. I realised this when examining old word processing software. Great rich-text editing wouldn’t try to wean you away from plain text and the structure it already has; instead it would enrich it with structures.

When examining HTML/JS, I’ve realised that a separate publishing step is usually necessary. Just writing plain HTML files ends up being crummy, because eventually you have to generate a table of contents or a link bar or whatever else is required to give the user some interface around the content. Likewise, a single publishing source is usually not good enough.

The carrying idea of the format is that it requires you to choose a programming language that allows you to do compositing and processing within your document. In this sense it is a meta-language. I think ordinary people might use Haskell or Prolog for that, as they’re declarative languages that fit the purpose well. Anyway, I think this is an important aspect of this kind of format. Text documents could be more like databases and less like clumps of structures tied together.

HTML notation is used where it’s suitable, because the only thing I hate in HTML is the vertically strewn structures. Inline notation for links and such is actually okay.

I wanted an algebraic, relational-like structure for the document because I think it’s probably useful. I like that the pieces glue together in ways you might use anyway, even without a rich text editor on top. Pieces put together should be simple and allow pattern matching on them.


The CP/M-era text editing and word processing software is the main inspiration. I think it culminated in WordPerfect’s “reveal codes” feature and then started going downhill from there. I realised there’s a sweet spot right there.

Advsys is practically a Lisp environment for writing interactive fiction. Mostly it adds a customizable read-eval-print loop and string literals to the language. It shows that text files and programs aren’t very distant relatives.

Ecmascript’s string literals are a bit of a surprise. I think they’re actually the best-implemented string literals you can have.

CSS selectors are pattern-matching tools, and pattern matchers work for generating as well as for querying structure.

Logic programming and datalog further reinforce the idea that code is data.

I’d like to hear if you have proposals for refining this thing, to get more out of it. All sorts of criticism are welcome as well. I think there are a bunch of writing mistakes and garden-path sentences; I hope you manage with them.

  1. 17

    I’m missing some context: what are the shortcomings of Markdown and/or LaTeX that this proposal will fix? Thanks.

    1. 4

      Same here, it was not really clear to me what problem the new format is trying to solve.

      It looks like a literate programming language specialized to output HTML. Does it mean that a new interpreter / compiler / stdlib has to be written?

      It looks like a prerequisite to understand this text is to read up on all the bullet points presented in the “Summary of inspiration” section.

      The control character codes are reused because they are hardly used with web and HTML

      this was my main clue and it’s quite late in the document

      1. 2

        It’s better to turn it upside down and think of it as improving upon Markdown. We definitely have to support and work with Markdown for quite some while, just like the other common formats we have now.

        I’ve used markdown every week, so I am well aware of how it works for me and against me in writing text.

        Markdown assigns meanings to character sequences: '#', '=', '-', '**', '_', '~~', '*', '1.', '+', '[', ']', '(', ')', ']:', '![', '>', ...

        The implementations parse these by block processing over the text with regular expressions. You can escape those sequences when you use them in text, but it means you have to stay on your toes when introducing these elements; usually you catch mistakes when previewing your message. There are small differences in how different markdown implementations parse text, which causes inconvenience when switching among them.

        Markdown’s grammar of things it understands is fairly small. You get headers, links, a few ways to mark text, images, quotes, lists, tables. You fall back to HTML when it doesn’t understand things, but that results in a weird, HTML-specific document. Extensions to markdown are implementation-specific and do not translate between implementations. This can result in texts being monotonic, as you tend to stick to what markdown offers.

        Formatting characters trip up ordinary users and frustrate them. This happens no matter how many of them there are. The proposal removes the need for them, as well as for escape characters. The idea would be to produce a form of rich-text editing that augments, but doesn’t obscure, the existing structure of a text document.

        I’ve left open the choice of a meta-language, thinking that you’d use this to interleave existing programming languages and plain text. On the web you’d restrict this to some nice, constrained, standardized declarative language with the amount of annotation you prefer to allow.

        1. 7

          To me the attraction of markdown (and latex) is that it’s really just plain text. What I understand is that documents in your proposal are binary documents, readable in the same way Microsoft Word documents are readable: islands of text interspersed with encoded formatting directives.

          1. 1

            I cleaned up the demo that I did and pasted the files onto GitHub. I used it in gvim, but also tried vim. From vim it probably doesn’t do clean pastes without some help from the terminal.


            I guess Microsoft Word documents are harder to read as plain text without a dedicated editor.

            1. 2

              Some thoughts that come to mind

              1. Is the formatting going to be restricted to one or a few characters, or could we have strings of characters representing particular formats, like highlighting, colors, styles etc.?
              2. Will there be complexity like macros, with variables (i.e. a DSL behind it)?

              Depending on how this is set up, I fear you will end up reinventing one of the document formats from the 1990s and 2000s (e.g. the Microsoft doc format). You’ll need a particular piece of software to view the document as intended. Without this software, depending on the complexity of the formatting code, the text part of the document could become unreadable. Pretty soon there will be viruses masquerading as such documents, and so on and so forth.

              I guess I haven’t understood enough of this scheme to be convinced that we haven’t visited this before and found it as unsatisfactory as anything else, if not more.

              1. 2

                I guess that if you stretch it, you could have something like [i color=red:This text is red], but it’s probably something I wouldn’t like to add. I would prefer that occurrences of structures be indexed and then referenced in the program portion to attach attributes to them.

                For example, if you used a Prolog-like language in the program portion, it might look something like this in the text editor:

                Hello [span.marked:Word]
                $ color(span.marked, "red"). ¤

                It’d probably take some time for people to hack the support for the format into their languages.

                There’s a real concern in what you’re saying, though. I can imagine how this could turn into something that can no longer be properly read with a text editor. I’m convinced that an eager enough person, not ready to sit down and think, could easily turn this into yet another WYSIWYG. We’ve got numerous records of similar disasters. :)

                I did look at the .doc format from the 1990s. It’s divided into 128-byte blocks with the text dropped somewhere in the middle. It looked like you’d perhaps be able to leak information with it. But also, you couldn’t pop it open in a text editor and expect to edit it without corrupting the file.

          2. 5

            Markdown was never meant to be something like {SG/HT/X}ML, it was meant to be a lightweight markup language for stuff like comments and blog posts. The fact that you can fall back to HTML is great in my opinion, it means I can integrate those features into my writing quite seamlessly (mostly anchor links and tables).

        2. 13

          You might want to read up on control characters before deciding what control characters you might want to redefine. I know that DC1 and DC3 are still in active use on Unix systems (^S will stop output in a terminal window, ^Q will start output going).

          As far as the “reveal codes” feature of WordPerfect goes, an HTML editor could do the same; HTML tags are a type of “code”, after all. In fact, in the 90s we had just such a program for HTML: DreamWeaver. It worked pretty much like WordPerfect except it used HTML tags instead of control characters.

          1. 2

            That is a gem. Thank you for finding it out for me! I’m going to look at it a bit and see if there’s some selection of control characters that could be reused without a drama.

            1. 1

              I read up on the control characters yesterday. I paid special attention to RFC 20, because the codes seem to have remained mostly the same since then and that document is the easiest to comprehend.

              The transmission and device control characters seem to be the safest to use in text streams. They are used as control signals in terminal software, but otherwise they don’t seem to cause anything in file output. Therefore I can probably use the following block of characters:

              [01-06]     SOH STX ETX EOT ENQ ACK
              [10-16] DLE DC1 DC2 DC3 DC4 NAK SYN

              Emacs, nano, vim and gedit all seem to print these characters gladly and still encode the file as UTF-8. I also opened a Linux console and ‘cat’ed the files in. There were no visible effects, so I guess it’s safe to reuse these codes.

              Most transmissions seem to assume that the stream is binary and anything goes over, so I doubt this would have too many negative effects, except that some other software could no longer reuse the codes. Maybe it wouldn’t hurt to provide some small file header that’s easy to type with a text editor.

              I don’t need this many, so I will probably leave DC1-DC4 without interpretation and only reuse the transmission control characters in this range.

              And regarding DreamWeaver.. I think that’s a bit of a different thing. I’m not after a WYSIWYG editor, but rather something that is good for content processing yet leaves files suitable for examining with a good old text editor.

              The WYSIWYG editing requires that you’re creating poorly structured single-purpose content.

              1. 1

                The WYSIWYG editing requires that you’re creating poorly structured single-purpose content.

                I disagree. It might make it harder, and you have to be more disciplined to do so, but using a WYSIWYG editor does not in itself preclude well-structured content.

            2. 7

              I only actually comprehend how like a third of this fits together, but it’s absolutely bonkers and I love it. It’s hard to grok just ’cause you pack so much stuff together, but fascinating as a peek into how you think.

              1. 7

                In my view, this can be interesting if looked at as a new binary format for documents. I don’t think it is feasible as a text markup language. I believe the latter category has, almost by definition, the important characteristic of being built with “in band” characters from the set available in common text editors. And somewhat counter-intuitively, the full ASCII set is actually not easily available in common text editors, I think.

                But I may well be wrong. It’s always easiest to criticise. I think it mostly depends on what your goal is with this idea; it wasn’t clear to me from skimming the report. And it may be easier or harder to achieve the goal depending on what it is. But it’s for sure interesting as an experiment/exploration, i.e. if the goal is still vague :)

                1. 4

                  Mostly agree with you, but I think it’s less about being available in common editors (editors typically support UTF-8 and are limited only by what fonts you have installed), and more about being available on common keyboards. If your would-be users can’t see where your control characters are printed on their keyboards, it’ll be hard to get adoption, since they won’t know how to type out their documents. The available characters generally are:

                  1. A-Z
                  2. 0-9
                  3. ~`!@#$%^&*()-_=+[{]}|;:'",<.>/?

                  Even international keyboards are often double-printed with the local character set and arrangement, alongside a standard QWERTY character set that you can swap into via your OS.

                  1. 2

                    The Vim editor has a :digraphs table that allows you to add control characters to a text document.

                    I think an ordinary user would most likely use a rich-text editor with this format. I have a proposal for how to make that work, but I guess it’d be easiest to just cobble the thing together in JavaScript and bring it here. I’ll go check whether someone has prototyped a piece-table editor in JS and start from there.

                2. 2

                  Great! Now write a Pandoc reader and your new format just might catch on ;)