This is a great idea! By way of feedback, I had to read the “Three Options For Displaying Unicode” section a few times. It talks about encoders doing decoding, which I figured out means speculatively decoding the byte string into Unicode codepoints to encode it as a \u{} literal. It also talks about “showing”. Some terminology to refine.
The remark about single quotes being unrepresentable as single quoted strings made me wonder what printf %q does with single quotes. It looks like any (shell) word will do:
I suppose that last one is an additional reason not to consider shell syntax suitable as a general interchange format, though it hasn’t been an issue for prior art like /etc/os-release.
Ah /etc/os-release is interesting! I’ve seen other formats in package managers like this. If your characters are usually limited, this can be OK. Although most non-shell users don’t know that shell strings can be multiline, unlike in Python or C!
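Thanks for the feedback on that section. I rewrote it:
http://www.oilshell.org/preview/doc/qsn.html#how-does-a-qsn-encoder-deal-with-unicode
The new phrasing is much more clear!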
Should the encoding algorithm also treat unprintable characters specially? E.g., always replace them with escape sequences? For example \x1b (ESC), which denotes ANSI control sequences.
This kinda raises the question of what properties QSN is supposed to have. Should it round trip through copy and paste? (ANSI color codes don’t.) Should a QSN string be inert when printed in a terminal? (Probably yes; shell injection is bad.) What about a Unicode LTR marker?
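This strikes me as analogous to escaping of < and > by a JSON encoder: it’s not required by the spec, but some encoders (e.g. Go’s) do it by default, acknowledging that the encoded output may be included in an HTML document without being properly escaped.
Good question, I described it a bit here:
http://www.oilshell.org/preview/doc/qsn.html#which-bytes-should-be-hex-escaped
And I linked the reference implementation.
https://github.com/oilshell/oil/tree/master/qsn_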
If you want to write an implementation (which I think is fun, especially the little UTF-8 state machine which you can see in Python), let me know via any of the contact methods on http://www.oilshell.org/ (e-mail, Zulip, etc.)
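To make the escaping questions above concrete, here is a minimal Python sketch of an encoder that keeps its output inert in a terminal. It is not the reference implementation: the names qsn_like_encode and _decode_one, the unicode_escapes flag, and the exact set of backslash escapes are assumptions made for illustration. It also leans on Python's built-in strict UTF-8 decoder instead of a hand-written state machine.

```python
# Hypothetical sketch, NOT the reference implementation.
# Assumed escape set: backslash, single quote, \n, \r, \t; everything else
# unprintable (or not valid UTF-8) becomes \xNN.
_SIMPLE = {0x5C: "\\\\", 0x27: "\\'", 0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}


def _decode_one(data: bytes, i: int):
    """Speculatively decode one UTF-8 code point starting at data[i].

    Returns (char, width) on success, or None if the bytes are not valid UTF-8.
    """
    lead = data[i]
    if 0xC2 <= lead <= 0xDF:
        width = 2
    elif 0xE0 <= lead <= 0xEF:
        width = 3
    elif 0xF0 <= lead <= 0xF4:
        width = 4
    else:
        return None  # stray continuation byte or invalid lead byte
    try:
        # The strict decoder rejects truncated, overlong and surrogate sequences.
        return data[i:i + width].decode("utf-8"), width
    except UnicodeDecodeError:
        return None


def qsn_like_encode(data: bytes, unicode_escapes: bool = False) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in _SIMPLE:
            out.append(_SIMPLE[b])
            i += 1
        elif b < 0x20 or b == 0x7F:
            out.append("\\x%02x" % b)  # control bytes such as ESC
            i += 1
        elif b < 0x80:
            out.append(chr(b))  # printable ASCII passes through
            i += 1
        else:
            decoded = _decode_one(data, i)
            if decoded is None:
                out.append("\\x%02x" % b)  # invalid UTF-8: escape the raw byte
                i += 1
            else:
                ch, width = decoded
                out.append("\\u{%x}" % ord(ch) if unicode_escapes else ch)
                i += width
    return "'" + "".join(out) + "'"


print(qsn_like_encode(b"a\x1b[31mb"))       # 'a\x1b[31mb'
print(qsn_like_encode("μ".encode(), True))  # '\u{3bc}'
print(qsn_like_encode(b"\xff"))             # '\xff'
```

With unicode_escapes=True, an invisible character such as an LTR mark would show up as a visible \u{200e} escape. Whether that should be the default is exactly the kind of property question raised above.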
Use single quotes instead of double quotes to surround the string (to avoid confusion with JSON)
Unfortunately, this also causes confusion because shell (and perl, ruby, php, raku, …) use single quotes to indicate literal strings and double quotes to indicate escapable strings. Since QSN’s primary consumer is intended to be oil, a shell replacement, I think it’s more important that it be familiar to users of shell and scripting languages than that it be familiar to users of JSON.
This is good feedback, and makes me realize that the doc is unclear. The primary consumer is not meant to be Oil!
Analogously, the primary consumer of JSON is not JavaScript (anymore). JSON is used by systems entirely outside of the browser, written in Python, Erlang, etc.
So QSN is like UTF-8 or HTML: it’s supposed to be emitted and consumed by tools written in any programming language.
That is why I said “interchange format”, although I will elaborate on this in the doc.
Also, I considered literally using shell strings, but I used Rust strings instead because:
Single quoted strings in POSIX shell can’t represent arbitrary byte strings. In particular, ANY string containing a single quote! Representing arbitrary byte strings is the primary requirement of QSN!
The bash $'foo\n' extension can, but it has cruft like octal escapes (are they 3 or 4 digits? bash is inconsistent) and \v for vertical tabs. Rust strings clean all that up.
Double quoted strings have extra syntax that QSN doesn’t have: $var, $(command sub), etc.
So QSN can’t literally be any kind of shell string. (Unlike JSON, which is literally a JavaScript string.)
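As a small illustration of the single-quote limitation, here is a Python sketch using the standard library's shlex.quote, which produces POSIX shell words. This is only a way to observe the behavior, not something the QSN doc prescribes. A value containing a single quote comes out as several quoted pieces glued together rather than as one single-quoted literal.

```python
import shlex

# A value without a single quote fits in one single-quoted string.
print(shlex.quote("foo bar"))  # 'foo bar'

# A value with a single quote cannot: the quoting has to be closed,
# the quote emitted inside double quotes, and the quoting reopened.
print(shlex.quote("it's"))     # 'it'"'"'s'

# The result is still one shell word, and it round-trips.
for word in ["foo bar", "it's"]:
    assert shlex.split(shlex.quote(word)) == [word]
```

A reader that only understands one pair of single quotes around the data would not be able to parse the second form.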
I decided to break compatibility with Rust strings because otherwise json.loads(qsn_string) would work 99% of the time, for example on:
"foo bar"
"1\t\n"
"\""
Those are all valid JSON, and they would be valid QSN except for the quotes. I want it to be a syntax error 100% of the time.
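A quick Python check of that claim, as a sketch: the double-quoted forms parse as JSON, while the same text in single quotes is rejected. The helper name parses_as_json is made up for this example.

```python
import json


def parses_as_json(text: str) -> bool:
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Double-quoted (Rust-style) strings are accidentally valid JSON.
for text in ['"foo bar"', '"1\\t\\n"', '"\\""']:
    print(text, parses_as_json(text))  # True each time

# The same data in single quotes is always a JSON syntax error.
for text in ["'foo bar'", "'1\\t\\n'"]:
    print(text, parses_as_json(text))  # False each time
```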
If you’ve ever seen the monstrosities people come up with in JSON, like JSON double-escaped inside JSON strings, you might see why. There is a lot of “string confusion” out there.
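For anyone who hasn't run into it, here is a small Python sketch of what double-escaping looks like. The record contents are made up for illustration.

```python
import json

record = {"note": 'say "hi"\n'}

inner = json.dumps(record)              # JSON text
outer = json.dumps({"payload": inner})  # JSON whose string value is itself JSON

print(inner)
print(outer)  # every quote and backslash from `inner` is escaped again

# Reading the data back takes two separate decoding passes.
assert json.loads(json.loads(outer)["payload"]) == record
```

Each extra layer multiplies the backslashes, and nothing in the outer document says that the string is supposed to be parsed a second time.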
Thanks for the feedback! More comments/questions are welcome.
FYI I updated the doc: http://www.oilshell.org/preview/doc/qsn.html
Let me know if it doesn’t make sense!