Kind of surprising to me that JS engines would be sticking to UTF-16 despite so much content being UTF-8. I wonder if it would be a worthwhile change in practice to do that kind of migration?
JavaScript used UTF-16 because it wanted to look like (and interoperate with) Java, which used UTF-16. Java used UTF-16 because a lot of the designers worked on the OpenStep specification, which used UTF-16. OpenStep used UTF-16 because they added a unichar type back when 16 bits was sufficient for all of Unicode and couldn’t change it without an ABI break. The same story as the Windows APIs.
There are good reasons for UTF-16 (e.g. better cache usage on CJK character sets than UTF-8 or UTF-32), but none of them apply in typical JavaScript.
JSON has the interesting property that it is encoding-agnostic even without a byte-order mark. The first character, as I recall, must be either a { or [, and these have different byte sequences in UTF-8, UTF-16, and UTF-32, in either byte order. Apparently recent versions of the spec require UTF-8, but earlier versions had an optional BOM, and some tools added one. Fortunately, the BOM also has an unambiguous encoding, so you can easily detect it and determine the encoding with or without it.
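The older RFC 4627 spelled this trick out: since the first two characters of a JSON text are always ASCII, the pattern of zero bytes in the first four octets identifies the encoding. A rough sketch of such a sniffer (sniffJsonEncoding is a hypothetical helper, not from any particular library):

    // Hypothetical helper: guess the encoding of a JSON byte stream from its
    // first four octets, handling both the BOM and BOM-less cases described above.
    function sniffJsonEncoding(b: Uint8Array): string {
      // BOM cases (check the 4-byte UTF-32 BOMs before their UTF-16 prefixes).
      if (b[0] === 0xef && b[1] === 0xbb && b[2] === 0xbf) return "utf-8";
      if (b[0] === 0x00 && b[1] === 0x00 && b[2] === 0xfe && b[3] === 0xff) return "utf-32be";
      if (b[0] === 0xff && b[1] === 0xfe && b[2] === 0x00 && b[3] === 0x00) return "utf-32le";
      if (b[0] === 0xfe && b[1] === 0xff) return "utf-16be";
      if (b[0] === 0xff && b[1] === 0xfe) return "utf-16le";
      // No BOM: the first two characters are ASCII, so the zero-byte pattern
      // in the first four octets is unambiguous (the RFC 4627 trick).
      if (b[0] === 0 && b[1] === 0 && b[2] === 0) return "utf-32be";
      if (b[1] === 0 && b[2] === 0 && b[3] === 0) return "utf-32le";
      if (b[0] === 0) return "utf-16be";
      if (b[1] === 0) return "utf-16le";
      return "utf-8";
    }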
0, true, and "foo" are also valid JSON texts. I think some parsers reject them without a flag passed, but AFAICT it's spec-compliant.
This may have changed in more recent versions, but it was not last time I read the spec (>10 years ago). If parsers reject it unless you pass an extra flag, that’s usually a hint that it’s a non-standard extension.
See https://datatracker.ietf.org/doc/html/rfc8259#section-2:

“A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array. Implementations that generate only objects or arrays where a JSON text is called for will be interoperable in the sense that all implementations will accept these as conforming JSON texts.”
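You can see this with the built-in parser, which follows the RFC on this point:

    // Top-level scalar values parse fine.
    console.log(JSON.parse("0"));       // 0
    console.log(JSON.parse("true"));    // true
    console.log(JSON.parse('"foo"'));   // "foo"
    console.log(JSON.parse("null"));    // null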
JSON has always supported raw literals - the “exceptions” are due to JS properties that people think are literals: undefined, Infinity, and NaN. The original JSON parser was actually just “validating” (via regex) strings downloaded from the internet and throwing them to eval. This was such a significant part of the net for such a long time that I made JSC’s eval implementation first try to parse the input as a JSON string before throwing it at the interpreter. Because non-JSON tokens are hit fairly rapidly in non-JSON scripts the cost is negligible, but the win when someone is actually trying to parse giant amounts of JSON is orders of magnitude in both CPU time and memory usage.
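For readers who never saw it, the pattern looked roughly like this (a paraphrase of the old json.js/json2.js approach, not a verbatim copy; evalJson is a made-up name): neutralize escapes, collapse strings, numbers, and keywords with regexes, check that only structural characters remain, and only then hand the text to eval.

    // Sketch of the "validate with regexes, then eval" approach (paraphrased).
    function evalJson(text: string): unknown {
      const stripped = text
        .replace(/\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})/g, "@")  // neutralize escape sequences
        .replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/g, "]") // collapse tokens
        .replace(/(?:^|:|,)(?:\s*\[)+/g, "");                 // drop opening brackets
      if (/^[\],:{}\s]*$/.test(stripped)) {
        // The parentheses force "{...}" to parse as an expression rather than a block.
        return eval("(" + text + ")");
      }
      throw new SyntaxError("not JSON");
    }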
I found the original JSON RFC proposed by Crockford, and even the “implementation” he includes in it supports top-level values. The reality is that most of the problems in JSON’s syntax boil down to Crockford wanting to avoid actually parsing anything and just pass strings to eval, e.g. the aforementioned Infinity, NaN, and undefined “keywords” not being supported, the lack of comments, etc.
Switching JS strings to UTF-8 would have the effect of making all JS ever written very subtly wrong. It would probably take a decade for everyone to successfully migrate.
One language I’m aware of that successfully made the switch is Swift, but it was still relatively early in its life (5 years after initial release): https://www.swift.org/blog/utf8-string/ It was part of a large release with many desirable features, and the compiler for years (still, I think) supported both a Swift 5 and a Swift 4 mode with per-module granularity.
JavaScript strings aren’t UTF-16, they’re UCS-2. Web pages display the content of such strings as if they were UTF-16, but the JS string representation from the PoV of the language is not. There’s a semi-joke spec, WTF-16, that’s used to describe how browsers have to interpret things.
The core issue is that JS strings are exposed to the language as sequences of unrelated 16-bit values, which means they can (and do) contain invalid UTF-16 sequences. Because of that there’s no way to robustly go back and forth between the UCS-2 data and UTF-16 without potentially losing data, and from there you can see why you also can’t go to UTF-8. Note that this wouldn’t have been avoided by character iterators rather than JS’s historical indexing, because the iterators would have been over a sequence of 16-bit “characters” as well \o/
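A minimal illustration of that lossiness, assuming a runtime with TextEncoder/TextDecoder (browsers, Node): a lone surrogate is a perfectly legal JS string value but has no UTF-8 encoding, so a round trip silently replaces it.

    const lone = "\uD800";                          // unpaired high surrogate: valid JS string
    const bytes = new TextEncoder().encode(lone);   // encoded as EF BF BD (U+FFFD), data lost
    const back = new TextDecoder().decode(bytes);
    console.log(back === lone);                     // false: the original value is gone
    console.log(lone.isWellFormed?.());             // false (ES2024 helper, where available)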
Now all that aside, the major production-level JS engines (at this point just SM, JSC, and V8) all do a dual encoding such that any string that doesn’t contain a character value greater than 127 is stored and processed as 1 byte per character. The performance sensitivity of this is such that (via the wonders of templates) JSC at least functionally has 2 complete copies of the JS parser (technically there are 4, because the JSC parser has validating vs AST-construction modes, but the codegen for the validating mode is so much smaller than when building an AST that the AST side is the important bit). Similarly, the regex engines will compile multiple versions of the regex to handle 8-bit vs 16-bit strings.
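A toy model of that dual representation (purely illustrative, and in TypeScript rather than the engines’ templated C++; makeBackingStore and countDigits are made-up names): pick narrow backing storage only when every code unit fits under the threshold described above, and write the processing code once over either width.

    // Toy model of dual-width string storage. In the real engines this is C++,
    // where templates give each width its own compiled code path.
    type Chars = Uint8Array | Uint16Array;

    function makeBackingStore(codeUnits: number[]): Chars {
      const allNarrow = codeUnits.every(u => u <= 0x7f);  // threshold as described above
      return allNarrow ? Uint8Array.from(codeUnits) : Uint16Array.from(codeUnits);
    }

    function countDigits(chars: Chars): number {  // same logic regardless of storage width
      let n = 0;
      for (const u of chars) n += (u >= 0x30 && u <= 0x39) ? 1 : 0;
      return n;
    }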
The latest JSON RFC requires UTF-8.
https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
On a different note, when dealing with JSON recently, one of my largest pain points is when parsers do not error on duplicate keys.
Douglas Crockford, JSON’s inventor, tried to fix this but it was decided it was too late.
Although Douglas Crockford couldn’t change the spec to force all implementations to error on duplicates, his Java JSON implementation errors on duplicate names. Others use last-value-wins, support duplicate keys, or exhibit other non-standard behavior. The JSON RFC states that implementations should not allow duplicate keys, notes the varying behavior of existing implementations, and states that when names are not unique, “the behavior of software that receives such an object is unpredictable.” Also note that JavaScript objects (ES6) and Go structs already require unique names.
Duplicate fields are a security issue, a source of bugs, and a surprising behavior to users. See the article “An Exploration of JSON Interoperability Vulnerabilities”.
Disallowing duplicates conforms to the small I-JSON RFC. The author of I-JSON, Tim Bray, is also the author of JSON RFC 8259.
I tried to convince JSON5 to disallow duplicates but they too thought it was too late for JSON5 as well.