The author proposes that JSON parsers should “accept comments”, without specifying what that means. Should we care about interoperability between different JSON parsers? For a different perspective, read Parsing JSON is a Minefield 💣. Quote:
I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services.
My point is that non-standard extensions like “comments”, especially in the absence of a precise specification, make this problem worse. Different parsers that “accept comments” will probably not agree on comment syntax, and this disagreement may cause problems or even open security holes.
For example, consider a Javascript “//” comment that continues to the end of the line. If the comment contains a CR character (aka ‘\r’ or U+000D), and the CR is not immediately followed by NL (aka ‘\n’ or U+000A), then does the CR terminate the comment, or does the comment continue to the next NL character?
One way you could react to this question is: whatever, it doesn’t matter, my app will never generate JSON that looks like this.
But to a person who crafts security exploits, this looks like an opportunity to add “cloaked” JSON elements which are processed by some JSON parsers and ignored by other JSON parsers. That kind of difference in interpretation can be the beginnings of a security exploit.
It’s not just theoretical. I skimmed the code – the linked parsers have exploitable differences in how they handle line endings. In addition to \r and \n, JSON5 supports \u2028 and \u2029 as line terminators; dkjson.lua does not, so a // comment containing \u2028 will be interpreted differently by the two parsers. Swift Foundation implements a third behavior; it assumes ‘\n’ and ‘\r\n’ are the only valid line endings, so:
{
// maybe a 2-line comment \r "user_id": 2
}
will be a one-entry object when parsed by JSON5, dkjson.lua, and SQLite, but an empty object when parsed by Swift Foundation.
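The CR ambiguity is easy to reproduce. Here is a sketch in Python, with two toy comment strippers standing in for the divergent parsers (hypothetical helpers, not anyone's real code; they ignore comments inside strings and other edge cases), fed the exact document from the example above:

```python
import json

def strip_comments_cr_ends(text):
    """Toy stripper: a bare CR ('\r') terminates a // comment."""
    out, i, n = [], 0, len(text)
    while i < n:
        if text.startswith("//", i):
            while i < n and text[i] not in "\r\n":  # stop at CR or NL
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def strip_comments_nl_only(text):
    """Toy stripper: only '\n' terminates a // comment, as Swift
    Foundation is described as assuming above."""
    out, i, n = [], 0, len(text)
    while i < n:
        if text.startswith("//", i):
            while i < n and text[i] != "\n":
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# The "maybe a 2-line comment" payload: a bare CR, not followed by NL,
# inside a // comment.
doc = '{\n// maybe a 2-line comment \r "user_id": 2\n}'

obj_a = json.loads(strip_comments_cr_ends(doc))
obj_b = json.loads(strip_comments_nl_only(doc))
print(obj_a)  # {'user_id': 2}
print(obj_b)  # {}
```

Same bytes, two different objects: one implementation sees a key, the other sees nothing.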
I couldn’t be bothered to dig through the Jackson code.
I think there are 3 different use cases for JSON here:
One is for serialization. In this case, yes, comments are bad because they’re complicated and not defined in the spec. But also, why would you need comments in serialization?
One is for interop between different programs, which is a similar case to the above. Here you may want comments, but it is best to avoid them (unless you’re using something like _comment: "my comment"), since you never know whether the other implementations will behave similarly.
A third use case is something more interactive, e.g. configuration files. In this case you probably don’t care about other consumers, and having comments can be really useful. If we assume those files are edited by humans, trailing commas are also useful for similar reasons.
But to a person who crafts security exploits, this looks like an opportunity to add “cloaked” JSON elements which are processed by some JSON parsers and ignored by other JSON parsers. That kind of difference in interpretation can be the beginnings of a security exploit.
This also depends on the context. For the third use case, in many cases, if someone has the privilege of writing your configuration files, you have already lost.
Using comments in your JSON doesn’t mean including comments in spec-compliant JSON!
The gift the JSON spec is giving you is that by not allowing comments it’s telling you what to do with the comments: throw them in the garbage in some step before they get to the real JSON parser. In this way, many kinds of comments can be supported (properly, without vulnerability) because once they’re gone they really can’t possibly mean anything (solving the security problem).
JSON leaves so much syntax unclaimed that %, #, //, /* */ and <!-- --> are all equally valid ways to add unambiguous comments to your JSON. That is to say, since these syntaxes don’t conflict with JSON’s own syntax, there’s no chance of a real confusion where a comment is interpreted as valid data or vice versa. The fact that JSON absolutely rejects comments and other unexpected syntax is, paradoxically, what makes it work with every flavor of comments, so long as you can define what it means to strip them out.
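As a sketch of that pipeline: a hypothetical `strip_json_comments` preparser (illustrative only, not hardened against every edge case, though it does skip comment markers inside strings) discards // and /* */ comments, and the standard parser does all the real work afterwards:

```python
import json

def strip_json_comments(text):
    """Remove // line comments and /* */ block comments, but never
    inside JSON string literals. A sketch, not a hardened parser."""
    out, i, n = [], 0, len(text)
    in_string = False
    while i < n:
        c = text[i]
        if in_string:
            out.append(c)
            if c == "\\" and i + 1 < n:   # keep escaped chars verbatim
                out.append(text[i + 1])
                i += 1
            elif c == '"':
                in_string = False
            i += 1
        elif c == '"':
            in_string = True
            out.append(c)
            i += 1
        elif text.startswith("//", i):
            while i < n and text[i] != "\n":  # one fixed rule: \n ends it
                i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(c)
            i += 1
    return "".join(out)

doc = """{
    // a line comment
    "key": "value",   /* a block comment */
    "url": "http://example.com"
}"""

# Comments are gone before the real parser ever sees the text.
print(json.loads(strip_json_comments(doc)))
# {'key': 'value', 'url': 'http://example.com'}
```

Once the stripping step has run, the strict parser's guarantees apply unchanged, which is exactly the argument being made here.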
I think you missed the point, which was that anyone implementing such a lax parser, whether that’s via a “stripping function” or a parser which skips them natively, will do it in a slightly different way and this can cause conflicts which may be exploited. This would simply add more problems to the existing security issues JSON already has, like where parsers disagree on what to do with non-unique keys.
Can you provide a concrete example? In my mind the vulnerability is completely addressed by a stripping preparser (in a way that it is very much not addressed by a lax parser)
orib’s example of existing JSON5 parsers parsing comments differently could just as easily happen with multiple comment-stripping JSON pre-parsers. The pre-parsers could disagree on whether to strip text after a // and after one of \u2028 or \r but before \n.
To be specific, the reason the problem goes away is that after the stripping parser is done, the 100% formal, very standard JSON parser takes over, bringing with it the exact same set of security guarantees that we have currently.
This is kind of pointless. Standards aren’t about authority figures telling you what to do; they’re for interoperability and concise naming. No one is saying you can’t invent or use a “JSON + comments” syntax, it just won’t be JSON anymore. I don’t think a JSON parser should accept these extensions by default, since it’s then not parsing JSON. You can get into big trouble (up to and including kernel exploits) when your system has multiple parsers for the same syntax that differ in what they allow.
We already have a couple of these, notably JSON5. I’m not sure why you wouldn’t just adopt that: it adds several other JS features that make config files and other human authoring nicer, like optional trailing commas, single quotes, and unquoted object keys.
Honestly, if you just want a more comfortable JSON and you control the parser, why not use toml or yaml or any other text data format that was designed from the ground up to be comfortable to write and read?
I think what you said is too simplistic to be useful as an argument in an engineering context. Neither usefulness nor goodness is a single axis. We aren’t Descartes, trying to derive the universe’s mechanics from first principles; we’re trying to build systems that do a particular thing, for a particular person, to a certain end. The work is all about tradeoffs. But your comment essentially just said “having comments would be better,” without any justification or any examination of the ways in which having comments might actually be worse.
You can put comments in JSON files if you know they’re going to be read only by a reasonable parser. If you maintain a parser, you can make it accept comments and nothing bad will happen!
Then by definition it’s not JSON, but a superset such as JSON5 or VS Code’s JSON with comments syntax, because the closest we have to an official JSON specification doesn’t specify comments are supported.
While adding comments to JSON would be nice, it’s by far the least annoying part of dealing with the JSON syntax and writing a parser for it. If you also want to solve those issues, you’ll have to make certain decisions that ultimately result in something very different from JSON.
You, along with several others in this comment section, have completely missed the point of the article. It is pretty clear (at least, imo) that the author is not arguing that adding comments to JSON is conformant to the spec. They are just saying that, crucially, what you do doesn’t have to conform to the spec if that isn’t necessary.
It is absolutely possible for something bad to happen if you start accepting comments in JSON. Without a spec, different parsers could define different parsing behaviour and parse the same input differently.
This can then lead to security vulnerabilities when two systems interpret the same payload differently.
Off the top of my head, here’s an example:
{
  "user_id": 1,
  // /*
  "user_id": 2,
  // */
  …
}
A parser supporting single line comments only may read the value of user_id to be 2 whereas a parser supporting both single and multiline comments may interpret user_id to be 1.
You may argue that the single line comments should cancel out the multiline comment, but that is an arbitrary decision you would need to have a spec clarify.
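This really is an implementation choice, not a contrived one. A sketch with two toy strippers (hypothetical, and ignoring strings for brevity) reproduces exactly the split described above on a filled-in version of that payload:

```python
import json, re

def strip_line_comments_only(text):
    """Toy stripper supporting only // line comments."""
    return re.sub(r"//[^\n]*", "", text)

def strip_block_then_line(text):
    """Toy stripper supporting both styles, removing /* */ blocks
    first -- a plausible but arbitrary ordering choice."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    return re.sub(r"//[^\n]*", "", text)

doc = """{
  "user_id": 1,
  // /*
  "user_id": 2,
  // */
  "active": true
}"""

# Python's json.loads keeps the last value for a duplicate key.
print(json.loads(strip_line_comments_only(doc))["user_id"])  # 2
print(json.loads(strip_block_then_line(doc))["user_id"])     # 1
```

Note that a both-styles stripper that scans left to right would see the // first, never open the block, and also answer 2 — precisely the kind of decision only a spec can pin down.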
that is an arbitrary decision you would need to have a spec clarify.
It’s not arbitrary. There is only one defensible way to treat comments in Javascript Object Notation and that is the same way they are treated in Javascript.
I don’t think you will be able to get multiple parsers to agree on all the subtleties here without a spec and/or an associated test suite. doug-moen above mentions a more nuanced case:
For example, consider a Javascript “//” comment that continues to the end of the line. If the comment contains a CR character (aka ‘\r’ or U+000D), and the CR is not immediately followed by NL (aka ‘\n’ or U+000A), then does the CR terminate the comment, or does the comment continue to the next NL character?
Nobody has specified this in the context of JSON. Parser implementors were simply told “here’s some JSON with comments, you need to parse this now”. People tend to be relatively lazy about reading specs at the best of times, and “JSON is kind of like Javascript, so you should think to read the ECMA spec” isn’t the best of times.
That’s going to lead to subtle differences in behavior.
And, there already are. From a 5 minute glance at the linked JSON libraries, the Javascript JSON5 accepts \u2028 and \u2029 to terminate a multi-line comment while dkjson.lua does not, and only accepts \r and \n. Swift Foundation’s parser implements yet another behavior and only recognizes ‘\n’, assuming that ‘\r’ on its own does not terminate a line.
I think you’re missing the point. From @technomancy above:
There is only one defensible way to treat comments in Javascript Object Notation and that is the same way they are treated in Javascript.
That’s the answer. Will people supporting comments in JSON as an extension do it wrong because they didn’t read the spec? Yes. Are they wrong? Also yes. When they are wrong, you tell them they are wrong and kindly open a PR to fix it (as suggested on the original page). Nothing bad will happen.
The more I think about this, the more I don’t think it’s a problem; the argument is predicated on saying that it’s reasonable to have one system parse the document, validate it, throw away the parsed version, and then pass the original on to a second system. This fundamentally does not make sense; the second system should operate on the result of the first system as the first system interpreted it, not the original data.
It’s only by making the mistake of passing the original data on in the first place that you can have a situation where differences in the two parsers can cause security problems. Instead of insisting both parsers behave exactly the same way, the much simpler solution is to stop passing thru the original data.
It’s only by making the mistake of passing the original data on in the first place that you can have a situation where differences in the two parsers can cause security problems. Instead of insisting both parsers behave exactly the same way, the much simpler solution is to stop passing thru the original data.
That works if you get to design the whole system and ensure nothing else ever touches your data, but users and requirements often mean multiple systems will work on the same data. I think the concern is: you might say “sure, I’m going to read and write my JSON and I’ll support comments like this”, but in reality, if you’re writing something like JSON somewhere, there’s a very good chance someone else will come along and decide to JSON.parse it in their language of choice, completely separate from you.
The subtlety is that you need people independently to
Agree to use the ECMAscript comment standard for this adjacent language
Agree with what parts of the standard apply (many JSON parsers accept octal number literals which are valid in JavaScript but not most JSON specs)
Correctly implement that standard without a standardized test suite
Cross-implementation compatibility is already a problem for JSON without these extensions; see e.g. Parsing JSON is a Minefield. I’ve also personally felt this pain while trying to migrate between two JSON implementations for the same language.
It never occurred to me that anyone might think there was any ambiguity that the comments for JavaScript Object Notation should be the same as the comments for JavaScript.
There are a lot of adjectives you could use to describe YAML parsers, but “reasonable” would not be near the top of the list for me. But yes, any json5 parser would qualify for the “reasonable” list! I didn’t list any because I have never used one, but I’m open to suggestions.
I propose an experiment: tell 5 programmers to implement “same comments as JavaScript”, and see what you get :)
Personally, I think Crockford did a great job with JSON – it’s WAY better, and more precisely specified, than say CSV, and other ad hoc formats.
He wrote it down, precisely and legibly, and people agreed on it. There are like 100 implementations on https://json.org/
That is hard. Agreement is hard in two senses:
in the intention
in the implementation
and JSON largely achieved agreement.
It’s not perfect, and I even have my own extension to JSON – to fix the JSON-Unix mismatch, i.e. to allow all filenames and other strings from the kernel to be represented.
But I still recognize and respect exactly what JSON is.
I also agree with other commenters – you can use JSON5 or JSONC if that’s what you want, with a different file extension / MIME type. The battle has already been “lost”.
I don’t see this as something worth fighting now, and I don’t think it works in practice.
Also see the recent thread about Postel’s law – emitting JSON with comments doesn’t respect Postel’s law, because you’re not being conservative in what you emit.
And BTW a long time ago I wrote comments in a JSON file, and the world didn’t end, for sure. No RFC police came and arrested me.
I simply removed the comments first in the code that consumed the file. I probably used sed to be honest.
But I didn’t advocate that JSON be changed, or that JSON parsers be changed. So I agree with you in one sense, that people shouldn’t worry about small things, but I also think you’re going overboard, when solutions already exist.
I prefer the following if I need to put comments into a JSON that will be read by a relatively strict parser:
{
"#comment#1": "val here is lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
"key": "val",
"#comment#2": "Acceptable values for foo: bar|baz",
"foo": "bar"
}
This, of course, will not work if the system not only uses a strict parser but also does strict schema checking, disallowing unrecognized comment keys. Though I have more often encountered systems with a strict JSON parser than systems with strict schema checking.
While I agree with the sentiment, that Crockford was absolutely wrong on this subject if not also many others, I feel that:
the reason to have comments is for people reading and writing the format, rather than for machine-to-machine uses of JSON
you’re going to have to modify a lot of parsers to make this broadly useful
at that point, you may as well use a different format altogether which solves all the other unfortunate parts of actually having to read and write JSON; e.g., KDL 2 is shaping up to be pretty ergonomic for people
the reason to have comments is for people reading and writing the format, rather than for machine-to-machine uses of JSON
I didn’t really want to get into in on the site because I wanted to keep it punchy, but the reason JSON is typically not used for config files is specifically that it doesn’t have comments, which pushes people towards using YAML, which leads to anger, which leads to the dark side, etc.
If this problem didn’t exist (if parser authors just made them tolerant of comments in defiance of the spec) then I think you’d see a lot less use for YAML, which would be a wonderful situation with Norway downsides.
(This doesn’t really have anything to do with machine-to-machine JSON.)
you’re going to have to modify a lot of parsers to make this broadly useful
Nope! JSON config files will almost always only be read by one parser. So each individual program can switch away from using YAML to using JSON, and each one that does is a step in the right direction regardless of how many others do it.
First, yes, YAML is atrocious; friends don’t let friends, etc.
the reason JSON is typically not used for config files is specifically that it doesn’t have comments
This part I’m not especially sure about. I think if you look at where people have pushed into YAML particularly hard, one of the main features they’re actually looking for is the ability to reproduce a long multi-line string without weird quoting. In many cases this is used to insert, say, a shell script for a stage in a build pipeline into a CI configuration document. There are no particularly good options in JSON for this. There are also often no ways to reliably specify, say, a 64-bit unsigned integer. Objects also don’t have a type name, like nodes in an XML document, so you have to work around this by having a “type” field if you need any particular structure.
There’s just a lot about JSON that is not good for, specifically, configuration files.
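The multi-line-string pain is easy to show with the standard library: JSON's only representation of a short shell script is one string with every newline and quote escaped (the YAML in the trailing comment is shown for comparison only):

```python
import json

# A short build script someone might want in a CI config.
script = """#!/bin/sh
echo "building..."
make all
"""

# JSON has exactly one way to carry this: a single escaped string.
print(json.dumps({"run": script}, indent=2))
# {
#   "run": "#!/bin/sh\necho \"building...\"\nmake all\n"
# }

# The YAML block-scalar form people reach for instead:
# run: |
#   #!/bin/sh
#   echo "building..."
#   make all
```

The data survives the round trip fine; it is the human readability and editability of the escaped form that suffers.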
Nope! JSON config files will almost always only be read by one parser
If that’s true, then you’ve nullified really the only advantage JSON has over something else: that it can be read, edited, and formatted by other tools like jq, or in most scripting languages and configuration management systems!
If you’re going to improve the situation, you may as well go all the way! Drop the need for the death by a thousand double quotes, allow trailing commas, support proper dedented multiline strings, etc. Something like UCL (Universal Configuration Language) has a direct translation to a JSON data model internally if you want a relatively minimal change. Or, though it’s not my cup of tea, HashiCorp configuration language (HCL) is similarly JSON-equivalent but with more ergonomics.
Personally I think KDL is a big win for configuration files, especially with the ergonomic improvements that are coming in the nearly released KDL 2. The model is a lot more like XML, with the ability to have typed nodes and a proper understanding of being a document format rather than just a serialised JavaScript object – but without the soul crushing syntactic misdirection of XML itself.
which would be a wonderful situation with Norway downsides.
Who’s still using YAML 1.1 or prior when YAML 1.2 came out in TWO THOUSAND NINE? This hasn’t been a real problem in FIFTEEN YEARS, but always gets brought up to bash on YAML. Pick a real complaint, there’s plenty.
JSON is a serialization target. It’s an awful config format.
Python’s most popular YAML parser, PyYAML, “features a complete YAML 1.1 parser”.
When I search for how to parse YAML in Java, I find recommendations for either SnakeYAML, a YAML 1.1 parser, or Jackson. Jackson’s master branch is working to integrate a YAML 1.2 parser for v3, but the latest released version of Jackson is still v2, which uses SnakeYAML.
When I parse YAML using the Ruby standard library (Psych, which is built on libyaml), I likewise get YAML 1.1 behaviour.
When I search PHP’s docs for “YAML”, I find yaml_parse. A comment says that the function uses the YAML 1.1 standard.
So it looks like most programming languages that existed before YAML 1.2 was released have still not migrated. Most users of these languages are parsing YAML with the 1.1 standard. For them, warnings about problems with YAML 1.1 are still relevant.
Excellent to list popular implementations here. I had a look at what the yaml egg for CHICKEN uses. It uses the libyaml C library, which looks like it is the one that PyYAML is based on or at least can optionally use when speed matters. It looks like they still have an open issue for tracking YAML 1.2 support. Since this is a C library with quite the official-sounding name, it is very likely other languages have bindings for libyaml as well.
Momentum is hard to kill. People should have moved to ruamel-yaml ages ago. Popularity does not mean something is correct. SnakeYAML also has a correct 1.2 parser in its package.
I don’t really find the lack of comments to be a problem. If you really need one, just add a string value with a key of “comment” or “note” or something.
I’ve heard this suggestion a lot over the years, and I’m generally opposed to it for a few reasons:
It’s not even in the neighbourhood of aesthetically pleasing.
The most important comments I write end up being more than one line, and there is no good way to do that in JSON.
It inhibits your ability to rigorously validate the schema of a document: any misspelt or merely invalid key, or unexpected object structure, should be reported as an error! This is especially critical for configuration files.
I think I find it more aesthetically pleasing than most comments. It fits the same regular pattern as everything else in the file. It certainly makes it easier to pick out subsets of values including comments with jq.
As for multiline comments, my text editor supports line wrapping ;)
It does make schemas uglier though: having to add optional “comment” keys all over the place is a bit shit.
I don’t have very strong feelings about this. I tend to like JSON because, in most of the places where JSON is an option, the alternatives offered are YAML or XML. I don’t find JSON’s lack of comments sufficiently painful to make YAML or XML preferable.
However, if something better is available, I’m not going to fight hard for JSON over, say, TOML or Dhall or something.
I see your reasoning, and while I’d love something better than JSON, I’ll challenge that this is an area where it’s easy to make the perfect the enemy of the good.
“JSON with one or two small changes standardized” feels like an accessible goal to see the industry adopt en masse. It’s forward compatible and requires the least code churn and retraining, and any actively-maintained project will eventually support receiving JSON++ files just by updating its dependencies or runtime.
Trying to anoint a JSON successor (or worse, having a war of succession) has a high risk of people deciding the successor isn’t better enough to be worth the effort of migrating. Especially in cases where that means getting third parties on board with your project’s change.
I like how there are no supported comments in JSON for a few reasons:
JSON can be loaded/dumped to/from programming language structures without worry that something was lost in translation.
There are no worries that someone uses a directive like #noqa or #ignore to mark the JSON data for processing, which I heard was the stated reason for removing comments from the original JSON spec.
It’s just simpler. “Comments” can already be supported by making a key/value called “comment” or something, which works fine without having to support a convention for comments.
I am a fan of comments in programming languages in general, but I’m glad JSON was kept simpler without them. The lack of trailing commas in arrays, on the other hand…
“Comments” can already be supported by making a key/value called “comment” or something, which works fine without having to support a convention for comments.
How would you comment individual array elements?
Even if you only need to comment key–value pairs, your suggestion relies on the convention that the consuming program will quietly ignore any keys it doesn’t recognize. If that isn’t true, then you can’t use this technique for comments. (And even if it is true, what happens when a new version of the application adds a new field to the configuration and starts assigning meaning to what was supposed to be a comment?)
your suggestion relies on the convention that the consuming program will quietly ignore any keys it doesn’t recognize…
Yes. A JSON “comment” relies on the comment being parsed with the rest of the data, which means it can impact the rest of the program. I’m arguing that this is good. It means an application that requires comments has to choose a data structure and handle that complexity instead of universally putting that complexity into the parser. What is gained is the lack of a universally independent way to access comments, for example, when reading a JSON object as a Python dictionary.
How would you comment individual array elements?
I do not have a good way to do this. In practice, we pick another place for comments, such as Git commits, a README file, or break it up over “self documenting” object key names themselves. This last one admittedly only works when we own the schema.
The first point already isn’t really true - I’ve tried to maintain structure in a package.json file before, without comments, by ordering the data in a certain way, or by inserting newlines to make things easier to read. As soon as the NPM parser reads that in and writes it out again, that structure gets lost. So even with the current minimal tools that we have, naive JSON parsing causes things to get lost in translation.
Moreover, it’s not necessarily impossible to fix this. For most situations where you’re just passing data around, a naive parser is sufficient. But when you’re trying to parse a config file, you can switch to a parser that internally records the structure of the file, comments, indentations, whitespace and all, and makes minimal modifications to that structure when writing the data out again. This is, for example, what Rust’s Cargo does with TOML files. It requires a more complicated parser, but it works well.
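The first point's failure is reproducible with nothing but Python's standard `json` module: a naive load/dump round trip, the kind a package manager performs when it rewrites a file, preserves the data perfectly while discarding the hand formatting (a Cargo-style format-preserving parser avoids this by keeping whitespace and comments in its model):

```python
import json

# A hand-formatted config: aligned values and a deliberate blank line.
original = """{
    "name":    "demo",
    "version": "1.0.0",

    "scripts": { "build": "make" }
}"""

# Naive round trip, as a tool might do when it rewrites the file
# after adding a field.
roundtripped = json.dumps(json.loads(original), indent=2)

print(roundtripped == original)  # False: alignment and blank line are gone
```

Key order happens to survive (Python dicts preserve insertion order), but the alignment and the blank line do not, and no comments were even involved.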
The second point is theoretically possible, but I’ve never really seen it come to be, not even in YAML which has comments and has been abused to essentially become a full programming language in many situations. Most tools either use normal YAML syntax to essentially represent an AST, or they use a templating language to generate YAML, completely bypassing the comments. In neither case do comments get used as directives.
The third point, as others have pointed out, is only possible if the parser chooses to accept and ignore arbitrary key–value pairs, and even then it can only comment certain nodes, only as single-line strings, and incredibly uneconomically. It is surely the worst of all options - it has exactly the same theoretical downsides as the two previous points, but it’s uglier, more difficult to write, and more brittle.
I am not entirely convinced by the original article - as others have pointed out, there are already security risks associated with the different approaches taken by different parsers, and adding more variation feels like asking for trouble. But I am very convinced that the three arguments you’ve given (which come up every time this topic comes up) are brought up more out of rote habit than because they are good arguments.
Going back to a discussion from a few days ago, obviously the correct thing to do is just put comments and trailing commas in your JSON and when someone rejects it demand they rewrite their parser to obey Postel’s Law (which I’m told is a universal good, and which JSON notably explicitly refuses to follow).
Q. What's an RE?
A. RE is an acronym for Record End, which is sort of like a
newline, only different.
Goldfarb's First Law of Text Processing states that:
"... if a text processing system has bugs, at least one of
them will have to do with the handling of input line endings."
[The Handbook, footnote p. 321]
The Record End concept was introduced to make sure that SGML
parsers don't violate Goldfarb's First Law.
You can put comments in JSON files if you know they’re going to be read only by a reasonable parser. If you maintain a parser, you can make it accept comments and nothing bad will happen!
There are a lot of comments here that are nitpicking but in my experience this is totally correct. Turborepo, for example, decided to use comments in their config .json file and it’s fine. The world didn’t end. The RFC police didn’t arrest them. There were no casualties.
As far as I can tell, the security risks are around “what if someone implements this incorrectly, then two different parsers could disagree about the presence of a key”. I’m not going to remove that wording, but I will add a footnote mentioning the potential downsides that apply outside the context of config files if you do it wrong.
If Douglas Crockford actually is your dad then please do not read this web site.
It’s not often that I literally laugh out loud while reading a technical post, but this time I truly did. I also do love that this is at the bottom of the page 😆
My view is, if we’re throwing out support for interoperating with embedded/retro-hobby JSON encoder/decoder X which implements RFC 8259/ECMA-404/ISO 21778:2017, and our goal is to make configuration files, we might as well use TOML instead.
It’s got date/time/timestamp literals, it’s been supported by the Python standard library since 3.11, it reads like a richer version of .ini files, and the Rust toolchain uses it, so it’s “hot new thing” compliant too.
But I specifically constrained what I said to “and our goal is to make configuration files”.
That’s not “many json use cases” and it adds a design constraint that means that, even if TOML’s not the best for nesting, the overall experience often still makes it more comfortable. (To the point where, when I’m configuring things, if the degree of nesting feels uncomfortable in TOML, I treat that as a code smell and a sign to take a second look at how I’m structuring my schema.)
Ah, I missed that you narrowed your comments. I still think nested object support is worthwhile as it handles namespaces once instead of for every subkey, but there is less potential for overlap.
I’m kind of torn on it. On the one hand, using curly braces in a config language is always going to be a drag on convenience but, on the other hand, even if YAML weren’t an overcomplicated design with too much room for edge cases to hide, I still trip over the same footguns in Ansible’s config files that I did in CoffeeScript, and do when doing outline lists in Markdown and reStructuredText.
Indentation-based markup without even Python’s block-starting colons to provide an “ISBN check digit” of sorts is just too error-prone.
I think, if I were designing a better configuration language, I’d probably go with something that resembles TOML, but with some kind of block-structuring extension analogous to what SASS’s nesting support does with CSS to allow brace-delimited nesting to be comfy without losing the comfort on the non-nested parts.
recurring phenomenon - “this belongs in a filesystem!”, “can’t we just write a FUSE module‽”, some coding and screaming later “why is this even a thing?” abort | retry | fail.
Phil worked for Heroku back in the day. A neat trick that we do: YAML is a superset of JSON and it supports comments. So we parse as YAML then as JSON (IIRC, I don’t maintain a tool that directly interacts with customer JSON).
The meta is interesting: lots of people commenting for or against it. This feels like prescriptive versus descriptive linguistics. The standard says “Ain’t ain’t a word” but the people do. Hopefully standards converge to match consensus just as Webster now has an entry for “Ain’t” but taking an absolutist position on either side is likely not the answer.
I feel like OP has a pragmatic view of “here’s what I want to see more of, let’s talk about it and promote it, even if it’s not technically spec compliant.” Everyone arguing with the title cannot see the forest for the trees. He’s not literally arguing against DC, he’s advocating for something he wants to see in the world. It’s a technical “yes, and” and I like it.
But before you contact me about it, yes, I know about json5 and jsonc. Those are fine; not bad necessarily, but just a bit unnecessary, because there are no downsides to making a regular JSON parser accept comments.
The author proposes that JSON parsers should “accept comments”, without specifying what that means. Should we care about interoperability between different JSON parsers?
For a different perspective, read Parsing JSON is a Minefield 💣. Quote:

I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services.
My contribution is that non-standard extensions like “comments”, especially in the absence of a precise specification, make this problem worse. Different parsers that “accept comments” will probably not agree on comment syntax, and that this may cause problems or even open security holes.
For example, consider a Javascript “//” comment that continues to the end of the line. If the comment contains a CR character (aka ‘\r’ or U+000D), and the CR is not immediately followed by NL (aka ‘\n’ or U+000A), then does the CR terminate the comment, or does the comment continue to the next NL character?

One way you could react to this question is: whatever, it doesn’t matter, my app will never generate JSON that looks like this.

But to a person who crafts security exploits, this looks like an opportunity to add “cloaked” JSON elements which are processed by some JSON parsers and ignored by other JSON parsers. That kind of difference in interpretation can be the beginnings of a security exploit.
It’s not just theoretical. I skimmed the code – the linked parsers seem to have exploitable differences in how they handle line endings. In addition to \r and \n, JSON5 supports \u2028 and \u2029 to end lines; dkjson.lua does not. Therefore:
will be interpreted differently by the two parsers. Swift Foundation implements a 3rd behavior; it assumes ‘\n’ and ‘\r\n’ are the only valid line endings, so:

{
// maybe a 2-line comment \r "user_id": 2
}

will be a 1-entry object when parsed by JSON5, dkjson.lua, and SQLite, but empty when parsed by Swift.
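To make this concrete, here is a small sketch (mine, not from the thread) of two naive “//”-comment strippers that differ only in which characters they treat as line terminators; a comment containing a bare CR produces different documents:

```python
import json

def strip_line_comments(text, terminators):
    """Naive pre-parser: delete from '//' up to the next line terminator.

    (Ignores string literals for brevity; a real stripper must not
    touch '//' inside strings.)
    """
    out, i = [], 0
    while i < len(text):
        if text.startswith('//', i):
            i += 2
            while i < len(text) and text[i] not in terminators:
                i += 1  # skip the comment body
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

# A '//' comment containing a bare CR (U+000D) not followed by LF:
payload = '{\n// maybe a 2-line comment \r "user_id": 2\n}'

# JSON5-style: CR, LF, U+2028 and U+2029 all terminate a line comment.
json5_like = strip_line_comments(payload, {'\n', '\r', '\u2028', '\u2029'})
# Swift-Foundation-style: only LF terminates (a bare CR does not).
swift_like = strip_line_comments(payload, {'\n'})

print(json.loads(json5_like))  # {'user_id': 2}
print(json.loads(swift_like))  # {}
```

The same payload is a one-entry object under the first rule and an empty object under the second, which is exactly the kind of disagreement an attacker can hide data in.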
I couldn’t be bothered to dig through the Jackson code.
I think there are 3 different use cases for JSON here:
_comment: "my comment"), since you never know if the other implementations will have similar behavior.

This also depends on the context. For the third use case, in many cases, if someone has the privilege of setting your configuration file, you have already lost.
Using comments in your JSON doesn’t mean including comments in spec-compliant JSON!
The gift the JSON spec is giving you is that by not allowing comments it’s telling you what to do with the comments: throw them in the garbage in some step before they get to the real JSON parser. In this way, many kinds of comments can be supported (properly, without vulnerability) because once they’re gone they really can’t possibly mean anything (solving the security problem).
JSON is a very forgiving language syntactically, so %, #, //, /* */, and <!-- --> are all equally valid ways to add unambiguous comments to your JSON. That is to say, since these syntaxes don’t conflict with JSON’s own syntax, there’s no chance of a real confusion where a comment is interpreted as valid data or vice versa. The fact that JSON absolutely rejects comments and other unexpected syntax is, paradoxically, what means that it works with every flavor of comments, so long as you can define what it means to strip them out.

I think you missed the point, which was that anyone implementing such a lax parser, whether that’s via a “stripping function” or a parser which skips them natively, will do it in a slightly different way, and this can cause conflicts which may be exploited. This would simply add more problems to the existing security issues JSON already has, like where parsers disagree on what to do with non-unique keys.
Can you provide a concrete example? In my mind the vulnerability is completely addressed by a stripping preparser (in a way that it is very much not addressed by a lax parser)
orib’s example of existing JSON5 parsers parsing comments differently could just as easily happen with multiple comment-stripping JSON pre-parsers. The pre-parsers could disagree on whether to strip text after a // and after one of \u2028 or \r but before \n.

I see, yeah, that could happen. It would certainly be less of a risk if the pre-parser was highly reusable, even across different languages.
To be specific the reason the problem goes away is that after the stripping parser is done the 100% formal very standard JSON parser takes over, bringing with it the exact same set of security guarantees that we have currently
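To be concrete, a stripping pre-parser along these lines (a minimal sketch of my own, handling only // comments, and careful not to strip inside string literals, which is where naive regex approaches go wrong) leaves all the real parsing to an ordinary, spec-compliant JSON parser:

```python
import json

def strip_comments(src):
    """Remove '//'-to-end-of-line comments, but never inside strings."""
    out, in_string, i = [], False, 0
    while i < len(src):
        c = src[i]
        if in_string:
            out.append(c)
            if c == '\\' and i + 1 < len(src):
                out.append(src[i + 1])  # keep the escaped char, incl. \"
                i += 1
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
            out.append(c)
        elif src.startswith('//', i):
            while i < len(src) and src[i] != '\n':
                i += 1  # drop the comment, keep the newline
            continue
        else:
            out.append(c)
        i += 1
    return ''.join(out)

doc = '{ "url": "http://example.com", // not a comment\n  "n": 1 }'
print(json.loads(strip_comments(doc)))  # {'url': 'http://example.com', 'n': 1}
```

Because the stripped text is handed to a stock JSON parser, all the usual guarantees (and rejections) of that parser still apply; the only new attack surface is the stripping function itself.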
This is kind of pointless. Standards aren’t about authority figures telling you what to do; they’re for interoperability and concise naming. No one is saying you can’t invent or use a “JSON + comments” syntax, it just won’t be JSON anymore. I don’t think a JSON parser should accept these extensions by default, since it’s then not parsing JSON. You can get into big trouble (up to and including kernel exploits) when your system has multiple parsers for the same syntax that differ in what they allow.
We already have a couple of these, notably JSON5. I’m not sure why you wouldn’t just adopt that: it adds several other JS features that make config files and other human authoring nicer, like optional trailing commas, single quotes, and unquoted object keys.
I agree.
Honestly, if you just want a more comfortable JSON and you control the parser, why not use toml or yaml or any other text data format that was designed from the ground up to be comfortable to write and read?
A standard’s usefulness is proportional to its goodness. I believe not having comments was a mistake, which reduces goodness.
This is an aphorism, not an argument.
Come on, he was responding to a semantic argument (json+comments is not json). I’m thankful he didn’t continue to argue semantics.
An aphorism in the hand is worth two arguments in the bush.
Perhaps. Why point that out?
I think what you said is too simplistic to be useful as an argument in an engineering context. Neither usefulness nor goodness is a single axis. We aren’t Descartes, trying to derive the universe’s mechanics from first principles; we’re trying to build systems that do a particular thing, for a particular person, to a certain end. The work is all about tradeoffs. But your comment essentially just said “having comments would be better,” without any justification or any examination of the ways in which having comments might actually be worse.
I think you’re right.
On that note, if I remember correctly, Crockford removed comments as he envisioned them being used for parser directives.
[Comment removed by author]
I think you hit the nail on the head here.
Then by definition it’s not JSON, but a superset such as JSON5 or VS Code’s JSON with comments syntax, because the closest we have to an official JSON specification doesn’t specify comments are supported.
While adding comments to JSON would be nice, it’s by far the least annoying part of dealing with the JSON syntax and writing a parser for it. If you also want to solve those issues, you’ll have to make certain decisions that ultimately result in something very different from JSON.
You, along with several others in this comment section, have completely missed the point of the article. It is pretty clear (at least, imo) that the author is not arguing that adding comments to JSON is conformant to the spec. They are just saying what you do doesn’t have to conform to the spec (crucially) if it isn’t necessary.
What if my dad is RFC 8259?
You write your JSON with m4, have your comments on lines starting with dnl, and pre-parse it with m4.

It is absolutely possible for something bad to happen if you start accepting comments in JSON. Without a spec, different parsers could define different parsing behaviour and parse the same input differently.
This can then lead to security vulnerabilities when two systems interpret the same payload differently. More details
Off the top of my head, here’s an example
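The example itself didn’t survive on this page; a payload along these lines (my reconstruction, not the original, but consistent with the description that follows) would show the ambiguity:

```json
{ "user_id": 1 // /*
, "user_id": 2
// */
}
```

A parser that honors only // comments strips both comment tails and, with last-key-wins duplicate handling, sees user_id = 2. A parser that recognizes the /* ... */ block (without letting the preceding // cancel its opener) comments out the second key and sees user_id = 1.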
A parser supporting single line comments only may read the value of user_id to be 2 whereas a parser supporting both single and multiline comments may interpret user_id to be 1.
You may argue that the single line comments should cancel out the multiline comment, but that is an arbitrary decision you would need to have a spec clarify.
It’s not arbitrary. There is only one defensible way to treat comments in Javascript Object Notation and that is the same way they are treated in Javascript.
I don’t think you will be able to get multiple parsers to agree on all the subtleties here without a spec and or associated test suite. Doug-moen above mentions a more nuanced case
What does the ECMA standard say in this example? Then that’s the answer.
Nobody has specified this in the context of JSON. Parser implementors are just told “here’s some JSON with comments, you need to parse this now”. People tend to be relatively lazy about reading specs at the best of times, and “JSON is kind of like Javascript, so you should think to read the ECMA spec” isn’t the best of times.
That’s going to lead to subtle differences in behavior.
And, there already are. From a 5 minute glance at the linked JSON libraries, the Javascript JSON5 accepts \u2028 and \u2029 to terminate a multi-line comment while dkjson.lua does not, and only accepts \r and \n. Swift Foundation’s parser implements yet another behavior and only recognizes ‘\n’, assuming that ‘\r’ on its own does not terminate a line.
I think you’re missing the point. From @technomancy above:
That’s the answer. Will people supporting comments in JSON as an extension do it wrong because they didn’t read the spec? Yes. Are they wrong? Also, yes. When they are wrong you tell them they are wrong, kindly open a PR to support it (as suggested on the original page). Nothing bad will happen.
ECMAScript 2025 § 12.3 Line Terminators makes it clear that a CR not followed by LF is itself considered a line terminator.
There you go. That’s the answer for comments in JSON. “Follow what JavaScript does” works.
I’m inclined to agree!
The more I think about this, the more I don’t think it’s a problem; the argument is predicated on saying that it’s reasonable to have one system parse the document, validate it, throw away the parsed version, and then pass the original on to a second system. This fundamentally does not make sense; the second system should operate on the result of the first system as the first system interpreted it, not the original data.
It’s only by making the mistake of passing the original data on in the first place that you can have a situation where differences in the two parsers can cause security problems. Instead of insisting both parsers behave exactly the same way, the much simpler solution is to stop passing thru the original data.
That works if you get to design the whole system and ensure nothing else ever touches your data, but users and requirements often mean multiple systems will work on the same data. I think the concern is, you might say “sure I’m going to read and write my JSON and I’ll support comments like this”, but in reality if you’re writing something like JSON somewhere, there’s a very good chance someone else will come along and decide to
JSON.parse it in their language of choice, completely separate from you.

This seems like a very reasonable opinion to a lay person (me). Are there any subtleties in the ECMAScript standard that it misses?
The subtlety is that you need people independently to
Cross-implementation compatibility is already a problem for JSON without these extensions, see eg JSON Parsing is a Minefield. I’ve also personally felt this pain while trying to migrate between two JSON implementations for the same language.
One could argue that’s no longer JSON, that’s a superset of JSON. A relatively well defined superset but still.
What is the format of comments in JSON that reasonable parsers accept?
Yeah exactly, this page would be more interesting if it actually said that
The hard part is the agreement, not coming up with something
It never occurred to me that anyone might think there was any ambiguity that the comments for JavaScript Object Notation should be the same as the comments for JavaScript.
YAML and JSON5 are both supersets of JSON but they disagree on the commenting style (# vs //).
Depending on what the author accepts as “reasonable” they could list either “All YAML parsers” or “All JSON5 parsers” in “Reasonable Parsers”.
There are a lot of adjectives you could use to describe YAML parsers, but “reasonable” would not be near the top of the list for me. But yes, any json5 parser would qualify for the “reasonable” list! I didn’t list any because I have never used one, but I’m open to suggestions.
Apple’s JSON decoder has a json5 mode. SQLite supports json5 as of last year.
I propose an experiment: tell 5 programmers to implement “same comments as JavaScript”, and see what you get :)
Personally, I think Crockford did a great job with JSON – it’s WAY better, and more precisely specified, than say CSV, and other ad hoc formats.
He wrote it down, precisely and legibly, and people agreed on it. There are like 100 implementations on https://json.org/
That is hard. Agreement is hard in 2 senses
and JSON largely achieved agreement.
It’s not perfect, and I even have my own extension to JSON – to fix the JSON-Unix mismatch, i.e. allow all filenames and other strings from the kernel to be represented
But I still recognize and respect exactly what JSON is
I also agree with other commenters – you can use JSON5 or JSONC if that’s what you want, and use a different file extension / MIME type. The battle has already been “lost”
@doug-moen and @snej kinda said the same thing as me
I don’t see this as something worth fighting now, and I don’t think it works in practice
Also see the recent thread about Postel’s law – writing comments as JSON doesn’t respect Postel’s law, because you’re not being conservative in what you emit
https://lobste.rs/s/tsz4c9/harmful_consequences_robustness
And BTW a long time ago I wrote comments in a JSON file, and the world didn’t end, for sure. No RFC police came and arrested me.
I simply removed the comments first in the code that consumed the file. I probably used sed, to be honest.

But I didn’t advocate that JSON be changed, or that JSON parsers be changed. So I agree with you in one sense, that people shouldn’t worry about small things, but I also think you’re going overboard, when solutions already exist.
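That kind of pre-pass can be as crude as a one-liner (a sketch; note that it also mangles any // that appears inside a string, such as a URL, so it only works on files you control):

```shell
# Write a JSON file containing a "//" comment, for demonstration.
printf '{ "n": 1 // a comment\n}\n' > /tmp/config.json

# Strip everything from "//" to end of line before a real JSON parser sees it.
sed 's|//.*$||' /tmp/config.json
```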
The linked parsers already have divergences in how they parse comments.
The first thing that came to my mind is to implement // and nothing else.
So the first thing that came to mind is to make comments impossible to round-trip without also forcing pretty printing on the serialiser?
Yes. My imagination is that they would be just stripped.
I prefer the following if I need to put comments into a JSON that will be read by a relatively strict parser:
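The snippet is missing from this page; presumably it is the familiar comment-key convention (my guess, echoing the _comment keys mentioned upthread), something like:

```json
{
  "_comment": "bump this if the workers keep timing out",
  "timeout_seconds": 30
}
```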
This, of course, will not work if the system not only uses a strict parser but also does a strict schema checking, disallowing unrecognized comment keys. Though I have encountered more often systems that have a strict JSON parser than a strict schema checking.
While I agree with the sentiment, that Crockford was absolutely wrong on this subject if not also many others, I feel that:
I didn’t really want to get into in on the site because I wanted to keep it punchy, but the reason JSON is typically not used for config files is specifically that it doesn’t have comments, which pushes people towards using YAML, which leads to anger, which leads to the dark side, etc.
If this problem didn’t exist (if parser authors just made them tolerant of comments in defiance of the spec) then I think you’d see a lot less use for YAML, which would be a wonderful situation with Norway downsides.
(This doesn’t really have anything to do with machine-to-machine JSON.)
Nope! JSON config files will almost always only be read by one parser. So each individual program can switch away from using YAML to using JSON, and each one that does is a step in the right direction regardless of how many others do it.
First, yes, YAML is atrocious; friends don’t let friends, etc.
This part I’m not especially sure about. I think if you look at where people have pushed into YAML particularly hard, one of the main features they’re actually looking for is the ability to reproduce a long multi-line string without weird quoting. In many cases this is used to insert, say, a shell script for a stage in a build pipeline into a CI configuration document. There are no particularly good options in JSON for this. There are also often no ways to reliably specify, say, a 64-bit unsigned integer. Objects also don’t have a type name, like nodes in an XML document, so you have to work around this by having a “type” field if you need any particular structure.
There’s just a lot about JSON that is not good for, specifically, configuration files.
If that’s true, then you’ve nullified really the only advantage JSON has over something else: that it can be read, edited, and formatted, by other tools like jq, or in most scripting languages and configuration management systems, etc!
If you’re going to improve the situation, you may as well go all the way! Drop the need for the death by a thousand double quotes, allow trailing commas, support proper dedented multiline strings, etc. Something like UCL (Universal Configuration Language) has a direct translation to a JSON data model internally if you want a relatively minimal change. Or, though it’s not my cup of tea, HashiCorp configuration language (HCL) is similarly JSON-equivalent but with more ergonomics.
Personally I think KDL is a big win for configuration files, especially with the ergonomic improvements that are coming in the nearly released KDL 2. The model is a lot more like XML, with the ability to have typed nodes and a proper understanding of being a document format rather than just a serialised JavaScript object – but without the soul crushing syntactic misdirection of XML itself.
Who’s still using YAML 1.1 or prior when YAML 1.2 came out in TWO THOUSAND NINE? This hasn’t been a real problem in FIFTEEN YEARS, but always gets brought up to bash on YAML. Pick a real complaint, there’s plenty.
JSON is a serialization target. It’s an awful config format.
Python’s most popular YAML parser, PyYAML, “features a complete YAML 1.1 parser”.
When I search for how to parse YAML in Java, I find recommendations for either SnakeYAML, a YAML 1.1 parser, or Jackson. Jackson’s master branch is working to integrate a YAML 1.2 parser for v3, but the latest released version of Jackson is still v2, which uses SnakeYAML.
When I parse YAML using the Ruby standard library:
When I search PHP’s docs for “YAML”, I find
yaml_parse. A comment says that the function uses the YAML 1.1 standard.

So it looks like most programming languages that existed before YAML 1.2 was released have still not migrated. Most users of these languages are parsing YAML with the 1.1 standard. For them, warnings about problems with YAML 1.1 are still relevant.
Excellent to list popular implementations here. I had a look at what the yaml egg for CHICKEN uses. It uses the libyaml C library, which looks like it is the one that PyYAML is based on or at least can optionally use when speed matters. It looks like they still have an open issue for tracking YAML 1.2 support. Since this is a C library with quite the official-sounding name, it is very likely other languages have bindings for libyaml as well.
Momentum is hard to kill. People should have moved to ruamel-yaml ages ago. Popularity does not mean something is correct. SnakeYAML also has a correct 1.2 parser in its package.

lol!
I don’t really find the lack of comments to be a problem. If you really need one, just add a string value with a key of “comment” or “note” or something.
I’ve heard this suggestion a lot over the years, and I’m generally opposed to it for a few reasons:
jq.

I don’t have very strong feelings about this. I tend to like JSON because, in most of the places where JSON is an option, the alternatives offered are YAML or XML. I don’t find JSON’s lack of comments sufficiently painful to make YAML or XML preferable.
However, if something better is available, I’m not going to fight hard for JSON over, say, TOML or Dhall or something.
I see your reasoning, and while I’d love something better than JSON, I’ll challenge that this is an area where it’s easy to make the perfect the enemy of the good.
“JSON with one or two small changes, standardized” feels like an accessible goal to see the industry adopt en masse. It’s forward compatible and requires the least code churn and retraining, and any actively-maintained project will eventually support receiving JSON++ files just by updating its dependencies or runtime.
Trying to anoint a JSON successor (or worse, having a war of succession) has a high risk of people deciding the successor isn’t better enough to be worth the effort of migrating. Especially in cases where that means getting third parties on board with your project’s change.
You might like JWCC: JSON With Commas and Comments.
https://nigeltao.github.io/blog/2021/json-with-commas-comments.html
I like how there are no supported comments in JSON for a few reasons:
#noqa or #ignore to mark the JSON data for processing, which I heard was the stated reason for removing comments from the original JSON spec.

I am a fan of comments in programming languages in general, but I’m glad JSON was kept simpler without them. The lack of trailing commas in arrays, on the other hand…
How would you comment individual array elements?
Even if you only need to comment key–value pairs, your suggestion relies on the convention that the consuming program will quietly ignore any keys it doesn’t recognize. If that isn’t true, then you can’t use this technique for comments. (And even if it is true, what happens when a new version of the application adds a new field to the configuration and starts assigning meaning to what was supposed to be a comment?)
Yes. A JSON “comment” relies on the comment being parsed with the rest of the data, which means it can impact the rest of the program. I’m arguing that this is good. It means an application that requires comments has to choose a data structure and handle that complexity instead of universally putting that complexity into the parser. What is gained is the lack of a universally independent way to access comments, for example, when reading a JSON object as a Python dictionary.
I do not have a good way to do this. In practice, we pick another place for comments, such as Git commits, a README file, or break it up over “self documenting” object key names themselves. This last one admittedly only works when we own the schema.
The first point already isn’t really true - I’ve tried to maintain structure in a package.json file before, without comments, by ordering the data in a certain way, or by inserting newlines to make things easier to read. As soon as the NPM parser reads that in and writes it out again, that structure gets lost. So even with the current minimal tools that we have, naive JSON parsing causes things to get lost in translation.

Moreover, it’s not necessarily impossible to fix this. For most situations where you’re just passing data around, a naive parser is sufficient. But when you’re trying to parse a config file, you can switch to a parser that internally records the structure of the file, comments, indentations, whitespace and all, and makes minimal modifications to that structure when writing the data out again. This is, for example, what Rust’s Cargo does with TOML files. It requires a more complicated parser, but it works well.
The second point is theoretically possible, but I’ve never really seen it come to be, not even in YAML which has comments and has been abused to essentially become a full programming language in many situations. Most tools either use normal YAML syntax to essentially represent an AST, or they use a templating language to generate YAML, completely bypassing the comments. In neither case do comments get used as directives.
The third point, as others have pointed out, is only possible if the parser chooses to accept arbitrary key-value pairs and ignore them, and even then can only comment certain nodes, only in single-line structures, and incredibly uneconomically. It is surely the worst of all options - it has exactly the same theoretical downsides as the two previous points you made, but it’s uglier, more difficult to write, and more brittle.
I am not entirely convinced by the original article - as others have pointed out, there are already security risks associated with the different approaches taken by different parsers, and adding more variation feels like asking for trouble. But I am very convinced that the three arguments you’ve given (which come up every time this topic comes up) are brought up more out of rote habit than because they are good arguments.
Going back to a discussion from a few days ago, obviously the correct thing to do is just put comments and trailing commas in your JSON and when someone rejects it demand they rewrite their parser to obey Postel’s Law (which I’m told is a universal good, and which JSON notably explicitly refuses to follow).
Don’t forget to include a trailing newline. It’ll never cause problems.
http://www.flightlab.com/~joe/sgml/faq-not.txt
There are a lot of comments here that are nitpicking but in my experience this is totally correct. Turborepo, for example, decided to use comments in their config .json file and it’s fine. The world didn’t end. The RFC police didn’t arrest them. There were no casualties.
As one of the nit-pickers, I’ll say I also agree with that point. What is more iffy is
For the security risks mentioned elsewhere.
As far as I can tell, the security risks are around “what if someone implements this incorrectly, then two different parsers could disagree about the presence of a key”. I’m not going to remove that wording, but I will add a footnote mentioning the potential downsides that apply outside the context of config files if you do it wrong.
Thanks!
I’m old enough to remember when different JS engines tokenized comments differently, which we used to do browser detection.
my dad is xkcd 927
It’s not often that I literally laugh out loud while reading a technical post, but this time I truly did. I also do love that this is at the bottom of the page 😆
My view is, if we’re throwing out support for interoperating with embedded/retro-hobby JSON encoder/decoder X which implements RFC 8259/ECMA-404/ISO 21778:2017, and our goal is to make configuration files, we might as well use TOML instead.
It’s got date/time/timestamp literals, it’s been supported by the Python standard library since 3.11, it reads like a richer version of .ini files, and the Rust toolchain uses it, so it’s “hot new thing” compliant too.

Counterpoint, TOML has horrible support for nested objects, which makes it a poor fit for many JSON use cases.
But I specifically constrained what I said to “and our goal is to make configuration files”.
That’s not “many json use cases” and it adds a design constraint that means that, even if TOML’s not the best for nesting, the overall experience often still makes it more comfortable. (To the point where, when I’m configuring things, if the degree of nesting feels uncomfortable in TOML, I treat that as a code smell and a sign to take a second look at how I’m structuring my schema.)
Ah, I missed that you narrowed your comments. I still think nested object support is worthwhile as it handles namespaces once instead of for every subkey, but there is less potential for overlap.
That’s fair.
I’m kind of torn on it. On the one hand, using curly braces in a config language is always going to be a drag on convenience but, on the other hand, even if YAML weren’t an overcomplicated design with too much room for edge cases to hide, I still trip over the same footguns in Ansible’s config files that I did in CoffeeScript, and do when doing outline lists in Markdown and reStructuredText.
Indentation-based markup without even Python’s block-starting colons to provide an “ISBN check digit” of sorts is just too error-prone.
I think, if I were designing a better configuration language, I’d probably go with something that resembles TOML, but with some kind of block-structuring extension analogous to what SASS’s nesting support does with CSS to allow brace-delimited nesting to be comfy without losing the comfort on the non-nested parts.
This sounds overly complex. Why don’t we just write a FUSE module that will strip comments away on read - no need to edit any parser.
I can’t tell if this is /j, if you’ve never written a FUSE filesystem for production use, or if Douglas Crockford is your dad.
recurring phenomenon - “this belongs in a filesystem!”, “can’t we just write a FUSE module‽”, some coding and screaming later “why is this even a thing?” abort | retry | fail.
Yep :D I used to say things like that. Then I worked on the VFS for Git Linux port. I no longer say things like that.
I have the entirety of my desktop exposed as a FUSE filesystem still, it makes for interesting fuzzing but also a harrowing learning experience.
No, just don’t put weird shit into your JSON. JSON is not for human editing, it’s terrible for human editing.
Phil worked for Heroku back in the day. A neat trick that we do: YAML is a superset of JSON and it supports comments. So we parse as YAML then as JSON (IIRC, I don’t maintain a tool that directly interacts with customer JSON).
The meta is interesting: lots of people commenting for or against it. This feels like prescriptive versus descriptive linguistics. The standard says “Ain’t ain’t a word” but the people do. Hopefully standards converge to match consensus just as Webster now has an entry for “Ain’t” but taking an absolutist position on either side is likely not the answer.
I feel like OP has a pragmatic view of “here’s what I want to see more of, let’s talk about it and promote it, even if it’s not technically Alex compliant.” Everyone arguing with the title cannot see the forest for the trees. He’s not literally arguing against DC, he’s advocating for something he wants to see in the world. It’s a technical “yes, and” and I like it.
I’m in. Society / de-facto standards are the real standards. Douglas ain’t my dad. Game theory.
But do you know about https://hjson.github.io ?
If I wanted to add comments (and trailing commas, and other goodies), I would choose a subset of ast.literal_eval() from Python 3, instead of a superset of json.

A big advantage, in my point of view, is that you are one step ahead of writing a Python program, which looks necessary if you start writing comments in your JSON file.
[Comment removed by moderator pushcx: A one-word dismissal/mock surprise doesn't start a conversation. Maybe https://www.recurse.com/social-rules#no-feigning-surprise is a good explanation.]
[Comment removed by moderator pushcx: Pruning empty thread.]