This is an excellent resource! I worked on a feed reader from 2003 to 2007, and broken feeds were a constant annoyance. A lot of this seemed to be caused by generating the feed with the same template engine as the HTML, without accounting for the fact that it’s supposed to be XML.
I hope the situation is better now, but the major mistakes I saw then were:
Invalid XML, probably caused by sloppy code generating it. People get used to sloppy HTML because browsers are forgiving, but XML is stricter and doesn’t allow improper nesting or unquoted attributes.
HTML content embedded unescaped in the XML. This can be done legally, but then the content has to be valid XHTML or it breaks the feed. If in doubt, wrap your HTML in a CDATA section.
Incorrectly escaped text. It’s not hard to XML-escape text, but people managed to screw it up. Get it wrong one way and it breaks the XML; double-escape and users see garbage like “&quot;” in titles and content. (A sketch after the list shows getting this right by construction.)
Bad text encoding. Not declaring the encoding and making us guess! Declaring one encoding but using another! An especially “fun” prank was to use UTF-8 for most of the feed but serve the article content in something else like ISO-8859-1.
Badly formatted dates. This was a whole subcategory: the wrong date format, localized month names, a missing time zone, or other more creative mistakes.
Not giving entries stable unique IDs (UUIDs, for example) and then changing the article URLs, so readers that fall back on the URL as the identity see every article as new. That caused lots of complaints like “the reader marked all the articles unread again!”
Serving the feed as dynamic content without a Last-Modified or ETag header. Not technically a mistake, but it hurts performance on both sides: extra bandwidth, plus the time to generate and parse. (The conditional-GET side of this is also sketched after the list.)
Fortunately you can detect nearly all these by running the feed through a validator. Do this any time you edit your generator code/template.
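To make a few of these concrete (the escaping, date, and stable-ID points), here is a minimal sketch — not taken from the article — of building an Atom entry with a real XML library instead of a string template. The domain, URL, and tag URI are made up:

```python
# Minimal sketch: let an XML library handle quoting, escaping, and nesting
# instead of a string template. Names and URLs below are hypothetical.
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize with Atom as the default namespace

def atom_entry(title, url, html_body, updated, entry_id):
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}link", href=url)
    # The id is minted once when the post is published and never changes,
    # even if the URL later does.
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    # RFC 3339 date with an explicit offset: no localized month names, no missing zone.
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = (
        updated.astimezone(timezone.utc).isoformat()
    )
    # type="html" means the content is escaped HTML; the library escapes it exactly once.
    ET.SubElement(entry, f"{{{ATOM}}}content", type="html").text = html_body
    return entry

e = atom_entry(
    "Hello & welcome",                      # the ampersand comes out as &amp; for free
    "https://example.org/posts/hello",
    "<p>Some <b>HTML</b> content</p>",
    datetime.now(timezone.utc),
    "tag:example.org,2024:hello",
)
print(ET.tostring(e, encoding="unicode"))
```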
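And for the Last-Modified/ETag point, a rough sketch of the reader side of a conditional GET, assuming the server actually sends those headers (the URL is made up). On a 304 there is nothing to download or re-parse:

```python
# Client side of a conditional GET. Pass back the ETag/Last-Modified values
# the server sent last time; a 304 means the cached copy is still current.
import urllib.error
import urllib.request

def fetch_feed(url, etag=None, last_modified=None):
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified   # unchanged: skip parsing entirely
        raise

# usage: body, etag, last_modified = fetch_feed("https://example.org/feed.atom")
```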
For anyone wanting to write a feed reader: you’ll definitely want something like libTidy, which can take “tag soup” and turn it into squeaky-clean markup. Obviously important for the XML, but also for article HTML if you plan on embedding it inside a web page; otherwise errors like missing close tags can destroy the formatting of the enclosing page. LibTidy also improves security by stripping potentially dangerous stuff like scripts.
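For example, assuming the pytidylib bindings to HTML Tidy, the cleanup step might look roughly like this (the input string is just an illustration):

```python
# Sketch assuming the pytidylib bindings to HTML Tidy (pip install pytidylib).
from tidylib import tidy_document

tag_soup = "<p>unclosed paragraph <b>bold <i>nested wrong</b></i>"

# Ask Tidy for well-formed XHTML with numeric entities, suitable for embedding in XML.
cleaned, errors = tidy_document(
    tag_soup,
    options={"output-xhtml": 1, "numeric-entities": 1, "show-body-only": 1},
)
print(cleaned)   # balanced, properly nested markup
print(errors)    # what Tidy had to repair
```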
The one thing in this article I disagree with is the suggestion to use CSS to style article content. It’s bad aesthetically because your articles will often be shown next to articles from other feeds, and if every article has its own fonts and colors it looks like a mess. Also, I think most readers will just strip all CSS (we did) because there are terrible namespace problems when mixing unrelated style sheets on the same page.
PS: For anyone doing research on historical tech flame wars, out-of-control bikeshedding, and worst-case scenarios of open data format design — the “feed wars” of the early/mid Oughts are something to look at. Someone (Mark Pilgrim?) once identified no less than eleven different incompatible versions of RSS, some of which didn’t even have their own version numbers because D*ve W*ner used to like to make changes to the RSS 2.0 “spec” (and I use that term loosely) without bumping the version.
I have unsubscribed from certain blogs because of this. It’s no fun when they keep “posting” the last 10 articles all the time…
It drives me mad when I occasionally update my feeds and suddenly have tens, or even hundreds (!), of “new” articles.
Doesn’t happen often enough that I’d want to delete the feed, but still very annoying.
“Someone (Mark Pilgrim?) once identified no less than eleven different incompatible versions of RSS”
I suspect this is because of intense commitment to the robustness principle (Postel’s Law). Tim Bray rebutted Dave Winer and Aaron Swartz’s frankly goofy devotion to this idea. I think it’s better to follow Bray’s advice.
Actually it was Pilgrim and Aaron Swartz he was rebutting in that blog post, not Winer.
And the 11-different-versions madness had nothing to do with liberal parsers, but with custody battles, shoehorning very different formats under the same name (RDF vs non-RDF), Winer’s allergy to writing a clear detailed spec or at least versioning his changes to it, and various other people’s ego trips.
In my experience, writing a liberal parser was a necessity because large and important feed publishers were among those serving broken feeds, and when your client breaks on a feed, users blame you, said users including your employer’s marketing department. Web browsers have always been pretty liberal for this reason.
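As a concrete example of the liberal approach, Mark Pilgrim’s Universal Feed Parser (the Python feedparser module) keeps going on broken feeds and just flags them; a rough sketch, with a made-up URL:

```python
# Sketch using feedparser (the Universal Feed Parser); the URL is hypothetical.
import feedparser

d = feedparser.parse("https://example.org/feed.atom")

if d.bozo:
    # The feed was not well-formed; feedparser recovered what it could
    # instead of refusing to show the user anything.
    print("warning, ill-formed feed:", d.bozo_exception)

for entry in d.entries:
    print(entry.get("id"), "|", entry.get("title"), "|", entry.get("link"))
```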
Oh, right. Typed the wrong name there. Not gonna go back and edit it, though.
There’s a good alternative to UUIDs: tag URIs. Their main benefit over UUIDs is that they’re human-readable.
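For anyone who hasn’t seen them, a quick sketch of what a tag URI (RFC 4151) looks like next to a UUID URN; the domain and path layout are hypothetical:

```python
# Stable entry IDs: a tag URI (RFC 4151) vs. a UUID URN. Domain and slug are made up.
import uuid
from datetime import date

def tag_uri(authority, minted, slug):
    # "tag:" authority "," date ":" specific -- readable and still globally unique,
    # as long as you owned the domain on that date and never reuse the specific part.
    return f"tag:{authority},{minted.isoformat()}:/posts/{slug}"

print(tag_uri("example.org", date(2024, 5, 1), "why-atom"))
# -> tag:example.org,2024-05-01:/posts/why-atom

print(uuid.uuid4().urn)
# -> urn:uuid:0f0e5c5e-...  (unique, but tells a human nothing)
```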
I remember the feed wars! Winer’s petulance caused so much damage. I haven’t used anything but Atom since then for anything I publish, and I advise people to give the various flavours of RSS a wide berth.
While Atom is a better choice these days, RSS1 is pretty well supported and a better design than the insanity of XML misunderstanding/abuse that is RSS2.
Yeah, there doesn’t seem to be any downside to Atom, whereas RSS is just… I don’t know how to say it, but if your primary goal is to serve a feed of web pages, and there’s not a well-specified and sane way to include HTML, there’s something wrong with your spec.
I only implemented Atom for my blog, and would recommend the same to others.
What do you dislike about RSS 2.0? I just implemented it here, and it was no harder than what I remember from RSS 1.0…
RSS2 doesn’t even have a namespace so it doesn’t mix with other XML well. It also has no officially specified way to include HTML content, with several competing extensions for it, all bad.
I understand. Yes, those criticisms are quite true. Thanks for sharing :-) I’m offering both RSS and Atom on my blog, so I guess each user will pick whichever they prefer.
This was a really nice article. I’ll be following suit with an Atom feed soon.
Interesting article. I almost skipped over it because the name mentions a legacy technology (much as if an article about TLS had talked about “SSL certificates”).
Does anybody know if broadcasting Atom feeds has any impact on Search Engine Optimization?
RSS seems to be the colloquial term for feeds in general. And “RSS Feeds” gets people thinking about the right thing whereas “Feeds” is a little too generic.
Google, at least, will subscribe to your feed and treat it similarly to a sitemap. I don’t know whether this directly affects your ranking; I’d assume not, or at least not much.