Note: Adding the lang attribute is particularly important for assistive technology users. For instance, screen readers alter their voice and pronunciation based on the language attribute.
I’m sorry, it is the year 2023. It is trivial to identify the language of a paragraph of text, and, if you fail and just use the default voice, any screen reader user will be either a) as confused as I would be, reading a language I clearly don’t understand, or b) able to determine that they are getting German with a bad Spanish accent, assuming they speak both languages. Please, please, please, accessibility “experts”, stop asking literally millions of people to do work on every one of their pieces of content, when the work can be done trivially, automatically.
These are heuristics, and not always correct. Especially for shorter phrases, it is very possible that the text is valid in multiple languages. Of course it is good that these heuristics exist, but it seems best to also provide more concrete info.
The ideal situation is probably both. Treat the HTML tags as a strong signal. If there is lots of text and your heuristics are fairly certain that the tag is wrong, consider overriding it; if the text is short or you aren’t sure, go with what the tag says.
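To make that policy concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how any real screen reader works: the `Guess` type, the `pick_language` name, and both thresholds are made up for the example, and the detector that would produce a `Guess` is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Guess:
    """Output of some language detector (hypothetical, not a real library type)."""
    lang: str          # primary language subtag, e.g. "de"
    confidence: float  # 0.0 .. 1.0


def pick_language(declared: Optional[str], guess: Optional[Guess], text: str,
                  min_chars: int = 200, min_confidence: float = 0.95) -> str:
    """Prefer the declared lang attribute; override it only for long text
    with a confident, conflicting guess. Thresholds are arbitrary assumptions."""
    if not declared or declared == "und":
        # Nothing usable was declared: use the guess, or "und" if there is none.
        return guess.lang if guess else "und"
    # Compare primary subtags only, so "de-AT" and "de" count as agreeing.
    primary = declared.split("-")[0].lower()
    if guess is None or guess.lang.lower() == primary:
        return declared
    if len(text) >= min_chars and guess.confidence >= min_confidence:
        return guess.lang
    # Short text or an unsure detector: go with what the markup says.
    return declared
```

With these defaults, a one-word string like "Ada" always keeps the declared language, while a long paragraph with a 0.99-confidence conflicting guess gets overridden.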
Makes me wonder if there is a way to indicate “I don’t know” for part of the text. For example, if I am embedding a user-submitted movie title that may be in another language, I could say that most of this site is in English, but I don’t know what language that title is, so take your best guess.
How does one indicate undetermined languages using the ISO 639 language codes?
In some situations, it may be necessary to indicate that the identity of the language used in an information object has not been determined. If the situation is that it is undetermined because there is no language content, the following identifier is provided by ISO 639-2:
zxx (No linguistic content; Not applicable)
If there is language content, but the specific language cannot be determined a special identifier is provided by ISO 639-2:
und (Undetermined)
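For the user-submitted movie title case, one way to apply the `und` code is to wrap just that fragment in its own element. The sketch below assumes a server-side template written in Python; the `wrap_unknown_language` helper and the sample title are invented for illustration, and whether a given browser or screen reader then falls back to guessing is not something this thread establishes.

```python
from html import escape


def wrap_unknown_language(text: str) -> str:
    """Wrap text in a span marked lang="und" (ISO 639-2 "Undetermined")."""
    # escape() keeps user-submitted text from being interpreted as markup.
    return f'<span lang="und">{escape(text)}</span>'


# Hypothetical usage: the page is English, the title's language is unknown.
title = "Das Boot <1981>"
print(f'<p lang="en">Now showing: {wrap_unknown_language(title)}</p>')
```

The surrounding paragraph keeps `lang="en"`, so only the title is flagged as undetermined.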
Also in fun ISO language codes: You can add -fonipa to a language code to indicate IPA transcription:
It is trivial to identify the language of a paragraph of text
It’s an AGI-hard problem…
Consider my cousin Ada. The only way a screen reader (or person) can read that sentence correctly without a <span lang=tr> is by knowing who she is.
What is possible, though far from trivial, is to apply a massive list of heuristics, which is sometimes the best option available, e.g. for user-generated content. However, when people who do have the technical knowledge to take care of these things don’t, responsible authors who mark their languages then have to work around them.
But never, in all of human history, has a letter, or book, or magazine article noted the language of your cousin’s name in obscure markup. That’s not how humans communicate, and we shouldn’t start now.
I write lang="xy" attributes. I, for one, certainly would prefer that the relatively small number of HTML authors take the small amount of care to write lang="xy" attributes, so that user agents can simply read those nine bytes, than that the much larger number of users spend the processing power to run the heuristics to identify the language (and maybe fail to guess correctly). Consider users over authors. Maybe, if one considers only screen readers, the effect shrinks away, but other user agents also care what language the text on the Web is in, some as common as Google Chrome, which identifies the language so that it can offer to translate the page with Google Translate.
I, for one, certainly would prefer that the relatively small number of HTML authors take the small amount of care to write lang="xy" attributes, so that user agents can simply read those nine bytes, than that the much larger number of users spend the processing power to run the heuristics to identify the language (and maybe fail to guess correctly).
This is the fundamental disconnect. You are not making this ask of the “relatively small number of HTML authors”. You are making this ask of literally every single person who tweets, posts to Facebook or Reddit, or sends an email. This is an ask of, essentially, every person who has ever used a computer. The content creator is the only person who knows the language they are using.
The ISO 639-2 FAQ answer quoted above comes from https://www.loc.gov/standards/iso639-2/faq.html#25.
Nice! I knew the last two, but I don’t think I’ve ever heard of the first three before. I’ll remember this translate property. I think that’s an interesting one.
I didn’t know about it either. But do any browsers do anything with it? It would be really cool if they gave you some sort of indication “this page is also available in English, [View]”. But I’ve never seen something like that. It is always hunting around for an in-page “English”, or “En” or English in the source language, or a UK flag, or a US flag.
I assumed that it skips the contents when you click the “Translate…” button on mobile Chrome (and whatever other browsers have this feature).
Sorry, I should have been more specific. I was talking about <link href="https://example.com/de" rel="alternate" hreflang="de" /> where there is a first-party translation available, not about triggering machine translation.
Oh, yeah. I have never actually made a multi-language website. So, I have no idea!
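For anyone who does end up building a multi-language site, the first-party translation links discussed above are mechanical to generate. A rough Python sketch follows; the `alternate_links` helper and the example URLs are assumptions for illustration, and it says nothing about whether browsers surface these links to users.

```python
from html import escape
from typing import Mapping


def alternate_links(translations: Mapping[str, str]) -> str:
    """Build one <link rel="alternate" hreflang="..."> element per translation.

    `translations` maps a language tag (e.g. "de") to the URL of that version.
    """
    return "\n".join(
        f'<link href="{escape(url)}" rel="alternate" hreflang="{escape(lang)}" />'
        for lang, url in translations.items()
    )


# Hypothetical usage: emit these inside <head> of every language version.
print(alternate_links({
    "de": "https://example.com/de",
    "en": "https://example.com/en",
}))
```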