Not as crazy as it seems. Some of the rules are in IDNA, some are elsewhere.
If you try to actually register one of those crazy domains, you’ll tend to be blocked because the registry has a set of rules called label generation rules that govern what codepoints can be used by whom.
There’s a set of label generation rules for the root zone, which is very elaborate and prevents anyone from registering a new TLD using, say, the Cyrillic letters that look like .com. You’re also prevented from doing that by other things, but the RZ-LGR ruleset exists and is the formal foundation for rejecting such a domain application.
There are also LGRs for each registry or each top-level domain. Each registry decides on its own what rules to use; most decide on a subset of what ICANN recommends. ICANN recommends that registries allow a subset of 39 (IIRC) scripts and suggests rules for each; most registries/TLDs allow only a few of those 39.
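To make that concrete, here’s a toy sketch of what an LGR-style repertoire check does. The real rules are published as XML tables (RFC 7940) and also carry variant and contextual rules; the codepoint sets below are invented purely for illustration:

```python
# Toy illustration of an LGR-style repertoire check. Real LGRs are XML
# documents (RFC 7940) with variant and context rules on top; these
# per-script codepoint sets are made up for the example.
ALLOWED = {
    "latin": set(range(ord("a"), ord("z") + 1))
             | set(range(ord("0"), ord("9") + 1)) | {ord("-")},
    "cyrillic": set(range(0x0430, 0x0450))
                | set(range(ord("0"), ord("9") + 1)) | {ord("-")},
}

def label_permitted(label: str, script: str) -> bool:
    """A label passes only if every codepoint is in the TLD's repertoire
    for the declared script -- codepoints from other scripts are rejected."""
    return all(ord(ch) in ALLOWED[script] for ch in label)

print(label_permitted("example", "latin"))    # True
print(label_permitted("пример", "cyrillic"))  # True
print(label_permitted("pаypal", "latin"))     # False: Cyrillic а slipped in
```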
The rules for each cover visual appearance and more. They say things like “you can’t put Arabic punctuation between Latin letters” and “Latin r is like these other things, and a domain registrant gets a unique right to all lookalike domains”. If IBM has registered “ibm” in a particular TLD, other registrants can’t use lookalike glyphs to get a lookalike domain.
When Daniel writes that “supposedly, all of those combinations can be used as IDN names and they will work”, he’s right, with two caveats: if you want to use them to mimic IBM’s domains, IBM got there first and has a unique right to “your” lookalike; and if you want to use a codepoint outside the 39 considered scripts, your attempt is blocked, since that script hasn’t yet been considered.
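That “unique right to lookalikes” amounts to a variant check: map every codepoint to a representative and reject any candidate whose mapped form collides with an existing registration. A minimal sketch, with an invented confusables table (real variant sets are defined per codepoint in each LGR):

```python
# Toy variant ("lookalike") check. The confusables table is invented
# for illustration; real LGRs define formal variant sets per codepoint.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A looks like Latin a
    "\u043e": "o",  # CYRILLIC SMALL LETTER O looks like Latin o
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
}

def canonical_form(label: str) -> str:
    """Map every codepoint to its representative, so lookalikes collide."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label)

def blocked_by_existing(candidate: str, registered: set[str]) -> bool:
    """True if the candidate is a variant of something already registered."""
    return canonical_form(candidate) in {canonical_form(r) for r in registered}

registered = {"ibm"}
print(blocked_by_existing("\u0456bm", registered))  # True: lookalike of "ibm"
print(blocked_by_existing("idm", registered))       # False: merely similar
```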
There are workarounds, of course: googIe.com with an upper-case i, ameriprisẹ.com with a speck under the e, or plain old googlesecurityteam4321423@yahoo.net. But IDN is not a weak point; it’s better defended than most.
Some TLDs don’t use LGRs. I think .ws is one of those, .tokyo maybe too. There’s also a case where a TLD registry wants to use LGRs but is caught in a web of old contracts, so the upgrade to the currently recommended rules just isn’t happening. I got a runic domain in that TLD while it’s possible (Runic will never get one of those lookalike-consideration committees, so the entire script is banned in most TLDs).
And on top of the rules registries impose, registrars can impose additional ones. Some registries have poorly thought-out codepoint whitelists (cough Verisign cough), and narrowing down the target language automatically avoids having to ask the registrant what the language is (many registries expect a language code to be sent in an extension block of the EPP create request). The less you need to ask the registrant, the better.
Tbh this seems like something browsers should be on top of: flagging URLs that switch between IDN and non-IDN, and showing you both versions of the URL.
Chrome started doing something like that in version 51; I think the others started around the same time. It doesn’t matter very much: Chrome’s ruleset mostly allows every domain you can register and blocks domains you can’t. (Of course you can take your own domain, make all kinds of subdomains and attract Chrome’s displeasure, but AFAICT no one cares about that in terms of either security or usability.)
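Showing both versions is cheap in principle, since the ASCII form is just the punycode encoding. A sketch using Python’s built-in idna codec (which implements the older IDNA 2003; the third-party idna package covers IDNA 2008, the version registries actually use):

```python
# Round-trip between the Unicode form and the ASCII ("xn--") form of a
# hostname, using the standard library's IDNA 2003 codec.
unicode_host = "bücher.example"
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)  # xn--bcher-kva.example

round_tripped = ascii_host.encode("ascii").decode("idna")
print(round_tripped)  # bücher.example
```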
The real underlying problem here is the number of systems and protocols that still, today, in 2022, are so hopelessly ASCII-oriented that the best we can do is try to find ways to translate everything into ASCII for those systems to work with.
While there certainly are things one could validly gripe about in Unicode and/or IDNA, focusing on Unicode and IDNA is unproductive. For example: much of what he lists under “Heterograph?” is ultimately traceable to things like compatibility-equivalence rules.
For the unfamiliar: Unicode distinguishes between canonical equivalence and compatibility equivalence. Two sequences of code points have canonical equivalence if they “represent the same abstract character and … when correctly displayed should always have the same visual appearance and behavior”. Compatibility equivalence is broader, doesn’t require identical visual appearance, and allows that the sequences might only be fully interchangeable in certain contexts.
Examples:
U+00E9 and U+0065 U+0301 have canonical equivalence. They both look like this: é. One is the composed single-code-point form, and the other is the decomposed form using the letter and a combining accent.
U+00BD and U+0031 U+2044 U+0032 have compatibility equivalence. The first one is ½ and the second one is 1⁄2. They both are ways of writing the fraction “one half”, but are not visually identical.
One big use of compatibility equivalence is transliterating “ASCII-ish” things into ASCII, for use by systems that don’t or won’t accept any richer text abstraction.
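Both kinds of equivalence can be checked directly with `unicodedata.normalize`: NFC applies canonical composition, NFKC additionally applies compatibility mappings, and NFKD followed by an ASCII encode is the blunt “translate it into ASCII” move:

```python
import unicodedata

# Canonical equivalence: composed é and e + combining acute accent
# normalize to the same NFC form.
assert unicodedata.normalize("NFC", "\u0065\u0301") == "\u00e9"

# Compatibility equivalence: NFC leaves ½ alone, but NFKC maps it to
# the three codepoints 1, U+2044 FRACTION SLASH, 2.
assert unicodedata.normalize("NFC", "\u00bd") == "\u00bd"
assert unicodedata.normalize("NFKC", "\u00bd") == "\u0031\u2044\u0032"

# The lossy "make it ASCII" transliteration: compatibility-decompose,
# then drop whatever still isn't ASCII.
print(unicodedata.normalize("NFKD", "café ½").encode("ascii", "ignore").decode("ascii"))
# cafe 12
```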