[UA-discuss] Cross-script Homoglyphs

Stuart Stuple stuartst at microsoft.com
Mon Apr 24 13:57:11 UTC 2017


This is an amazing effort. Thank you. Definitely going into my references category.

Is there perhaps a way for the infrastructure to limit itself to a smaller subset of Unicode based on living languages (those with active readers above a specific threshold)? That would eliminate some of the code points. Put another way, is the intent of Universal Acceptance to support all languages used in the world today, or all code points in Unicode?
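For concreteness, a rough Python sketch of what such a restriction might look like, assuming a local copy of the UCD's Scripts.txt; the file path and the MODERN_SCRIPTS list below are purely illustrative placeholders, not a recommendation:

    # Sketch: restrict a set of code points to scripts judged to be in
    # modern use. Parses Scripts.txt from the Unicode Character Database,
    # whose lines look like "0041..005A ; Latin # ..." or "00AA ; Latin # ...".
    # MODERN_SCRIPTS is a hand-picked placeholder, not a normative list.

    MODERN_SCRIPTS = {"Latin", "Cyrillic", "Greek", "Arabic", "Han",
                      "Hiragana", "Katakana", "Hangul", "Devanagari", "Thai"}

    def load_script_ranges(path="Scripts.txt"):
        ranges = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#")[0].strip()   # drop comments
                if not line:
                    continue
                cps, script = [field.strip() for field in line.split(";")]
                if ".." in cps:
                    lo, hi = cps.split("..")
                else:
                    lo = hi = cps
                ranges.append((int(lo, 16), int(hi, 16), script))
        return ranges

    def in_modern_script(cp, ranges):
        return any(lo <= cp <= hi and script in MODERN_SCRIPTS
                   for lo, hi, script in ranges)

    ranges = load_script_ranges()
    repertoire = [0x0061, 0x0430, 0x16A0]   # a, Cyrillic a, a Runic letter
    print([hex(cp) for cp in repertoire if in_modern_script(cp, ranges)])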

A supplemental aspect to consider in isolating the problem space is whether extra effort should go into the font(s) used for the address bar to ensure that these characters are not confusable. That might help in cases where the instances are in the same script. You can see an example of this with the capital I, lowercase L, and the digit one in the address bar font of Edge (IE), though the I/l difference could, and likely should, be greater. Is there any value in pursuing this?

-Stuart

-----Original Message-----
From: ua-discuss-bounces at icann.org [mailto:ua-discuss-bounces at icann.org] On Behalf Of Asmus Freytag
Sent: Sunday, April 23, 2017 5:43 PM
To: ua-discuss at icann.org
Subject: Re: [UA-discuss] Cross-script Homoglyphs

All,

I've started to put together a list of tentative cross-script homoglyphs.

This is based partly on tables published by Unicode and on data found in LGR proposals for the Root Zone; I've augmented the set with some of my own research. I've also marked whether a code point is likely in widespread modern use, as indicated by its inclusion in the Maximal Starting Repertoire for the Root Zone (MSR-2).

Unicode's data cover code points that are "intentional" homoglyphs (that is, expected to look the same). Unfortunately for anyone working with IDNA 2008, they contain a lot of irrelevant entries (which might be useful for other types of identifiers, perhaps), and they are presented in a format that requires knowing the NFD decomposition for all code points; easy for an algorithm, difficult for human reviewers working off IDNA 2008 PVALID lists (which are in NFC, that is, composed form).
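To illustrate the gap, a minimal Python sketch that re-composes such decomposed entries so they can be matched against NFC-based PVALID lists. It assumes a local copy of intentional.txt and its usual "source ; target # comment" column layout; both are assumptions here, not part of the collection itself:

    import unicodedata

    # Sketch: re-compose the (decomposed) entries of Unicode's intentional
    # homoglyph data so they line up with NFC-based PVALID lists.
    # Assumes "XXXX ; YYYY # comment" lines in a local intentional.txt.

    def parse_pairs(path="intentional.txt"):
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#")[0].strip()   # drop comments
                if not line:
                    continue
                src, dst = line.split(";")[:2]
                yield ("".join(chr(int(cp, 16)) for cp in src.split()),
                       "".join(chr(int(cp, 16)) for cp in dst.split()))

    for src, dst in parse_pairs():
        # NFC composes combining sequences back into the precomposed
        # code points that appear on PVALID lists.
        print(unicodedata.normalize("NFC", src),
              unicodedata.normalize("NFC", dst))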

Finally, there are some curious omissions in the data. (Unicode also publishes a much larger list of "confusables", but my take is that its signal-to-noise ratio is unfavorable.) I have removed the few items that constituted pure in-script duplication.
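For anyone who wants to mine the larger confusables list anyway, one crude way to strip the in-script noise is to keep only pairs whose code points fall in different scripts. A sketch, using the first word of each character's Unicode name as a rough stand-in for the script property; that heuristic works for alphabetic scripts but is not a proper Script= lookup:

    import unicodedata

    # Sketch: keep only cross-script pairs from confusables.txt, whose
    # lines look like "0441 ; 0063 ; MA # ( с → c ) ...". The script test
    # is a deliberate simplification: the first word of a character name
    # ("CYRILLIC", "LATIN", "GREEK", ...) stands in for the script.

    def rough_script(ch):
        try:
            return unicodedata.name(ch).split()[0]
        except ValueError:
            return "UNKNOWN"

    def cross_script_pairs(path="confusables.txt"):
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#")[0].strip()
                if not line:
                    continue
                src, dst = line.split(";")[:2]
                src = "".join(chr(int(cp, 16)) for cp in src.split())
                dst = "".join(chr(int(cp, 16)) for cp in dst.split())
                if {rough_script(c) for c in src} != {rough_script(c) for c in dst}:
                    yield src, dst

    for src, dst in cross_script_pairs():
        print(repr(src), "~", repr(dst))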

The LGR data contain some additional suggested homoglyphs. Some of these are not as purely "intentional" as the Unicode set, but as they have been reviewed by the relevant communities, I've added them here.

I have not added homoglyphs that cross script boundaries inside multi-script writing systems, like the set of homoglyph code points that link Hiragana and Katakana. However, it might be useful to add the set of Kana-to-Han homoglyphs (because they might be usable to spoof Chinese-only domains). Whether or not that is useful is one of the questions I hope to get answered by sharing the collection at this stage. (So far, they are not listed.)
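For illustration, a few of the classic Katakana-to-Han pairs (distinct code points that render near-identically in most CJK fonts); whether pairs like these belong in the collection is exactly the open question:

    import unicodedata

    # Well-known Katakana/Han lookalikes; each pair is two distinct
    # code points from different scripts with near-identical glyphs.
    pairs = [("\u30A8", "\u5DE5"),   # エ KATAKANA LETTER E  vs 工 (work)
             ("\u30ED", "\u53E3"),   # ロ KATAKANA LETTER RO vs 口 (mouth)
             ("\u30CB", "\u4E8C"),   # ニ KATAKANA LETTER NI vs 二 (two)
             ("\u30AB", "\u529B")]   # カ KATAKANA LETTER KA vs 力 (power)
    for kana, han in pairs:
        print(kana, hex(ord(kana)), unicodedata.name(kana),
              "~", han, hex(ord(han)))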

Comments welcome,

A./

PS: As of yet, I have not provided the full listing of homoglyph relations where one script has a precomposed code point and the other has a combining sequence. (Many of these code points are not in wide use, so they do not constitute a priority.)
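A quick illustration of that asymmetry, using only the standard library: Latin has a precomposed e-with-macron, Cyrillic does not, so NFC composes one side of the homoglyph pair and leaves the other as a two-code-point sequence:

    import unicodedata

    latin = unicodedata.normalize("NFC", "e\u0304")          # e + COMBINING MACRON
    cyrillic = unicodedata.normalize("NFC", "\u0435\u0304")  # Cyrillic е + macron

    print([hex(ord(c)) for c in latin])     # ['0x113'] -> U+0113 ē, precomposed
    print([hex(ord(c)) for c in cyrillic])  # ['0x435', '0x304'] -> stays a sequence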

PPS: I'm using an RFC 7940-based tool suite, so the result is formatted to look like an LGR, but that's beside the point; it's the old story of the guy with the hammer, to whom everything looks like a nail. The fit isn't actually bad in this case: the format gives some nice freebies, like automatic display of Unicode names, script values, and the like. However, a final collection might look very different, so I'd ask for feedback on the contents and scope, not the layout.
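For the curious, the char/variant records at the heart of an RFC 7940 LGR are simple enough to sketch without any particular tool suite. Here is one cross-script pair (Latin a and Cyrillic а) expressed as mutual variants; note that the "blocked" variant type is the Root Zone convention, not something RFC 7940 itself mandates:

    import xml.etree.ElementTree as ET

    # Sketch of a minimal RFC 7940 LGR fragment recording one homoglyph
    # pair as mutual variants. Element and attribute names follow RFC 7940;
    # a real LGR would also carry a <meta> section (version, scripts, ...).
    NS = "urn:ietf:params:xml:ns:lgr-1.0"
    ET.register_namespace("", NS)

    lgr = ET.Element(f"{{{NS}}}lgr")
    data = ET.SubElement(lgr, f"{{{NS}}}data")
    for cp, var in (("0061", "0430"), ("0430", "0061")):  # a <-> Cyrillic а
        char = ET.SubElement(data, f"{{{NS}}}char", cp=cp)
        ET.SubElement(char, f"{{{NS}}}var", cp=var, type="blocked")

    print(ET.tostring(lgr, encoding="unicode"))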


