[UA-discuss] Cross-script Homoglyphs

Mon Apr 24 00:42:59 UTC 2017

All,

I've started to put together a list of tentative cross-script homoglyphs.

This is based partly on tables published by Unicode and on data found in
LGR proposals for the Root Zone. I've augmented the set with some of my
own research. I've also marked whether a code point is considered to be
in likely widespread modern use, as shown by its inclusion in the
Maximal Starting Repertoire for the Root Zone, MSR-2.

Unicode's data cover "intentional" homoglyphs (that is, code points
expected to look the same). Unfortunately for anyone working with IDNA
2008, they contain a lot of irrelevant entries (which might be useful
for other types of identifiers, perhaps), and they are presented in a
format that requires knowing the NFD decomposition for all code points:
easy for an algorithm, difficult for human reviewers working off IDNA
2008 PVALID lists (which are in NFC, that is, composed).
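
To illustrate the NFD/NFC mismatch a reviewer runs into, here is a
minimal sketch using Python's standard unicodedata module. The helper
names (code_points, nfd_of) are made up for this example and are not
part of any tool mentioned here.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

def nfd_of(text):
    """Return the code points of the NFD (decomposed) form of text."""
    return code_points(unicodedata.normalize("NFD", text))

# A PVALID list shows the composed (NFC) code point ...
composed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
print(code_points(composed))

# ... while a table keyed on NFD refers to the decomposed sequence.
print(nfd_of(composed))
```

A reviewer holding only the composed code point has to perform this
decomposition by hand before looking anything up in an NFD-keyed table.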

Finally, there are some curious omissions in the data. (Unicode
publishes a rather larger list of "confusables", but my take is that
the signal-to-noise ratio there is unfavorable.) I have removed the few
items that constituted pure in-script duplication.

The LGR data contain some additional suggested homoglyphs. Some of these 
are not as purely "intentional" as the Unicode set, but as they have 
been reviewed by the relevant communities, I've added them here.

I have not added homoglyphs that cross script boundaries but stay inside
a multi-script writing system, such as the set of homoglyphs that link
Hiragana and Katakana. However, it might be useful to add the set of
Kana-to-Han homoglyphs (because they might be usable to spoof
Chinese-only domains). Whether or not that is useful is one of the
questions I hope to get answered by sharing the collection at this
stage. (So far, they are not listed.)
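
As a rough sketch of the Kana-to-Han concern: the three pairs below are
illustrative picks of well-known lookalikes and are NOT taken from the
collection. The script guess is a crude stand-in based on Unicode
character names (the Python standard library does not expose the Script
property), and all function names are hypothetical.

```python
import unicodedata

# Illustrative Katakana -> Han lookalikes; an assumption for this
# example, not data from the shared collection.
KANA_TO_HAN = {
    "\u30CB": "\u4E8C",  # KATAKANA LETTER NI -> U+4E8C
    "\u30ED": "\u53E3",  # KATAKANA LETTER RO -> U+53E3
    "\u30AB": "\u529B",  # KATAKANA LETTER KA -> U+529B
}

def rough_script(ch):
    """Guess a script from the first word of the Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def han_lookalike(label):
    """Replace each listed Kana code point with its Han lookalike."""
    return "".join(KANA_TO_HAN.get(ch, ch) for ch in label)

def could_spoof(candidate, han_label):
    """True if candidate differs from han_label but maps onto it."""
    return candidate != han_label and han_lookalike(candidate) == han_label

han_label = "\u4EBA\u53E3"  # two Han code points
candidate = "\u4EBA\u30ED"  # second code point swapped for Katakana RO
print(rough_script(candidate[1]), rough_script(han_label[1]))
print(could_spoof(candidate, han_label))
```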

Comments welcome,

A./

PS: I have not yet provided the full listing of homoglyph relations
where one script has a precomposed code point and the other has a
combining sequence. (Many of these code points are not in the
widest use, so they do not constitute a priority.)
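
One example of such a relation, as a small sketch: Latin has a
precomposed e with acute, while the Cyrillic lookalike exists only as a
base letter plus combining mark, so NFC leaves it as a sequence. The
pair is my own illustrative pick, it assumes a font that renders both
alike, and the helper name nfc_len is made up.

```python
import unicodedata

latin = "\u00E9"           # LATIN SMALL LETTER E WITH ACUTE (precomposed)
cyrillic = "\u0435\u0301"  # CYRILLIC SMALL LETTER IE + COMBINING ACUTE ACCENT

def nfc_len(text):
    """Number of code points in the NFC form of text."""
    return len(unicodedata.normalize("NFC", text))

# Latin stays one code point; Cyrillic stays a two-code-point sequence
# because no precomposed form exists for it.
print(nfc_len(latin), nfc_len(cyrillic))

# The two NFC strings are distinct even though they may look the same.
print(unicodedata.normalize("NFC", cyrillic) == latin)
```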

PPS: I'm using an RFC7940-based tool suite, so the result is formatted 
and looks like an LGR, but that's not the point. It's just like the guy 
with the hammer, to whom everything looked like a nail. The match is not 
really that bad in this case; the formatting gives some nice freebies 
like automatic display of Unicode names, script values and the like. 
However, a final collection might look very different, so I request 
feedback on the contents and scope, not the layout.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mm.icann.org/pipermail/ua-discuss/attachments/20170423/9477b4fe/cross-script-homoglyphs-01.html>