[UA-discuss] Cross-script Homoglyphs
asmusf at ix.netcom.com
Mon Apr 24 00:42:59 UTC 2017
I've started to put together a tentative list of cross-script homoglyphs.
This is based partially on tables published by Unicode and data found in
LGR proposals for the Root Zone. I've augmented the set with some of my
own research. I've also indicated whether a code point is considered
likely to be in widespread modern use, as shown by its inclusion in
the Maximal Starting Repertoire for the Root Zone, MSR-2.
Unicode's data cover code points that are "intentional" (that is,
expected to look the same). Unfortunately for anyone working with IDNA
2008, they contain a lot of irrelevant entries (which might be useful
for other types of identifiers, perhaps) and they are presented in a
format that requires knowing the NFD decomposition for all code points;
easy for an algorithm, difficult for human reviewers working off IDNA
2008 PVALID lists (which are in NFC, that is composed).
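To illustrate the NFC/NFD mismatch described above, here is a minimal sketch using Python's standard unicodedata module. The helper names (code_points, nfd_form, nfc_form) are made up for this example and are not part of any Unicode or IDNA tooling.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

def nfd_form(text):
    """Code points of text after canonical decomposition (NFD)."""
    return code_points(unicodedata.normalize("NFD", text))

def nfc_form(text):
    """Code points of text after canonical composition (NFC)."""
    return code_points(unicodedata.normalize("NFC", text))

# A PVALID list shows the composed code point U+00E9, while data keyed
# on NFD would list the same letter as U+0065 followed by U+0301.
print(nfc_form("e\u0301"))
print(nfd_form("\u00e9"))
```

A reviewer working from a composed list would need to run something like nfd_form on each entry before looking it up in NFD-keyed data.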
Finally, there are some curious omissions in the data. (Unicode
publishes a much larger list of "confusables", but my take is that
the signal-to-noise ratio there is unfavorable.) I have removed the few
items that constituted pure in-script duplication.
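As a rough sketch of what dropping in-script pairs could look like: Python's standard library does not expose the Unicode Script property, so this example assumes the first word of the character name is a usable stand-in for the script. That proxy is an assumption for illustration only and fails for some code points; real tooling should use the actual Script property. The function names and the sample pairs are hypothetical and are not taken from the collection.

```python
import unicodedata

def script_guess(ch):
    """Approximate the script by the first word of the character name.

    This is only a proxy; it is not the Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]

def cross_script_only(pairs):
    """Keep only pairs whose two members appear to be in different scripts."""
    return [(a, b) for a, b in pairs if script_guess(a) != script_guess(b)]

# Sample pairs for illustration.
pairs = [
    ("a", "\u0430"),       # LATIN SMALL LETTER A / CYRILLIC SMALL LETTER A
    ("\u01dd", "\u0259"),  # LATIN SMALL LETTER TURNED E / LATIN SMALL LETTER SCHWA
]
print(cross_script_only(pairs))
```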
The LGR data contain some additional suggested homoglyphs. Some of these
are not as purely "intentional" as the Unicode set, but as they have
been reviewed by the relevant communities, I've added them here.
I have not added homoglyphs that cross script boundaries within a single
multi-script writing system, such as the set of homoglyph code points
that link Hiragana and Katakana. However, it might be useful to add the
set of Kana to Han homoglyphs (because they might be usable to spoof
Chinese-only domains). Whether or not that is useful is one of the
questions I hope to get answered by sharing the collection at this
stage. (So far, they are not listed.)
PS: I have not yet provided the full listing of homoglyph
relations where one script has a precomposed code point and the other
has a combining sequence. (Many of these code points are not in the
widest use, so they do not constitute a priority.)
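For readers unfamiliar with this kind of relation, here is a minimal sketch of how one such pair arises. It assumes a hypothetical one-entry table of base-letter homoglyphs (Latin y and Cyrillic u, U+0443); the table and the function name are illustrative and do not come from the collection.

```python
import unicodedata

# Hypothetical base-letter homoglyph table, for illustration only.
BASE_HOMOGLYPHS = {"y": "\u0443"}  # Latin small y -> Cyrillic small u

def cross_script_variant(precomposed):
    """Return the other-script look-alike of a precomposed letter, in NFC.

    Returns None if the input has no combining marks after NFD or if
    its base letter is not in the table.
    """
    decomposed = unicodedata.normalize("NFD", precomposed)
    if not decomposed:
        return None
    base, marks = decomposed[0], decomposed[1:]
    if not marks or base not in BASE_HOMOGLYPHS:
        return None
    # Swap the base letter, keep the marks, and recompose where possible.
    return unicodedata.normalize("NFC", BASE_HOMOGLYPHS[base] + marks)

# Latin U+00FD (y with acute) is a single precomposed code point. The
# Cyrillic look-alike stays a two-code-point sequence, U+0443 U+0301,
# because Unicode has no precomposed Cyrillic u with acute.
variant = cross_script_variant("\u00fd")
print([f"U+{ord(ch):04X}" for ch in variant])
```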
PPS: I'm using an RFC 7940-based tool suite, so the result is formatted
and looks like an LGR, but that's not the point. It's just like the guy
with the hammer, to whom everything looked like a nail. The match is not
really that bad in this case; the formatting gives some nice freebies
like automatic display of Unicode names, script values and the like.
However, a final collection might look very different, so I request
feedback on the contents and scope, not the layout.
-------------- next part --------------
An HTML attachment was scrubbed...