[UA-discuss] Cross-script Homoglyphs
Asmus Freytag (c)
asmusf at ix.netcom.com
Mon Apr 24 15:41:05 UTC 2017
On 4/24/2017 6:57 AM, Stuart Stuple wrote:
> This is an amazing effort. Thank you. Definitely going into my references category.
Thanks. This is far from finalized, so I hope to get feedback on the
concept and its details, and then update you all with a revision.
> Is there perhaps a way for the infrastructure to limit to a smaller set of Unicode based on living languages (those with active readers above a specific value)? That would eliminate some of the code points. Put another way, is the intent of universal acceptance to support all languages used today in the world or all code points in Unicode?
This is an excellent question.
The root zone LGR project is designed to support as many languages as is
feasible. There are no easily available data on the use of writing
systems, but SIL's EGIDS is a good proxy: it rates the status of a
language based on, for example, its mode of transmission and the degree
to which it is supported by public and private institutions.
Unfortunately, that resource just disappeared behind a paywall.
> A supplemental aspect to consider in isolating the problem space might be whether the font(s) used for the address bar should have extra effort taken to ensure that these are not confusable. That might help in cases where the instances are in the same script. You can see an example of this with the capital i, lowercase L, and the one in the Edge (IE) font address bar (though the Il difference could / should likely be greater). Is there any value in pursuing this?
Consolas, to give one example, makes many distinctions not found in
ordinary text. But would users expect to see them?
However, even that font does not make all distinctions (and I'm not sure
how many scripts it supports). Some of the code points I included in my
list are not distinguished by it. Others are distinguished only by it.
As long as one can't reliably predict that all (or at least
overwhelmingly most) users can see a certain distinction (that is,
actually get presented with glyphs that differ) then the conservative
approach would be to treat the code points in question as homoglyphs.
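To sketch what that conservative treatment looks like in practice: one could fold each suspected homoglyph to a common representative before comparing labels, so that visually identical strings collide. The mapping table below is a tiny hypothetical sample for illustration, not the actual collection, and the function name is my own invention:

```python
import unicodedata

# Hypothetical sample of cross-script homoglyph pairs (folded to Latin).
# A real table would come from the reviewed homoglyph collection.
HOMOGLYPH_FOLD = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

def skeleton(label: str) -> str:
    """Fold each code point to its homoglyph representative, if any."""
    return "".join(HOMOGLYPH_FOLD.get(ch, ch) for ch in label)

# 'paypal' spelled with Cyrillic ER and A is a different string ...
mixed = "\u0440\u0430yp\u0430l"
assert mixed != "paypal"
# ... but collides with the all-Latin label once folded:
assert skeleton(mixed) == skeleton("paypal")
```

The point of the fold-then-compare approach is exactly the conservative stance above: if we cannot rely on users seeing a glyph difference, the two labels are treated as the same.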
> -----Original Message-----
> From: ua-discuss-bounces at icann.org [mailto:ua-discuss-bounces at icann.org] On Behalf Of Asmus Freytag
> Sent: Sunday, April 23, 2017 5:43 PM
> To: ua-discuss at icann.org
> Subject: Re: [UA-discuss] Cross-script Homoglyphs
> I've started to put together a list of tentative cross script homoglyphs.
> This is based partially on tables published by Unicode and data found in LGR proposals for the Root Zone. I've augmented the set with some of my own research. I've also indicated whether a code point was considered to be in likely widespread modern use as indicated by its inclusion into the Maximal Starting Repertoire for the Root Zone, MSR-2.
> Unicode's data cover code points that are "intentional" (that is, expected to look the same). Unfortunately for anyone working with IDNA 2008, they contain a lot of irrelevant entries (which might be useful for other types of identifiers, perhaps), and they are presented in a format that requires knowing the NFD decomposition for all code points: easy for an algorithm, difficult for human reviewers working off IDNA 2008 PVALID lists (which are in NFC, that is, composed).
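To make that NFC/NFD mismatch concrete, here is a small Python sketch (standard library only) showing that a precomposed code point from an NFC list has to be decomposed before it can be matched against NFD-keyed data:

```python
import unicodedata

# IDNA 2008 PVALID lists are given in NFC (precomposed) form:
nfc_char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
assert unicodedata.normalize("NFC", nfc_char) == nfc_char

# Unicode's "intentional" data is keyed on decomposed (NFD) form,
# so a reviewer has to know that U+00E9 decomposes to e + combining acute:
nfd_form = unicodedata.normalize("NFD", nfc_char)
assert nfd_form == "e\u0301"

# Trivial for an algorithm; invisible to a human reading a PVALID list:
print([f"U+{ord(c):04X} {unicodedata.name(c)}" for c in nfd_form])
# ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
```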
> Finally, there are some curious omissions in the data. (Unicode publishes a rather larger list of "confusables", but my take is that there, the signal to noise ratio is unfavorable). I have removed the few items that constituted pure in-script duplication.
> The LGR data contain some additional suggested homoglyphs. Some of these are not as purely "intentional" as the Unicode set, but as they have been reviewed by the relevant communities, I've added them here.
> I have not added homoglyphs that cross script boundaries but stay inside multi-script writing systems, such as the set of homoglyph code points that links Hiragana and Katakana. However, it might be useful to add the set of Kana to Han homoglyphs (because they might be usable to spoof Chinese-only domains). Whether or not that is useful is one of the questions I hope to get answered by sharing the collection at this stage. (So far, they are not listed).
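For what it's worth, the Hiragana/Katakana case is easy to demonstrate: the two HE kana are a textbook homoglyph pair within one writing system, and no normalization form unifies them, so any folding has to come from a homoglyph table (Python, standard library):

```python
import unicodedata

hiragana_he = "\u3078"  # HIRAGANA LETTER HE
katakana_he = "\u30d8"  # KATAKANA LETTER HE

# Distinct code points in distinct scripts, yet (near-)identical glyphs:
assert hiragana_he != katakana_he
assert unicodedata.name(hiragana_he) == "HIRAGANA LETTER HE"
assert unicodedata.name(katakana_he) == "KATAKANA LETTER HE"

# They are not canonically or compatibility equivalent, so even NFKC
# leaves them distinct -- normalization alone cannot catch this pair:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert (unicodedata.normalize(form, hiragana_he)
            != unicodedata.normalize(form, katakana_he))
```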
> Comments welcome,
> PS: I have, as yet, not provided the full listing of homoglyph relations where one script has a precomposed code point and the other has a combining sequence. (Many of these code points are not in the widest use, so they do not constitute a priority).
> PPS: I'm using an RFC7940-based tool suite, so the result is formatted and looks like an LGR, but that's not the point. It's just like the guy with the hammer, to whom everything looks like a nail. In this case the match is not really that bad; the formatting gives some nice freebies like automatic display of Unicode names, script values and the like.
> However, a final collection might look very different, so I request feedback on the contents and scope, not the layout.