[UA-discuss] Fw: Re: IDN Implementation Guidelines [RE: Re : And now about phishing...]

Mark Svancarek marksv at microsoft.com
Mon Apr 24 16:11:08 UTC 2017


This is a good taxonomy and UASG should use it consistently.


- identical              renders same in all fonts and sizes
- near identical         may not be perfectly identical but almost
- not reliably distinct  may be distinct if shown side by side or in some contexts
- confusingly similar    close enough that it can get misidentified
- similar                everything else that is not clearly distinct

TL;DR: Has anyone made an effort to formally assign these categories to the various pairs of code points, and to document the result?

From: ua-discuss-bounces at icann.org [mailto:ua-discuss-bounces at icann.org] On Behalf Of Asmus Freytag
Sent: Saturday, April 22, 2017 11:56 AM
To: ua-discuss at icann.org
Subject: Re: [UA-discuss] Fw: Re: IDN Implementation Guidelines [RE: Re : And now about phishing...]

On 4/22/2017 9:16 AM, Andrew Sullivan wrote:

On Sat, Apr 22, 2017 at 01:32:08PM +0000, nalini.elkins at insidethestack.com wrote:

For example, you may wish to see the following permutations, which have already been obtained (and, it appears, not by Apple).



www.applé.com   www.xn--appl-epa.com
www.applê.com   www.xn--appl-jpa.com
www.applė.com   www.xn--appl-yva.com
www.applę.com   www.xn--appl-8va.com
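
For readers who want to reproduce the xn-- forms listed here, a minimal sketch follows. It assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules rather than IDNA 2008; for these four names the two agree. The helper name to_a_label is illustrative only.

```python
def to_a_label(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (xn--) form, label by label."""
    # Assumption: the stdlib "idna" codec (IDNA 2003) is good enough here.
    return hostname.encode("idna").decode("ascii")

# Final letters: e-acute, e-circumflex, e-dot-above, e-ogonek.
for last in ("\u00e9", "\u00ea", "\u0117", "\u0119"):
    print(to_a_label("www.appl" + last + ".com"))
```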



Do you think that those qualify as "homographs"?  I suppose they
might, as might àpple.com and so on, but these at least don't seem to
me to be any different than app1e.com, which we decided long ago was
Apple's problem and nobody else's.

The claim that letters with diacritics are homoglyphs of the undecorated letters is
rather tenuous. There are some words where diacritics are optional in some languages,
and for those words, even a native reader may not notice the difference in spelling,
but that is not all that different from "color" vs. "colour".

A slightly stronger case might be made that some diacritics are not reliably
distinguished from each other. Cedilla and comma below come to mind. There are
plenty of examples, mostly in print, that show the use of one in place of the other,
even if the language ostensibly calls for a specific one and not the other. Their shapes
are not so different that the substitution would always be jarring.

Extending that, it's generally the diacritics below that are less readily distinguished
from each other; partially that's because there's less real estate in the glyph (and the
bottom of the line may be clipped by the following line of text if the spacing is too
tight).

The real issue with diacritics is when multiple ones are applied to a base letter.
While they are supposed to stack neatly outward in the order that they are entered,
there isn't room enough at the bottom of the glyph to show that reliably. Also,
sometimes diacritics "overprint" instead of stacking.

From the perspective of making IDNs universally acceptable, registries should be
encouraged to restrict the use of dual diacritics unless they are central to the
writing system, as they are for Vietnamese. The Unicode principle of allowing all
combinations is fine for general texts (or more likely, academic texts), but
misplaced for identifiers.

This is quite different to the case of true homoglyphs of the sort
that Asmus is talking about, where the very same glyph is normally
used in two different scripts such that nobody would be able to tell
the difference.  One maybe could argue that "аррӏе" is pure homoglyphs
(0430, 0440, 0440, 04CF, 0435), but I think it's tough to argue for it.

"арр" / "app" or "аре" / "ape" would be true homographs (always identical),
but the palochka (04CF - "ӏ") is at best a near homoglyph. It renders
identically in many fonts, even though it can be rather distinct in
others (especially certain console fonts).

[inline image: image001.png]

Since those console fonts are a minority, the palochka should probably be treated
conservatively as a homoglyph.
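
To check the code point list quoted above against the string itself, here is a small sketch using only Python's standard unicodedata module. The helper name describe is illustrative.

```python
import unicodedata

# The all-Cyrillic look-alike of "apple" discussed in this thread.
label = "\u0430\u0440\u0440\u04cf\u0435"

def describe(s):
    """Return (hex code point, Unicode character name) for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

for code_point, name in describe(label):
    print(code_point, name)
```

The fourth entry is the palochka (04CF); the other four are ordinary Cyrillic letters whose glyphs coincide with Latin a, p and e.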

Remember, the IDNA rules are really _quite_ restrictive, and if
registries also require "same script per label" those restrictions
catch an _awful_ lot of corner cases (that was the outcome of the
"paypal" controversy some time ago).

If you want to argue that policy should be different, that's fine, but
it seems to me to require some PDP within ICANN.  Note that ICANN is
probably going to propose some rules for variant handling, and
combined with the LGR stuff that is working its way through the system
we may find an awful lot of stuff is blocked.
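
The "same script per label" restriction mentioned above can be sketched roughly as follows. This is an approximation under a stated assumption: Python's standard library does not expose the Unicode Script property, so the first word of each character name (LATIN, CYRILLIC, ...) stands in for it. The function names and the mixed-script "paypal" spelling are illustrative, not taken from any registry's rules.

```python
import unicodedata

def rough_scripts(label):
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphen are shared across scripts
        # Assumption: the first word of the name is a usable script proxy.
        scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def is_single_script(label):
    return len(rough_scripts(label)) <= 1

print(is_single_script("paypal"))        # all Latin
print(is_single_script("p\u0430ypal"))   # Cyrillic a (0430) mixed into Latin
```

Note that an all-Cyrillic label such as 0430 0440 0440 04CF 0435 passes this check, which is why whole-script homographs need separate handling.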

Actually, you can block "an awful lot" and yet not affect the universe of valid
labels very much.

A case in point is Ethiopic, which, for linguistic reasons I won't go into, is best
handled by treating a number of code points internal to the script as variants of
each other (making them mutually exclusive if they occur as alternates in otherwise
identical labels).
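
The "mutually exclusive" behaviour can be sketched as below. The variant sets are hypothetical placeholders built from the Latin/Cyrillic pairs named elsewhere in this thread, not the actual Ethiopic variant data, and the Registry class is illustrative rather than any real registry interface.

```python
# Hypothetical variant sets; a real LGR would define these per script.
VARIANT_SETS = [
    {"a", "\u0430"},  # Latin a / Cyrillic a
    {"p", "\u0440"},  # Latin p / Cyrillic er
    {"e", "\u0435"},  # Latin e / Cyrillic ie
]

# Map every code point in a set to one representative of that set.
CANONICAL = {cp: min(vs) for vs in VARIANT_SETS for cp in vs}

def variant_key(label):
    """Labels that differ only by variant code points share the same key."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)

class Registry:
    def __init__(self):
        self._taken = {}  # variant key -> first label registered under it

    def register(self, label):
        key = variant_key(label)
        if key in self._taken:
            return False  # blocked: a variant of this label already exists
        self._taken[key] = label
        return True
```

Once "ape" is registered, the Cyrillic string 0430 0440 0435 is blocked, while "apes" remains available because it contains a code point outside the shared key.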

The linguistic reasons apply, strictly speaking, only to one of the languages (albeit
the most prominent one). A concern was raised about how that would interfere with
the ability to register labels in other languages.

For the TLD IDN project we ran an analysis over a corpus of unique words, separated
by language. We found that the reduction in available labels was much less than
one might have guessed from the considerable number of variants. The reductions
came to a few percent, much less than the effect of the languages sharing words
that happened to be spelled the same (e.g. English/French "but" or English/German
"also").

The main reason for this is that legitimate labels would tend to contain at least one
distinct code point, enough to prevent the label from being blocked. (Just as
"dapple" is no longer a homograph of any Cyrillic string even if "apple" is.)

The other reason is that many strings that would otherwise be homographs do not
make sense in the other context, and therefore are much less likely to be used for
legitimate labels (and only used for phishing).

For example, the string "аррӏе" (using Cyrillic code points) makes no sense to anyone
using a language written in Cyrillic.
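
The "dapple" versus "apple" observation above can be illustrated with a short sketch. The mapping is an assumption and deliberately partial: it holds only the pairs named in this thread, with the palochka treated conservatively as a homoglyph of "l". A real table would be larger.

```python
# Assumed, partial Latin -> Cyrillic homoglyph table (pairs from this thread).
LATIN_TO_CYRILLIC = {
    "a": "\u0430",
    "p": "\u0440",
    "e": "\u0435",
    "l": "\u04cf",  # palochka, treated conservatively as a homoglyph
}

def cyrillic_homograph(label):
    """Return the all-Cyrillic look-alike, or None if any letter lacks one."""
    out = []
    for ch in label:
        if ch not in LATIN_TO_CYRILLIC:
            return None  # one distinct code point is enough to break the match
        out.append(LATIN_TO_CYRILLIC[ch])
    return "".join(out)

print(cyrillic_homograph("apple") is not None)   # a full look-alike exists
print(cyrillic_homograph("dapple") is not None)  # "d" has no entry in this table
```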

For short labels, acronyms etc. might increase the set of legitimate labels beyond
the word analysis we were doing, but it would be instructive to run such experiments
on Latin/Cyrillic corpora assuming the widest definition of homoglyph variants;
if those corpora are not limited to dictionary words, but include names, brands
and acronyms, that would yield a pretty reliable estimate.
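
The proposed corpus experiment could be sketched along these lines. Everything here is an assumption for illustration: the folding table is a tiny stand-in for "the widest definition of homoglyph variants", and the two word lists are toy data, not real corpora.

```python
# Assumed folding of Cyrillic code points onto their Latin look-alikes.
FOLD = {"\u0430": "a", "\u0440": "p", "\u0435": "e", "\u04cf": "l"}

def skeleton(word):
    """Fold a word so that homoglyph variants compare equal."""
    return "".join(FOLD.get(ch, ch) for ch in word)

def colliding_labels(latin_words, cyrillic_words):
    """Cyrillic words whose skeleton matches that of some Latin word."""
    latin_skeletons = {skeleton(w) for w in latin_words}
    return sorted(w for w in cyrillic_words if skeleton(w) in latin_skeletons)

# Toy corpora (hypothetical); the second list holds Cyrillic strings.
latin = ["ape", "apple", "dapple", "pale"]
cyrillic = ["\u0430\u0440\u0435", "\u043c\u0438\u0440"]

blocked = colliding_labels(latin, cyrillic)
print(len(blocked), "of", len(cyrillic), "Cyrillic labels collide")
```

Run over real Latin and Cyrillic corpora that include names, brands and acronyms, the ratio printed at the end would give the kind of estimate described above.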

I predict that the overlap will prove smaller than feared, for the same reasons
that we found for Ethiopic, some high-visibility exceptions notwithstanding.

In any case, I think our purpose is very badly served by conflating
these two different kinds of issues.

Agreed. Our purpose is best served by making careful distinctions among these:

- identical              renders same in all fonts and sizes
- near identical         may not be perfectly identical but almost
- not reliably distinct  may be distinct if shown side by side or in some contexts
- confusingly similar    close enough that it can get misidentified
- similar                everything else that is not clearly distinct


My take is that there is a call for addressing the first two at the level of an LGR
(registry policy) and that the third requires some judgement call on whether it's
more like the first two, or more like the latter two.

A./

Best regards,

A

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 973 bytes
Desc: image001.png
URL: <http://mm.icann.org/pipermail/ua-discuss/attachments/20170424/78797f9e/image001.png>

