[UA-discuss] Fate of IDNA2008

Asmus Freytag asmusf at ix.netcom.com
Tue Mar 6 03:58:37 UTC 2018


There's been a flare-up of the discussion about what to do with IDNA2008 
(currently stuck at Unicode Version 6.3.0, compared to a current version 
of 10.0, with 11.0 upcoming).

Here's a snapshot. (For more, see idna-update at ietf.org.)

A./

PS: concerns related to emoji are touched upon as well.



-------- Forwarded Message --------
Subject: 	Re: [Idna-update] [Ext] FWD: Expiration impending: 
<draft-klensin-idna-rfc5891bis-01.txt>
Date: 	Mon, 5 Mar 2018 19:34:22 -0800
From: 	Asmus Freytag <asmusf at ix.netcom.com>
To: 	idna-update at ietf.org, Suzanne Woolf <suzworldwide at gmail.com>



Last summer, I spent some time doing a detailed analysis of modern
scripts (the ones in which the lion's share of IDNs, other than some
vanity ones, will be sold), limited to the currently approved subset
(Unicode 6.3.0). This survey surfaced a considerable fraction of
potentially "troublesome" characters, fundamentally no different from
the one that caused the halt in updating the IDNA2008 tables. The
survey benefited from input beyond my own personal experience,
including input and data from Unicode experts engaged in reviewing
Unicode's tables of intentionally identical code points, as well as
expertise and data generated as part of ICANN's Root Zone LGR project.

A draft version of the survey was made public as the Internet-Draft
https://datatracker.ietf.org/doc/draft-freytag-troublesome-characters/
The results were discussed ad hoc with IETF and IAB experts on various
occasions and by e-mail. These discussions led to some conclusions.

The following points are worth noting:

(1) A small but significant number of both *existing* code points and
combining sequences exhibit the same issues as the code point that led
to the IAB recommendation to halt the update of the IDNA2008 tables.
The problem can be characterized as code points having identical (or
near-identical) appearance, with the code point differences not folded
by normalization. (The problem case found in 2015 only had "near
identical" appearance.)
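
To make the mechanism concrete, here is a minimal Python sketch
(assuming the pair in question is the frequently cited Arabic
beh-with-hamza case; any Unicode-aware Python build will do): NFC
leaves the precomposed code point and the visually equivalent combining
sequence as distinct strings, so two labels can look the same while
comparing as different.

    import unicodedata

    # U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE is a single code point that
    # renders essentially like U+0628 ARABIC LETTER BEH followed by
    # U+0654 ARABIC HAMZA ABOVE, but it carries no canonical decomposition,
    # so normalization does not fold the two representations together.
    precomposed = "\u08A1"
    sequence = "\u0628\u0654"

    nfc_a = unicodedata.normalize("NFC", precomposed)
    nfc_b = unicodedata.normalize("NFC", sequence)

    print(nfc_a == nfc_b)                      # False: still two distinct labels
    print([f"U+{ord(c):04X}" for c in nfc_b])  # ['U+0628', 'U+0654'], unchanged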

(2) Objectively, halting the process did and does nothing about
existing "troublesome characters". Common to all of them is that
disallowing repertoire elements is generally a poor way of mitigating
the issues - mainly because doing so would arbitrarily favor letters
over digits, or one language or writing system over another. The
exception is that disallowing some combining marks intended for more
technical purposes would prioritize regular text, and that would
address a subset of these cases. All other cases require different
mitigation approaches, well within the scope of registration policies.

(3) For existing scripts, the pending additions are not expected to
significantly add to the existing problem. Compared to the number of
existing cases, the pending additions are few, and would be addressed
with the same mitigation approaches. (Pending additions are not
predominantly identical to existing code points; the vast majority are
added because they are, in fact, different, isolated exceptions
notwithstanding).

(4) For new scripts, John's observation that they are mostly not
modern-use means that problems are inherently limited by several
factors. Most of these scripts do not represent useful markets, as few
people can read them. The exceptions are decorative scripts like
hieroglyphics, which in turn have a distinct appearance. For
non-decorative ones, some may exhibit code points that are accidentally
close in appearance to code points in other scripts. Whether there would
be font resources to present code points in these obsolete scripts is
doubtful; other attack vectors would promise easier success. (User
agents flagging unusual obsolete scripts would not create many false
positives.)

(5) The issue of non-normalized identical appearance pales in
significance compared to other issues with existing code points for
modern, widely used scripts. For Chinese and Arabic, not implementing
variants is arguably not state of the art; it allows unregulated
registration of confusing labels on a large scale. For all South East
Asian / Indic scripts, not implementing context rules for a good portion
(about a third) of the characters of each script is not state of the
art; it allows labels that cannot be interpreted by either layout
engines or readers, since in these scripts characters do not exist in
isolation. For zones implementing multiple scripts, not blocking
cross-script homoglyphs is arguably also not state of the art; it leads
to whole-label confusables that are impossible to detect by inspection.
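
To illustrate the cross-script case with a toy example (a sketch only;
the label itself is hypothetical), the word "scope" can be written
entirely with Cyrillic look-alikes, producing a second, formally
distinct label that no reader can tell apart from the Latin one:

    import unicodedata

    latin = "scope"
    # The same visual word built entirely from Cyrillic homoglyphs
    # (DZE, ES, O, ER, IE) -- a valid single-script label in its own right.
    cyrillic = "\u0455\u0441\u043E\u0440\u0435"

    print(latin == cyrillic)    # False: two different labels, one appearance
    for ch in cyrillic:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0455 CYRILLIC SMALL LETTER DZE, U+0441 ... ES, U+043E ... O,
    # U+0440 ... ER, U+0435 ... IE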

(6) By freezing the update of the IDNA2008 tables, the IAB effectively
declares that IDNA2008 is "stuck in the past". This incrementally
increases the pressure on / temptation for various operators to
unilaterally move beyond IDNA2008. If such "wildcatting" can cloak
itself in the moral mantle of support for some minority languages, it
provides cover for those cynically selling emoji labels.

(7) The only way forward is to re-synchronize the updates with the
current version of Unicode (11.0 is being readied and there will be a
new version every year). This must be coupled with simultaneously
addressing the gaps between "PVALID" and the state of the art.

(8) The ability to define blocked variants (code points that can be
substituted for each other by an unsuspecting user) and to robustly
exclude later registrations differing only by such a variant from an
earlier registration can take care of nearly all of the troublesome
characters, as well as of the problem of cross-script homoglyphs. This
is currently being implemented in the Root Zone for a growing list of
modern scripts (eventually 28 of them). Other zones can build on that
work (for example by integrating digits and hyphen, and optionally by
extending that analysis to scripts that look similar to some modern ones).
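
Operationally, "blocked" means that at registration time the zone
expands an applied-for label into its set of variant labels and refuses
it if that set intersects anything already registered. A minimal sketch
(the two-entry variant table below is purely hypothetical, not taken
from any actual LGR):

    from itertools import product

    # Purely illustrative variant mapping: each code point maps to the set
    # of code points an unsuspecting user could substitute for it.
    VARIANTS = {
        "\u0455": {"\u0455", "s"},   # Cyrillic dze <-> Latin s (illustrative)
        "s":      {"s", "\u0455"},
    }

    def variant_labels(label):
        """All labels reachable by substituting variant code points."""
        choices = [sorted(VARIANTS.get(ch, {ch})) for ch in label]
        return {"".join(combo) for combo in product(*choices)}

    def is_blocked(candidate, registered):
        """True if the candidate collides with a registration via variants."""
        existing = {v for label in registered for v in variant_labels(label)}
        return not variant_labels(candidate).isdisjoint(existing)

    registered = {"sos"}
    print(is_blocked("\u0455o\u0455", registered))   # True: variant collision
    print(is_blocked("sun", registered))             # False: no collision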

(9) The security implications of not defining context rules for the
South East Asian / Indic scripts need to be publicized in a way that
lets non-specialists understand the need and know where to look for
examples (such as the Root Zone LGRs) that successfully implement such
rules. (While IDNA2008 provides context rules for some special code
points, the protocol level may not be the most appropriate place for
general context rules; some rules could be made stricter or looser for
zones that are limited to specific scripts or languages.)
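
As a sketch of what such a context rule amounts to in practice
(deliberately simplified and hypothetical; the actual rules in the Root
Zone LGRs are considerably richer), consider the requirement that a
Devanagari dependent vowel sign appear only immediately after a
consonant:

    # Deliberately simplified, hypothetical context rule for Devanagari:
    # a dependent vowel sign (matra) must immediately follow a consonant.
    # Real LGR rules also handle viramas, nuktas, and many other cases.
    CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}  # KA .. HA (simplified)
    MATRAS = {chr(cp) for cp in range(0x093E, 0x094D)}      # AA .. AU signs (simplified)

    def matras_in_context(label):
        """True if every dependent vowel sign is preceded by a consonant."""
        return all(
            ch not in MATRAS or (i > 0 and label[i - 1] in CONSONANTS)
            for i, ch in enumerate(label)
        )

    print(matras_in_context("\u0915\u093E"))   # True:  KA + AA sign, a normal syllable
    print(matras_in_context("\u093E\u0915"))   # False: orphaned vowel sign up front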

(10) The security implications of emoji are not understood and need to
be publicized in ways that are accessible to those who might be tempted
to flout the rules to get a "cute" label. There is enormous public
interest in emoji, and it is spread across users of nearly all modern
scripts. Given the enormous pressure driving demand for these, it is
frankly surprising how limited the wildcatting has been up to now. The
concern would be that simply pointing to an existing standard
(especially one that artificially limits itself to the past - that is,
Unicode 6.3.0) may not prove a sufficient counter; there is a chance
that a clear explanation of why emoji labels cannot be made secure
might give at least some users pause.
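
The mechanical reason emoji fall outside IDNA2008 is worth spelling out
for that audience: emoji code points carry the General_Category
"Symbol", and the RFC 5892 derivation admits, roughly, only letters,
digits and combining marks plus a short list of exceptions, so symbols
come out DISALLOWED no matter which Unicode version the tables
reference. A quick check (a sketch; the exact output depends on the
Unicode data shipped with the Python build):

    import unicodedata

    # Emoji are General_Category "So" (Symbol, other); RFC 5892's derivation
    # keeps, roughly, letters, digits and combining marks, so symbols end up
    # DISALLOWED regardless of the Unicode version behind the tables.
    for ch in ("\U0001F600", "\u2615", "a", "\u0915"):
        print(f"U+{ord(ch):05X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
    # U+1F600  So  GRINNING FACE
    # U+02615  So  HOT BEVERAGE
    # U+00061  Ll  LATIN SMALL LETTER A
    # U+00915  Lo  DEVANAGARI LETTER KA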

(11) Finally, none of the issues discussed here exists in isolation.
The next layer of the onion is "similarity" or plain "confusable"
labels. In an environment where .com and .corn or apple.com and
app1e.com may legitimately coexist (as far as the protocol is
concerned), and where that situation is much worse in many scripts, it
is simply questionable whether an action as drastic as anchoring the
IDNA2008 protocol at an arbitrary and (as we have seen) not problem-free
point is justified or proportionate.

This reality calls for a layered approach where each layer takes care of
its slice of the problem:

(a) The protocol takes care of all cases that can be remedied by
property-based inclusion applicable to all languages, and the occasional
context rule of global applicability (RFC 5891, RFC 5892, etc.).

(b) The next layer out consists of some tailoring of the repertoire for
each zone, but, importantly, also implementing the state of the art when
it comes to blocking identical variants. This would rely on the
technology around LGRs and RFCs 7940 and 8228.

(c) The outer layer would finally address the issue of the halo of
"confusables". There's room for new technologies allowing automated
string similarity scoring and review.
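
To give a flavour of what that outer layer could do (a toy sketch only;
a real implementation would use something along the lines of the
confusable "skeleton" mechanism of Unicode TS #39 with its full data
tables), the .com/.corn and apple/app1e examples reduce to mapping
every label to a canonical skeleton and comparing skeletons instead of
raw strings:

    # Toy skeleton mapping in the spirit of UTS #39 confusable detection;
    # the table is a tiny illustrative sample, not the real confusables data.
    SKELETON_MAP = {
        "rn": "m",        # "corn" vs. "com"
        "1": "l",         # "app1e" vs. "apple"
        "\u0455": "s",    # Cyrillic dze vs. Latin s
        "\u043E": "o",    # Cyrillic o vs. Latin o
    }

    def skeleton(label):
        """Collapse visually confusable sequences to a canonical form."""
        for src, dst in SKELETON_MAP.items():
            label = label.replace(src, dst)
        return label

    print(skeleton("corn") == skeleton("com"))       # True: flag for review
    print(skeleton("app1e") == skeleton("apple"))    # True
    print(skeleton("sun") == skeleton("moon"))       # False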

A./

On 3/5/2018 3:25 PM, John C Klensin wrote:
>
> --On Monday, March 5, 2018 16:23 -0500 Suzanne Woolf
> <suzworldwide at gmail.com> wrote:
>
>> Patrik,
>>
>> That was before my time on the IAB, sorry.
>>
>> But as it happens, I'm looking at an update to the IAB
>> statement, so have been pondering this very question. Do you
>> think that
>> https://www.ietf.org/id/draft-klensin-idna-5892upd-unicode70-05.txt
>> covers the options reasonably well?
> Suzanne,
>
> Patrik may have a different answer, but, as the author of that
> document, let me provide two opinions:
>
> (1) No.  There is a newer version that covers more of the issues
> and has a better discussion of options, but I have no intention
> of posting it before there is a plan about how the document will
> be processed.   I believe I told the IAB that while Andrew was
> still Chair and told the late and lamented IAB I18N Program that
> while Andrew was still chairing it.   I do have some ideas about
> how it, and some other relevant documents, might be processed,
> but, until there is evidence that someone in "the leadership" is
> interested, it has seemed to me that trying to write those
> things down (or, e.g., flying to London to try to discuss the
> issues and possibilities) would be a waste of time.
>
> My amusement level is decreasing rapidly about the IAB's going
> off and discussing issues like this internally without making
> any attempt to involve members of the community who have actual
> expertise and a history of doing most of the work.  In recent
> months, I've been reminded by four well-known colleagues (two of
> them former IAB Chairs) that the IAB got itself into a situation
> in the early 1990s in which the community concluded that it had
> gotten out of touch, thought its opinions were sufficient unto
> themselves, and that it wasn't interested in listening to experts
> in the community who were not on the IAB.  Whether that
> characterization was accurate or not, and whatever one thinks of
> the outcomes of the changes that followed, the results clearly
> did not go well for the IAB or much of its then-membership.
>
> (2) As much to Kim as to you... Although the IAB statement was
> based on a limited understanding of only part of the problem,
> the underlying problems, as described in the unposted
> draft-klensin-idna-5892upd-unicode70-06.txt and, to a
> considerable degree, in the expired but available
> draft-klensin-idna-5892upd-unicode70-05.txt, are unchanged.
> Consider some of the behavior being exhibited by selected TLD
> registries. I note from recent news that one of them is not only
> selling names that are invalid under IDNA2008 (and that would
> remain invalid if the tables were updated to Unicode 10.0
> without changes being made to the Standard itself), but that
> their record-keeping systems are weak enough that they managed to
> sell the same invalid strings multiple times and had to withdraw
> the sales, and that they are apparently also running a futures
> market in such names using graphemes that are expected to be
> assigned as code points in Unicode 11 and 12.
>
> There is another issue with Unicode 7 onward as far as domain
> names are concerned.  Most, if not all, of the scripts that are
> used for contemporary, widely-used languages that are
> well-represented on the Internet were coded long before now.  As
> a result, most of the characters added in Unicode 7 and later
> fall into one of three categories:
>
> (i) Scripts that are less well-known in the Internet community
> and which therefore carry even greater risk of "surprises"
> (including but certainly not limited to, confusion with other
> code points, issues with combining characters and what
> normalization does or does not do, and special rendering issues)
> if used in domain names.   Andrew's comment about this being
> hard apply much more strongly to such scripts than to
> better-known ones.
>
> (ii) Letter- or digit-like characters added to existing scripts.
> These potentially raise all of the issues listed above but, if
> added to replace or supplement characters or character sequences
> that are in use already, they may raise issues similar to the
> relationship between Traditional and Simplified Chinese.
>
> (iii) Characters (or, more precisely, code points) that would not
> qualify for use in domain name labels because of the properties
> Unicode assigned to them.  That list includes emoji (because
> they are symbols) but is definitely not limited to them.
>
> Now, there are constituencies within ICANN that would like all
> of those characters added so that they can start selling them.  But it
> would be irresponsible (at best) on the IETF's part (or that of
> ICANN) to encourage such things.
>
>      john
>


