<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>There's been a flare-up of the discussion about what to do with

      IDNA2008 (currently stuck at Unicode Version 6.3.0, compared to a

      current version of 10.0, with 11.0 upcoming).</p>

    <p>Here's a snapshot. (For more, see <a class="moz-txt-link-abbreviated" href="mailto:idna-update@ietf.org">idna-update@ietf.org</a>.)</p>

    <p>A./</p>

    <p>PS: concerns related to emoji are touched upon as well.<br>

    </p>

    <div class="moz-forward-container"><br>

      <br>

      -------- Forwarded Message --------

      <table class="moz-email-headers-table" cellspacing="0"

        cellpadding="0" border="0">

        <tbody>

          <tr>

            <th nowrap="nowrap" valign="BASELINE" align="RIGHT">Subject:

            </th>

            <td>Re: [Idna-update] [Ext] FWD: Expiration impending:

              &lt;draft-klensin-idna-rfc5891bis-01.txt&gt;</td>

          </tr>

          <tr>

            <th nowrap="nowrap" valign="BASELINE" align="RIGHT">Date: </th>

            <td>Mon, 5 Mar 2018 19:34:22 -0800</td>

          </tr>

          <tr>

            <th nowrap="nowrap" valign="BASELINE" align="RIGHT">From: </th>

            <td>Asmus Freytag <a class="moz-txt-link-rfc2396E" href="mailto:asmusf@ix.netcom.com">&lt;asmusf@ix.netcom.com&gt;</a></td>

          </tr>

          <tr>

            <th nowrap="nowrap" valign="BASELINE" align="RIGHT">To: </th>

            <td><a class="moz-txt-link-abbreviated" href="mailto:idna-update@ietf.org">idna-update@ietf.org</a>, Suzanne Woolf

              <a class="moz-txt-link-rfc2396E" href="mailto:suzworldwide@gmail.com">&lt;suzworldwide@gmail.com&gt;</a></td>

          </tr>

        </tbody>

      </table>

      <br>

      <br>

      <pre>Last summer, I spent some time doing a detailed analysis of modern 

scripts (the ones that the lion's share of IDNs, other than some vanity 

ones, will be sold in), and limited to the current approved subset 

(Unicode 6.3.0). This survey surfaced a considerable fraction of 

potentially "troublesome" characters; fundamentally not different from 

the one that caused the halt in updating the IDNA2008 tables. This survey 

benefited from input beyond my own personal experience, including input 

and data both from Unicode experts engaged in reviewing Unicode's tables 

of intentionally identical code points as well as expertise and 

data generated as part of ICANN's Root Zone LGR project.

A draft version of the survey was made public as ID 

<a class="moz-txt-link-freetext" href="https://datatracker.ietf.org/doc/draft-freytag-troublesome-characters/">https://datatracker.ietf.org/doc/draft-freytag-troublesome-characters/</a>

The results were discussed ad-hoc with IETF and IAB experts at various 

occasions and by e-mail. These discussions led to some conclusions.

The following points are worth noting:

(1) A small, but significant number of both *existing* code points and 

combining sequences exhibit the same issues as the code point that led 

to the IAB recommendation to halt the update of IDNA2008 tables. This 

problem can be characterized as code points having identical (or near 

identical) appearance with the code point differences not folded by 

normalization. (The problem case found in 2015 only had "near identical" 

appearance.)

(2) Objectively, halting the process did and does nothing about existing 

"troublesome characters". Common to all is the issue that disallowing 

repertoire elements is generally a poor way of mitigating the issues - 

mainly because doing so would arbitrarily favor letters over digits, or 

one language or writing system over another. The exception is that 

disallowing some combining marks intended for more technical purposes 

would prioritize regular text, and that would address a subset of these 

cases. All others require different mitigation approaches well within 

the scope of registration policies.

(3) For existing scripts, the pending additions are not expected to 

significantly add to the existing problem. Compared to the number of 

existing cases, the pending additions are few, and would be addressed 

with the same mitigation approaches. (Pending additions are not 

predominantly identical to existing code points; the vast majority are 

added because they are, in fact, different, isolated exceptions 

notwithstanding).

(4) For new scripts, John's observation that they are mostly not 

modern-use means that problems are inherently limited by several 

factors. Most of these scripts do not represent useful markets, as few 

people can read them. The exceptions are decorative scripts like 

hieroglyphics, which in turn have distinct appearance. For 

non-decorative ones, some may exhibit code points that are accidentally 

close in appearance to code points in other scripts. Whether there would 

be font resources to present code points in these obsolete scripts is 

doubtful; other attack vectors would promise easier success. (User 

agents flagging unusual obsolete scripts would not create many false 

positives.)

(5) The issue of non-normalized identical appearance pales in 

significance compared to other issues with existing code points for 

modern, widely used scripts. For Chinese and Arabic, not implementing 

variants is arguably not state of the art; omitting them allows unregulated 

registration of confusing labels on a large scale. For all South East 

Asian / Indic scripts, not implementing context rules for a good portion 

(about a third) of the characters for each script is not state of the 

art; omitting them would allow labels that can't be interpreted by either 

layout engines or readers, as in these scripts, characters do not exist 

in isolation. For zones implementing multiple scripts, not blocking 

cross-script homoglyphs is arguably also not state of the art; failing to block them 

leads to whole-label confusables that are impossible to detect by 

inspection.

(6) By freezing the update of IDNA2008 tables, IAB effectively declares 

that IDNA2008 is "stuck in the past". This incrementally increases the 

pressure on / temptation for various operators to unilaterally move 

beyond IDNA2008. If such "wild catting" can cloak itself in the moral 

mantle of support for some minority languages, it provides cover for 

those cynically selling emoji labels.

(7) The only way forward is to re-synchronize the updates with the 

current version of Unicode (11.0 is being readied and there will be a 

new version every year). This must be coupled with simultaneously 

addressing the gaps between "PVALID" and the state of the art.

(8) The ability to define blocked variants (code points that can be 

substituted for each other by an unsuspecting user) and to robustly 

exclude later registrations differing only by such a variant from an 

earlier registration can take care of nearly all of the troublesome 

characters, as well as of the problem of cross-script homoglyphs. This 

is currently being implemented in the Root Zone for a growing list of 

modern scripts (eventually 28 of them). Other zones can build on that 

work (for example by integrating digits and hyphen, and optionally by 

extending that analysis to scripts that "look like" some modern ones).

(9) The security implications of not defining context rules for the 

South East Asian/Indic scripts need to be publicized in a way that 

non-specialists can understand the need and know where to look for 

examples (such as the Root Zone LGRs) that successfully implement these. 

(While IDNA2008 provides context rules for some special code points the 

protocol level may not be the most appropriate place for general context 

rules; some rules could be made stricter/looser for zones that are 

limited to specific scripts/languages).

(10) The security implications of emoji are not understood and need to 

be publicized in ways that are accessible to those that might be tempted 

to flout the rules to get a "cute" label. There is enormous public 

interest in emoji and it is spread across users of nearly all modern 

scripts. Given the enormous pressure driving demand for these it is 

frankly surprising how limited the wild catting has been up to now. The 

concern would be that simply pointing to an existing standard 

(especially one that artificially limits itself to the past - that is 

Unicode 6.3.0) may not prove a sufficient counter; there is a chance that 

a clear explanation of why emoji labels cannot be made secure might give 

at least some users pause.

(11) Finally, none of the issues discussed here exist in isolation. The 

next layer of the onion is "similarity" or plain "confusable" labels. In 

an environment where .com and .corn or apple.com and app1e.com may 

legitimately coexist (as far as the protocol is concerned) and where 

that situation is much worse in many scripts, it is simply questionable 

that an action as drastic as anchoring the IDNA2008 protocol at an 

arbitrary and (as we have seen) not problem-free point is justified or 

in proportion.
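
To make point (1) concrete, here is a minimal Python sketch of what "not 
folded by normalization" means. The Latin/Cyrillic pair is chosen purely 
for illustration and is not one of the cases from the survey; the helper 
name same_after_nfc is hypothetical.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Return True if NFC maps both strings to the same code points."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Canonically equivalent: precomposed e-acute vs. "e" + combining acute.
# Normalization folds these, so they cannot yield two distinct labels.
precomposed = "\u00e9"
decomposed = "e\u0301"

# Not canonically equivalent, yet drawn alike in most fonts:
# LATIN SMALL LETTER A vs. CYRILLIC SMALL LETTER A.
# Normalization leaves both untouched, so both survive as distinct labels.
latin_a = "\u0061"
cyrillic_a = "\u0430"

print(same_after_nfc(precomposed, decomposed))  # True
print(same_after_nfc(latin_a, cyrillic_a))      # False
```

The second pair is the kind of case that has to be handled by something 
other than normalization, for example by the variant blocking in (8).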

This reality calls for a layered approach where each layer takes care of 

its slice of the problem:

(a) The protocol takes care of all cases that can be remedied by 

property-based inclusion applicable to all languages, and the occasional 

context rule of global applicability. This is covered by RFC 5891, 5892, etc.

(b) The next layer out consists of some tailoring of the repertoire for 

each zone, but importantly, also implementing the state of the art when 

it comes to blocking identical variants. This would rely on the 

technology around LGRs and RFCs 7940 and 8228.

(c) The outer layer would finally address the issue of the halo of 

"confusables". There's room for new technologies allowing automated 

string similarity scoring and review.
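
The three layers can be sketched roughly as follows. This is a toy 
illustration under stated assumptions, not an implementation of any of 
the RFCs: the category set is a simplified stand-in for the RFC 5892 
derivation, the variant table is a hypothetical three-entry example 
rather than data from any LGR, and difflib's ratio measures edit 
similarity only, not visual similarity. All names (passes_property_check, 
variant_key, Zone, needs_review) are made up for this sketch.

```python
import unicodedata
from difflib import SequenceMatcher

# Layer (a): property-based inclusion (simplified assumption: lowercase,
# other and modifier letters, decimal digits, combining marks, hyphen).
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Nd", "Mn", "Mc"}


def passes_property_check(label: str) -> bool:
    return all(
        ch == "-" or unicodedata.category(ch) in ALLOWED_CATEGORIES
        for ch in label
    )


# Layer (b): blocked variants. Hypothetical table mapping a code point
# to the representative of its variant set.
VARIANT_REPRESENTATIVE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def variant_key(label: str) -> str:
    return "".join(VARIANT_REPRESENTATIVE.get(ch, ch) for ch in label)


class Zone:
    """Toy registry: first come, first served per variant key."""

    def __init__(self) -> None:
        self._by_key = {}

    def register(self, label: str) -> bool:
        if not passes_property_check(label):
            return False  # rejected by layer (a)
        key = variant_key(label)
        if key in self._by_key:
            return False  # blocked variant of an earlier registration
        self._by_key[key] = label
        return True


# Layer (c): similarity review. Returns existing labels whose edit
# similarity to the candidate reaches the (arbitrary) threshold.
def needs_review(candidate, existing, threshold=0.75):
    return [
        e for e in existing
        if SequenceMatcher(None, candidate, e).ratio() >= threshold
    ]


zone = Zone()
print(zone.register("apple"))        # True
print(zone.register("\u0430pple"))   # False: Cyrillic "a" variant
print(zone.register("\U0001F600"))   # False: emoji has category So
print(zone.register("app1e"))        # True: passes layers (a) and (b)
print(needs_review("app1e", ["apple", "example"]))  # ['apple']
```

Note that "app1e" is only caught by the outer layer, which is the point 
of separating the layers. A pair like "com" and "corn" would need a 
score that models glyph shapes; plain edit similarity does not see it.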

A./

On 3/5/2018 3:25 PM, John C Klensin wrote:

>

> --On Monday, March 5, 2018 16:23 -0500 Suzanne Woolf

> <a class="moz-txt-link-rfc2396E" href="mailto:suzworldwide@gmail.com">&lt;suzworldwide@gmail.com&gt;</a> wrote:

>

>> Patrik,

>>

>> That was before my time on the IAB, sorry.

>>

>> But as it happens, I'm looking at an update to the IAB

>> statement, so have been pondering this very question. Do you

>> think that

>> <a class="moz-txt-link-freetext" href="https://www.ietf.org/id/draft-klensin-idna-5892upd-unicode70-05.txt">https://www.ietf.org/id/draft-klensin-idna-5892upd-unicode70-05.txt</a>

>> covers the options reasonably well?

> Suzanne,

>

> Patrik may have a different answer, but, as the author of that

> document, let me provide two opinions:

>

> (1) No.  There is a newer version that covers more of the issues

> and has a better discussion of options, but I have no intention

> of posting it before there is a plan about how the document will

> be processed.   I believe I told the IAB that while Andrew was

> still Chair and told the late and lamented IAB I18N Program that

> while Andrew was still chairing it.   I do have some ideas about

> how it, and some other relevant documents might be processed,

> but, until there is evidence that someone in "the leadership" is

> interested, it has seemed to me that trying to write those

> things down (or, e.g., flying to London to try to discuss the

> issues and possibilities) would be a waste of time

>

> My amusement level is decreasing rapidly about the IAB's going

> off and discussing issues like this internally without making

> any attempt to involve members of the community who have actual

> expertise and a history of doing most of the work.  In recent

> months, I've been reminded by four well-known colleagues (two of

> them former IAB Chairs) that the IAB got itself into a situation

> in the early 1990s in which the community concluded that it had

> gotten out of touch, thought its opinions were sufficient unto

> themselves, and wasn't interested in listening to experts

> in the community who were not on the IAB.  Whether that

> characterization was accurate or not and whatever one thinks of

> the outcomes of the changes that followed, the results clearly

> did not go well for the IAB or much of its then-membership.

>

> (2) As much to Kim as to you... Although the IAB statement was

> based on a limited understanding of only part of the problem,

> the underlying problems, as described in the unposted

> draft-klensin-idna-5892upd-unicode70-06.txt and, to a

> considerable degree, in the expired but available

> draft-klensin-idna-5892upd-unicode70-05.txt, are unchanged.

> Given some of the behavior being exhibited by selected TLD

> registries (I note from recent news that one of them is not only

> selling names that are invalid under IDNA2008 (and that would

> remain invalid if the tables were updated to Unicode 10.0

> without changes being made to the Standard itself) but that

> their record-keeping systems are weak enough that they managed to

> sell the same invalid strings multiple times and had to withdraw

> the sales and that they are apparently also running a futures

> market in such names using graphemes that are expected to be

> assigned as code points in Unicode 11 and 12).

>

> There is another issue with Unicode 7 onward as far as domain

> names are concerned.  Most, if not all, of the scripts that are

> used for contemporary, widely-used, languages that are

> well-represented on the Internet were coded long before now.  As

> a result, most of the characters added in Unicode 7 and later

> fall into one of three categories:

>

> (i) Scripts that are less well-known in the Internet community

> and which therefore carry even greater risk of "surprises"

> (including but certainly not limited to, confusion with other

> code points, issues with combining characters and what

> normalization does or does not do, and special rendering issues)

> if used in domain names.   Andrew's comment about this being

> hard applies much more strongly to such scripts than to

> better-known ones.

>

> (ii) Letter- or digit-like characters added to existing scripts.

> These potentially raise all of the issues listed above but, if

> added to replace or supplement characters or character sequences

> that are in use already, they may raise issues similar to the

> relationship between Traditional and Simplified Chinese.

>

> (iii) Characters (or, more precisely code points) that would not

> qualify for use in domain name labels because of the properties

> Unicode assigned to them.  That list includes emoji (because

> they are symbols) but is definitely not limited to them.

>

> Now, there are constituencies within ICANN that would like all

> of those characters added and to start selling them.  But it

> would be irresponsible (at best) on the IETF's part (or that of

> ICANN) to encourage such things.

>

>      john

>

> _______________________________________________

> IDNA-UPDATE mailing list

> <a class="moz-txt-link-abbreviated" href="mailto:IDNA-UPDATE@ietf.org">IDNA-UPDATE@ietf.org</a>

> <a class="moz-txt-link-freetext" href="https://www.ietf.org/mailman/listinfo/idna-update">https://www.ietf.org/mailman/listinfo/idna-update</a>

>


</pre>

    </div>

  </body>

</html>