[Idngwg] Minutes of meeting and AIs from 1 Dec.

Tue Dec 6 15:31:41 UTC 2016

Dear All,

Thanks to good comments from Sarmad, I have an updated version of my text. It feels quite long, but it could serve as a basis for discussion on Thursday.

<AI2-rev2>

An analysis of homoglyphs and confusable variants MUST be executed on the IDN table or tables (LGR or text table) for a specific TLD (or DNS zone). The analysis MUST be done both within each IDN table and between all IDN tables for the TLD (or zone).

For the root zone, extensive analysis of homoglyphs and confusable variants is being done within the scope of that work, and that analysis can be valuable input to the analysis for TLDs and other zones. The work for the root zone is ongoing, and at the time of writing, the analysis still does not cover all Unicode scripts and languages. It should also be noted that the character set for the root zone has been limited to letters and characters equivalent to letters. For example, digits and punctuation are not permitted in the root zone and are therefore excluded from that analysis. This means that the analysis for most other zones must go beyond what has been done for the root zone.

There are several cases to consider.

The first case is homoglyphs between different Unicode scripts. Well-known examples are found in the Armenian, Cyrillic, Greek, and Latin scripts, but homoglyphs are also found between several other Unicode scripts. Usually different Unicode scripts are found in different IDN tables, and in most cases it is not permissible to mix different Unicode scripts (except Common code points) in the same domain name label. There are exceptions, e.g. Chinese labels that can mix Han and Latin code points. So for this case the analysis must usually be done between different IDN tables.

The second case is homoglyphs within the same Unicode script. The analysis may still have to include several IDN tables if different tables cover different languages.

The third case is within-script variants (beyond within-script homoglyphs). One example is U+0643 and U+06AA, which Arabic language speakers consider to be calligraphic variations, whereas Sindhi speakers consider them two different characters. Again, these conflicting code points could be found in the same or in different IDN tables.

If homoglyphs are found, harmonization MUST be performed. The goal of the harmonization is to achieve a system that, as far as possible, prevents two domain names under the same TLD (domain) that are homoglyphs of each other from being registered by different registrants. This is to reach a workable and secure system.

References:

"Homoglyph", <https://en.wikipedia.org/wiki/Homoglyph>
"Unicode Security Mechanisms", Technical Standard #39, http://unicode.org/reports/tr39/
"intentional.txt" (see TS#39), ftp://ftp.unicode.org/Public/security/revision-02/intentional.txt
"confusables.txt", (see TS#39), ftp://ftp.unicode.org/Public/security/revision-02/confusables.txt
"Internationalized Domain Names Registration and Administration Guidelines for European Languages Using Cyrillic", appendix A, https://tools.ietf.org/html/rfc5992
"Proposals for Root Zone Label Generation Ruleset (LGR)", https://www.icann.org/resources/pages/lgr-proposals-2015-12-01-en

</AI2-rev2>
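
The analysis in the draft above (within each IDN table and between all IDN tables of a zone) could be sketched roughly as follows. This is only an illustration: the confusable pairs, the table names, and the function `find_conflicts` are assumptions made for the example, and a real analysis would start from reviewed data rather than a hand-written list.

```python
# Hypothetical sample of confusable pairs (illustrative only).
CONFUSABLE_PAIRS = [
    (0x0061, 0x0430),  # LATIN SMALL LETTER A / CYRILLIC SMALL LETTER A
    (0x006F, 0x03BF),  # LATIN SMALL LETTER O / GREEK SMALL LETTER OMICRON
    (0x0643, 0x06AA),  # ARABIC LETTER KAF / ARABIC LETTER SWASH KAF
]


def find_conflicts(tables):
    """Report every confusable pair whose two code points are both permitted
    in the zone, either within one IDN table or between two IDN tables.

    `tables` maps a table name to the set of code points it permits.
    Returns a sorted list of (scope, table_a, cp_a, table_b, cp_b).
    """
    conflicts = []
    for cp_a, cp_b in CONFUSABLE_PAIRS:
        for name_a, repertoire_a in tables.items():
            if cp_a not in repertoire_a:
                continue
            for name_b, repertoire_b in tables.items():
                if cp_b in repertoire_b:
                    scope = "within" if name_a == name_b else "between"
                    conflicts.append((scope, name_a, cp_a, name_b, cp_b))
    return sorted(conflicts)


# Hypothetical IDN tables for one zone.
tables = {
    "latin": {0x0061, 0x0062, 0x006F},
    "cyrillic": {0x0430, 0x0431},
    "arabic-script": {0x0643, 0x06AA},
}

for scope, name_a, cp_a, name_b, cp_b in find_conflicts(tables):
    print(f"{scope}: U+{cp_a:04X} ({name_a}) ~ U+{cp_b:04X} ({name_b})")
```

Each reported conflict is a candidate for harmonization; the sketch only finds the conflicts and does not decide how they are resolved.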

Mats

---
Mats Dufberg
DNS Specialist, IIS
Mobile: +46 73 065 3899
https://www.iis.se/en/

From: Sarmad Hussain <sarmad.hussain at icann.org>
Date: Monday 5 December 2016 at 19:19
To: Mats Dufberg <mats.dufberg at iis.se>, idngwg <idngwg at icann.org>
Subject: RE: [Idngwg] Minutes of meeting and AIs from 1 Dec.

Thanks Mats.

Kindly note a couple of comments:

> An analysis of homoglyphs (*) MUST be executed on the IDN table or tables (LGR or text table) for a specific TLD (or DNS zone). The analysis MUST be done both within each IDN table and between all IDN tables for the TLD (or zone). Homoglyphs can be within a single Unicode script or between different Unicode scripts. The analysis done for the root zone can be a source to discover possible homoglyphs, but it should be noted that the character set has been limited to letters and characters equivalent to letters. E.g. digits and punctuation are not permitted in the root zone and therefore excluded from that analysis. Well-known homoglyphs in different Unicode scripts are found in Armenian, Cyrillic, Greek, and Latin scripts.

There may be two different cases here that we should consider: (i) cross-script homoglyphs, and (ii) within-script variants (beyond within-script homoglyphs).  The example of کتاب presented earlier does indeed only point to within-script homoglyphs (for the Persian and Arabic languages) in the Arabic script.  However, extending the example further, the Arabic script community also considers ڪتاب a variant, because Arabic language speakers consider U+0643 and U+06AA to be calligraphic variations, whereas Sindhi <http://www.omniglot.com/writing/sindhi.htm> speakers consider these two different characters.  Such cases should be handled in harmonization as well, because if it is possible to register them separately (due to Sindhi), Arabic language users could face (almost) homoglyphic-level confusion, which harmonization is seeking to prevent.

> If homoglyphs are found, harmonization MUST be performed. The goal of the harmonization is to achieve a system where it is not possible to register two domain names, under the same TLD (domain), that are homographs of each other. This is to reach a workable and secure system.

Please note that in the case of (ii) above, it may be possible to register the two domain labels and even make them active.  The condition is that it should not be possible for two different registrants to register such labels.  This is already captured by an earlier recommendation we have written, which prevents registration of variants to different registrants, though we can reiterate it if needed.
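
The condition above (variant labels may coexist, but only under one registrant) could be checked at registration time along the following lines. This is a sketch under assumptions: the mapping `VARIANT_REPRESENTATIVE`, the function `index_key`, and the class `Zone` are hypothetical names for the example, and the single mapping from U+06AA to U+0643 stands in for a full variant table.

```python
# Hypothetical variant data: each code point maps to the representative of
# its variant set. Only the U+0643 / U+06AA case from the text is included.
VARIANT_REPRESENTATIVE = {0x06AA: 0x0643}


def index_key(label):
    """Fold every code point onto its variant-set representative, so that
    all variant labels of each other share the same key."""
    return "".join(
        chr(VARIANT_REPRESENTATIVE.get(ord(ch), ord(ch))) for ch in label
    )


class Zone:
    """Toy registry that allows variant labels only for a single registrant."""

    def __init__(self):
        self._owner_by_key = {}  # index key -> registrant holding the variant set
        self.labels = {}         # registered label -> registrant

    def register(self, label, registrant):
        key = index_key(label)
        owner = self._owner_by_key.get(key)
        if owner is not None and owner != registrant:
            return False  # a variant is already held by someone else
        self._owner_by_key[key] = registrant
        self.labels[label] = registrant
        return True


zone = Zone()
kaf_label = "\u0643\u062A\u0627\u0628"    # label spelled with U+0643
swash_label = "\u06AA\u062A\u0627\u0628"  # same label spelled with U+06AA

print(zone.register(kaf_label, "registrant-1"))    # first registration succeeds
print(zone.register(swash_label, "registrant-2"))  # refused: different registrant
print(zone.register(swash_label, "registrant-1"))  # allowed: same registrant
```

Whether the second label is also activated in the DNS is a separate policy decision that this sketch does not model.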

Regards,
Sarmad

From: Mats Dufberg [mailto:mats.dufberg at iis.se]
Sent: Monday, December 05, 2016 10:19 PM
To: Sarmad Hussain <sarmad.hussain at icann.org>; idngwg at icann.org
Subject: [EXTERNAL] Re: [Idngwg] Minutes of meeting and AIs from 1 Dec.

Dear all,

I will look through Sarmad's report below, and it might affect my text, but since we have a meeting soon I am sending it to you directly; I might update it later.

<AI2>
An analysis of homoglyphs (*) MUST be executed on the IDN table or tables (LGR or text table) for a specific TLD (or DNS zone). The analysis MUST be done both within each IDN table and between all IDN tables for the TLD (or zone). Homoglyphs can be within a single Unicode script or between different Unicode scripts. The analysis done for the root zone can be a source to discover possible homoglyphs, but it should be noted that the character set has been limited to letters and characters equivalent to letters. E.g. digits and punctuation are not permitted in the root zone and therefore excluded from that analysis. Well-known homoglyphs in different Unicode scripts are found in Armenian, Cyrillic, Greek, and Latin scripts.

If homoglyphs are found, harmonization MUST be performed. The goal of the harmonization is to achieve a system where it is not possible to register two domain names, under the same TLD (domain), that are homographs of each other. This is to reach a workable and secure system.

There are different ways to handle the possible homoglyphs, and the decision of which way to go is up to the registry of the actual TLD (domain). One possibility is to exclude code points so that there are no homographs. If code points are homoglyphs only when one of them is in a certain position of the label or appears together with some other code point, then contextual rules can be used to prohibit such positions. The third technique is to use blocking variant rules. If the homoglyphs are from different IDN tables, then the variant rules must operate on all the IDN tables.

*) "In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (<https://en.wikipedia.org/wiki/Homoglyph>)

**) Reference at Unicode Consortium
</AI2>
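
The blocking variant technique mentioned in the draft above could be sketched as follows: for a requested label, every label obtained by swapping code points for their variants is computed and blocked. The variant sets and the names `variants_of` and `blocked_variants` are assumptions for the example, not data from any published IDN table.

```python
from itertools import product

# Hypothetical variant sets that span several IDN tables.
VARIANT_SETS = [
    {"a", "\u0430"},            # Latin a, Cyrillic a
    {"o", "\u043E", "\u03BF"},  # Latin o, Cyrillic o, Greek omicron
]


def variants_of(ch):
    """Return the variant set containing `ch`, or just `ch` if it has none."""
    for variant_set in VARIANT_SETS:
        if ch in variant_set:
            return sorted(variant_set)
    return [ch]


def blocked_variants(label):
    """All variant labels of `label`, excluding the label itself.

    These are the labels a blocking rule would withhold from registration
    once `label` has been registered.
    """
    combinations = product(*(variants_of(ch) for ch in label))
    return sorted({"".join(chars) for chars in combinations} - {label})


for variant in blocked_variants("ab"):
    print(" ".join(f"U+{ord(ch):04X}" for ch in variant))
```

Because the variant sets here cross script boundaries, the rule operates on all tables at once, as the draft requires when the homoglyphs come from different IDN tables. The number of variant labels grows multiplicatively with label length, so a real implementation would more likely compare index keys than enumerate every variant.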

---
Mats Dufberg
DNS Specialist, IIS
Mobile: +46 73 065 3899
https://www.iis.se/en/

From: <idngwg-bounces at icann.org> on behalf of Sarmad Hussain <sarmad.hussain at icann.org>
Date: Sunday 4 December 2016 at 03:19
To: idngwg <idngwg at icann.org>
Subject: Re: [Idngwg] Minutes of meeting and AIs from 1 Dec.

Dear All,

Regarding AI1 below, I have inquired about what has been published, and it seems there are some sources, which we need to discuss in case we need to refer to them.  Here are the options:

1.       Some work has come out of Unicode’s Technical Standard 39 <http://unicode.org/reports/tr39/>.  There are two files available here <ftp://ftp.unicode.org/Public/security/revision-02>:

a.       Confusables.txt <ftp://ftp.unicode.org/Public/security/revision-02/confusables.txt> is the larger set, which has the data we need, but much more data, as the definition of confusables is perhaps broader than the strict homoglyphs we may want.

b.       Intentional.txt <ftp://ftp.unicode.org/Public/security/revision-02/intentional.txt> is perhaps the subset which we may be looking for, though the list seems ominously short, and would need a more thorough review (which I am happy to perform after our discussion).

2.       RFC 5992 has data in its appendices, which also lists confusable code points, but is not restricted to homoglyphs.

3.       The Root Zone LGR proposals <https://www.icann.org/resources/pages/lgr-proposals-2015-12-01-en> will also provide a reasonably comprehensive list (though we already discussed that these may still be limited for the second level).  For example, see the list in Section 6 of the Armenian script proposal <https://www.icann.org/en/system/files/files/armenian-lgr-proposal-05nov15-en.pdf> already published for a subset.  The Latin, Greek and Cyrillic GPs are also working on such lists, so we will have a multi-script community confirmation once we have the other proposals.

However, it is interesting to note that this work remains largely limited to Cyrillic, Greek and Latin homoglyphs.  Analysis is needed for other scripts.  We have recently concluded the analysis for the Lao, Khmer and Thai scripts as part of the Root Zone LGR work (the communities have not found significant homoglyph contexts; see e.g. the published Khmer and Lao proposals, with Thai to follow soon).  However, there may be some work needed for Neo-Brahmi scripts.
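
To evaluate the Unicode data files listed under option 1, they first have to be read into a usable form. The sketch below assumes the semicolon-separated layout with `#` comments that these files use (source code points, then target code points, optionally a type field). The function name `parse_confusables` and the sample lines are made up for illustration; the sample is not copied from the published files.

```python
def parse_confusables(text):
    """Parse lines of the form 'SOURCE ; TARGET ; TYPE # comment' into a
    dict mapping the source string to its confusable target string.

    SOURCE and TARGET are space-separated hexadecimal code points. Blank
    lines, comment-only lines and lines with fewer than two fields are
    skipped. The optional TYPE field is ignored.
    """
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


# Illustrative sample in the assumed layout (not the published data).
SAMPLE = """\
# sample data for the sketch
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
03BF ; 006F ; MA # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O

0441 ; 0063      # two-field line without a type
"""

confusables = parse_confusables(SAMPLE)
for source, target in sorted(confusables.items()):
    print(f"U+{ord(source):04X} -> U+{ord(target):04X}")
```

The resulting mapping could then be intersected with the repertoire of each IDN table to see which entries are relevant for a given zone; entries that are broader than strict homoglyphs would still need manual review, as noted above.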

Regards,
Sarmad

From: idngwg-bounces at icann.org [mailto:idngwg-bounces at icann.org] On Behalf Of Sarmad Hussain
Sent: Friday, December 02, 2016 2:00 PM
To: idngwg at icann.org
Subject: [Idngwg] Minutes of meeting and AIs from 1 Dec.

Dear All,

Please find attached a summary of the meeting of the WG on 1 Dec.  Please let me know if there are any changes or suggestions.

The meeting had the following AIs:

S. No. | Action Item | Owner
1 | Find out if there are existing lists of homoglyphs which can be referenced | SH
2 | Divide new recommendation on harmonization of LGRs into three recommendations, explaining harmonization, address cross-script homoglyphic variants, and address within-script variants caused by two different LGRs | MD
3 | Write a new recommendation on how to address existing registrations which are not harmonized, giving flexibility to registries | KF
4 | Re-write the recommendation on automatic activation based on the current input for further discussion | EC

The next meeting is scheduled for 8 Dec. at 11am UTC.

The attached notes of the meeting and the recording of the meeting are available at the IDNGWG wiki page at https://community.icann.org/display/IDN/IDN+Implementation+Guidelines.

Regards,
Sarmad