[Neobrahmigp] FW: [Ext] Queries on Bangla LGR

Sun Sep 29 17:11:00 UTC 2019

Dear NBGP members, 

We have received queries from Mamun Or Rashid regarding the Bangla LGR. In consultation with the IP, we responded as follows. 

This may be useful information, and Mamun Or Rashid has agreed that we share it with all members. 

This is for your information. 

Regards,

Pitinan

From: Mohammad Mamun Or Rashid <mamunbd at juniv.edu>
Date: Saturday, September 28, 2019 at 10:39
To: Pitinan Kooarmornpatana <pitinan.koo at icann.org>
Cc: Sarmad Hussain <sarmad.hussain at icann.org>
Subject: Re: [Ext] Queries on Bangla LGR

Thank you very much. 

On Sat, Sep 28, 2019 at 4:02 PM Pitinan Kooarmornpatana <pitinan.koo at icann.org> wrote:

Dear Mamun Or Rashid, 

Please find below the response, which we prepared in consultation with the IP. 

-------------------

Q 01: What is the process for submitting a proposal to modify or upgrade the IDNA protocol? Also, how is the IDN project related to the IDNA protocol and the IETF?

ICANN uses standards such as IDNA, as defined by the IETF. The ICANN LGR process is required to use the IDNA2008 protocol as defined by the IETF (RFC 5890, RFC 5891, RFC 5893 and RFC 5894). Proposals for modifications to the IDNA protocol should be submitted to the IETF, per their process. However, any change to IDNA2008 will likely take several years; after any new RFCs are published by the IETF, a corresponding change in the ICANN LGR process would be required, which may also take several years.

Q 02: Could you please explain the stability principle of the IDNA protocol? [As IDNA2008 deviates from IDNA2003, we hope it contains an inclusion principle (including the three characters as atomic in the near future or in the next IDNA protocol).]

The stability principle is defined in RFC 6912, Section 4.5. Please also note that normalization to NFC is a fundamental design principle of IDNA, and that normalization is defined by Unicode.

Q 03: What are the criteria for classifying a character as protocol valid (PVALID) or disallowed?

ICANN does not define IDNA; the IETF does. Please consult the IDNA2008 RFCs, which are RFC 5890, RFC 5891, RFC 5893 and RFC 5894. You will find that a key concept in IDNA2008 is that the selection of characters is driven by an algorithm based on Unicode character properties. This process is defined in the IDNA RFCs by the IETF. The resulting set is published by IANA (https://www.iana.org/assignments/idna-tables).
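
As one illustration of a property-driven rule: RFC 5892, the IDNA2008 document on code points, treats a code point as DISALLOWED when it is "Unstable", that is, when it does not survive NFKC normalization and case folding unchanged. The Python sketch below approximates only that single rule using the standard `unicodedata` module. The function name `is_unstable` is our own, the result depends on the Unicode version bundled with the Python interpreter, and the full derivation contains further rules and exceptions that are not shown.

```python
import unicodedata


def is_unstable(ch: str) -> bool:
    """Approximate the 'Unstable' test: cp != NFKC(casefold(NFKC(cp)))."""
    nfkc = unicodedata.normalize("NFKC", ch)
    return unicodedata.normalize("NFKC", nfkc.casefold()) != ch


# Code points taken from the discussion, plus an uppercase ASCII letter,
# which is unstable because case folding changes it.
for cp in (0x09DC, 0x09DD, 0x09DF, 0x0A5C, 0x0B5F, 0x0041):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: unstable={is_unstable(ch)}")
```

A code point that this check reports as stable is not automatically PVALID; it only passes this one step of the derivation.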

Q 04: May we get an explanation of an inconsistency in IDNA2008? [All characters of Indic scripts listed under ‘Additional Characters’ are DISALLOWED in the protocol, but interestingly some of them are PVALID, e.g. Gurmukhi 0A5C and Oriya 0B5F are PVALID, but Bengali 09DF and 09DC are DISALLOWED.]

IDNA2008 in this case simply reflects the definition of NFC: The difference is that Gurmukhi 0A5C and Oriya 0B5F do not have canonical decompositions, while Bengali 09DF and 09DC not only have such decompositions, but the decomposed form is the normalized one under NFC (due to "composition exclusions").
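
The difference can be inspected directly in the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical, and the values reflect the Unicode version shipped with the interpreter.

```python
import unicodedata


def describe(cp: int) -> tuple[str, str]:
    """Return (canonical decomposition, NFC form) as hex strings."""
    ch = chr(cp)
    decomposition = unicodedata.decomposition(ch)  # empty string if none
    nfc = " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFC", ch))
    return decomposition, nfc


for cp in (0x0A5C, 0x0B5F, 0x09DC, 0x09DF):
    decomposition, nfc = describe(cp)
    print(f"{cp:04X}  decomposition: {decomposition or '(none)':<10}  NFC: {nfc}")
```

For the Gurmukhi and Oriya letters the decomposition field is empty and NFC leaves the code point alone; for the two Bengali letters NFC yields the two-code-point sequence.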

Q 05: What are the selection criteria for the MSR? [If there is a reference, please point us to it.]

Please see the MSR-4 Overview and Rationale document (https://www.icann.org/resources/pages/msr-2015-06-21-en), section 2.

Q06: Unicode has assigned two kinds of code position (both atomic and decomposed) to a single character. Why did IDNA select the decomposed one? [ It is assumed that the decomposed one was allocated for making the other Indic scripts unformatted with Devanagari]

IDNA labels are in NFC, therefore the selection depends on that defined in NFC, which generally uses precomposed forms, except for a small set of Composition Exclusions.
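
To make the contrast concrete, the sketch below (Python, standard `unicodedata` module only; the helper `nfc` is just a shorthand introduced here) compares sequences that NFC composes with a Bengali sequence that it leaves decomposed because the precomposed letter is a composition exclusion.

```python
import unicodedata


def nfc(text: str) -> str:
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", text)


# General case: base + combining mark is composed into one code point.
latin = nfc("e\u0301")          # e + COMBINING ACUTE ACCENT
devanagari = nfc("\u0928\u093C")  # NA + NUKTA

# Composition exclusion: YA + NUKTA stays a sequence, and the
# precomposed U+09DF is itself decomposed by NFC.
bengali_seq = nfc("\u09AF\u09BC")
bengali_atomic = nfc("\u09DF")

for label, value in [("latin", latin), ("devanagari", devanagari),
                     ("bengali_seq", bengali_seq),
                     ("bengali_atomic", bengali_atomic)]:
    print(label, " ".join(f"{ord(c):04X}" for c in value))
```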

Q07: Could you please give a maximum timeline with specific milestones for resolving the issue, considering our above-mentioned queries? [The NBGP has strongly recommended completing the task ASAP.]

In order to be included in LGR-4, a Bangla proposal would need to be finalized by the end of CY2019 and go to public review in Q1 of CY2020.

We queried the IP on their opinion on what it would take to complete a Bangla LGR in that timeframe. According to the IP, this would easily be possible, if the LGR included enumerated sequences to specify the Bangla characters in question. Please see the document "Supporting Sequences in an LGR" for details.

The enumerated sequences in the LGR would look like

<char cp="09A1 09BC" comment="NFC form of BENGALI LETTER RRA" />

<char cp="09A2 09BC" comment="NFC form of BENGALI LETTER RHA" />

<char cp="09AF 09BC" comment="NFC form of BENGALI LETTER YYA" />

...

and would cover all consonant/nukta combinations to be allowed. There would not be an entry for 09BC on its own, nor would there be a character class CN defined.
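
The effect of enumerating sequences without a standalone 09BC entry can be sketched as follows. This is a toy model in Python, not an LGR implementation: the repertoire is a hypothetical, heavily reduced subset, the names `SINGLES`, `SEQUENCES` and `label_in_repertoire` are made up for illustration, and a real LGR adds variants and whole-label rules that are ignored here.

```python
# Hypothetical reduced repertoire: a few single code points plus the three
# enumerated sequences shown above. U+09BC (NUKTA) has no entry of its own.
SINGLES = {"\u0995", "\u09A1", "\u09A2", "\u09AF"}  # KA, DDA, DDHA, YA
SEQUENCES = {"\u09A1\u09BC", "\u09A2\u09BC", "\u09AF\u09BC"}


def label_in_repertoire(label: str) -> bool:
    """Check that a label consists only of listed code points or sequences."""
    i = 0
    while i < len(label):
        if label[i:i + 2] in SEQUENCES:   # try the two-code-point sequence first
            i += 2
        elif label[i] in SINGLES:
            i += 1
        else:
            return False                  # e.g. a nukta after an unlisted base
    return True


print(label_in_repertoire("\u0995\u09AF\u09BC"))  # KA, then YA + NUKTA
print(label_in_repertoire("\u0995\u09BC"))        # KA + NUKTA is not enumerated
```

Because the nukta only ever appears inside an enumerated sequence, it cannot attach to any other consonant, which is why no separate character class would be needed.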

The Integration Panel strongly suggests to the GP to look at this method which should allow the NFC representation of all labels required by Bangla to be supported for the root zone. It could result in an LGR in possibly a short timeframe. In contrast, proposing a modification of the IDNA protocol will likely take many years, if it ever succeeds.

An important item to note is that the IDNA protocol as such is very low level and its details are not necessarily visible to the end user. For example, browsers generally do not require that URLs are normalized, or that IDNs are lowercase. Instead, the browser would take the user input, case fold it, normalize it and then present it to the IDNA protocol to do the lookup. The same goes for URLs that are in hyperlinks in documents and web pages: they would remain as typed by the user.

As a result, users would not even be aware that the protocol uses sequences "under the hood" where the users typed 09DF or 09DC.
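
A rough sketch of that preprocessing, in Python, is given below. It assumes a hypothetical helper `to_lookup_label` and uses only the standard library: `str.casefold`, `unicodedata.normalize` and the built-in `punycode` codec. It performs none of the IDNA2008 validity checks that a real client would apply, so it should be read as an illustration of the order of steps only.

```python
import unicodedata


def to_lookup_label(user_input: str) -> str:
    """Case fold, normalize to NFC, then convert to an ASCII-compatible form."""
    u_label = unicodedata.normalize("NFC", user_input.casefold())
    if u_label.isascii():
        return u_label
    # The "xn--" prefix marks a Punycode-encoded label.
    return "xn--" + u_label.encode("punycode").decode("ascii")


typed_atomic = to_lookup_label("\u09DF")           # user typed the atomic YYA
typed_sequence = to_lookup_label("\u09AF\u09BC")   # user typed YA + NUKTA
print(typed_atomic == typed_sequence)
```

Both inputs end up as the same label, so the user who types 09DF never sees that the lookup was done with the sequence.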

Q08: What is the tentative timeline for releasing the next gTLDs along with IDNs?

The timeline is currently unknown and depends on when the work by the community is completed.

----------------

Should you have any further queries, kindly let us know. 

Regards,

Pitinan

From: Mohammad Mamun Or Rashid <mamunbd at juniv.edu>
Date: Tuesday, September 24, 2019 at 04:50
To: Pitinan Kooarmornpatana <pitinan.koo at icann.org>, Sarmad Hussain <sarmad.hussain at icann.org>
Subject: [Ext] Queries on Bangla LGR

Dear Pitinan

I am writing to you on behalf of the Bangladeshi work-group of Bangla LGR. We are delighted to be a part of your large-scale workflow for developing LGRs, and we have spent a good deal of time working together since last year. We would like to thank all members of the NBGP, the IP and ICANN.

As you know, the NBGP has successfully completed 8 LGRs out of 9. Unfortunately, the development of the Bangla LGR slowed down and was paused for a long time due to the intervention of the Bangla speaker community along with the Bangla WG (Bangladesh end). To resolve the issues, we have met several times in F2F and online meetings; as a result, we have settled at least ten minor points of disagreement together to make the document acceptable. Regrettably, the main issue (treating three characters as atomic rather than decomposed/compound) remains unresolved. Recently, we received a strong recommendation from the NBGP to complete the process ASAP. As we believe there is a possibility of solving the problem technically, we would like to raise some queries that are relevant to finding a way to include the characters under discussion.

Q 01: What is the process for submitting a proposal to modify or upgrade the IDNA protocol? Also, how is the IDN project related to the IDNA protocol and the IETF?

Q 02: Could you please explain the stability principle of the IDNA protocol? [As IDNA2008 deviates from IDNA2003, we hope it contains an inclusion principle (including the three characters as atomic in the near future or in the next IDNA protocol).]

Q 03: What are the criteria for classifying a character as protocol valid (PVALID) or disallowed?

Q 04: May we get an explanation of an inconsistency in IDNA2008? [All characters of Indic scripts listed under ‘Additional Characters’ are DISALLOWED in the protocol, but interestingly some of them are PVALID, e.g. Gurmukhi 0A5C and Oriya 0B5F are PVALID, but Bengali 09DF and 09DC are DISALLOWED.]

Q 05: What are the selection criteria for the MSR? [If there is a reference, please point us to it.]

Q06: Unicode has assigned two kinds of code position (both atomic and decomposed) to a single character. Why did IDNA select the decomposed one? [ It is assumed that the decomposed one was allocated for making the other Indic scripts unformatted with Devanagari]

Q07: Could you please give a maximum timeline with specific milestones for resolving the issue, considering our above-mentioned queries? [The NBGP has strongly recommended completing the task ASAP.]

Q08: What is the tentative timeline for releasing the next gTLDs along with IDNs?

It would be highly appreciated if you could respond to these queries. With best wishes.

On behalf of Bangladeshi workgroup

Mamun Or Rashid

Assistant Professor, Jahangirnagar University

Bangla Language Technology Specialist, Bangladesh Computer Council.

Appendix

Inconsistency in IDNA2008 protocol

09DC..09DD  : DISALLOWED  # BENGALI LETTER RRA..BENGALI LETTER RHA
09DF        : DISALLOWED  # BENGALI LETTER YYA
09BC..09C4  : PVALID      # BENGALI SIGN NUKTA..BENGALI VOWEL SIGN VOCAL
0B5F..0B63  : PVALID      # ORIYA LETTER YYA..ORIYA VOWEL SIGN VOCALIC L
0A5C        : PVALID      # GURMUKHI LETTER RRA

The three characters under discussion:
Atomic code points:  09DC RRA, 09DD RHA, 09DF YYA
Decomposed forms:    09A1 + 09BC = 09DC RRA, 09A2 + 09BC = 09DD RHA, 09AF + 09BC = 09DF YYA

4.5.  Stability Principle (https://tools.ietf.org/html/rfc6912#section-4.5)
   Once a code point is permitted, it is at least very hard to stop
   permitting that code point.  In public zones (including the root),
   the list of code points to be permitted should change very slowly, if
   at all, and usually only in the direction of permitting an addition
   as time and experience indicate that inclusion of such a code point
   is both safe and consistent with these principles.

4.2.  Inclusion Principle (https://tools.ietf.org/html/rfc5891)
   Just as IDNA2008 starts from the principle that the Unicode range is
   excluded, and then adds code points according to derived properties
   of the code points, so a public zone should only permit inclusion of
   a code point if it is known to be "safe" in terms of usability and
   confusability within the context of that zone.  The default treatment
   of a code point should be that it is excluded.

Appendix A.  Summary of Major Changes from IDNA2003

   4.   Remove the mapping and normalization steps from the protocol and
        have them, instead, done by the applications themselves,
        possibly in a local fashion, before invoking the protocol.
   5.   Change the way that the protocol specifies which characters are
        allowed in labels from "humans decide what the table of code
        points contains" to "decision about code points are based on
        Unicode properties plus a small exclusion list created by
        humans".
   6.   Introduce the new concept of characters that can be used only in
        specific contexts.
   7.   Allow typical words and names in languages such as Dhivehi and
        Yiddish to be expressed.
   9.   Remove the dot separator from the mandatory part of the
        protocol.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: SupportingSequences-2.docx
Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size: 14991 bytes
Desc: not available
URL: <http://mm.icann.org/pipermail/neobrahmigp/attachments/20190929/cc80517a/SupportingSequences-2-0001.docx>