[arabic-vip] ZWNJ (earlier: [vip] Overarching principles used in Devanagari team report)

Sarmad Hussain sarmad.hussain at kics.edu.pk
Wed Sep 28 22:38:32 UTC 2011


Dear All,

Apologies for cluttering your mailboxes, but I still seem to have missed a few details.  Here is yet another revised version.  Please review this version.

Regards,
Sarmad

------

1.	200C (ZWNJ): The Zero-Width Non-Joiner (ZWNJ, U+200C) is a character used in Arabic script to make the final shape of a dual-joining letter appear in the middle of a word, which some words in some languages require for correct orthographic representation.  For example, in languages like Farsi and Urdu, a prefix or a suffix of a word may not join with the root, as in فیض آباد instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such forms are also used in writing words borrowed from English and other languages.
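As a rough illustration of the point above (a sketch only, not part of the report text), the two spellings differ by a single invisible code point. The label below is assembled from explicit code points so that the ZWNJ cannot get lost in transit; the helper name describe is my own.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Illustrative label only: "Faiz" and "Abad" built from explicit code points.
faiz = "\u0641\u06cc\u0636"        # FEH, FARSI YEH, DAD
abad = "\u0622\u0628\u0627\u062f"  # ALEF WITH MADDA ABOVE, BEH, ALEF, DAL

joined = faiz + abad           # DAD joins the following letter (medial shape)
disjoint = faiz + ZWNJ + abad  # ZWNJ forces DAD into its final shape


def describe(label):
    """List each code point of a label with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]


if __name__ == "__main__":
    for line in describe(disjoint):
        print(line)
```

The two strings render differently but compare as different labels only because of the one extra code point at index 3.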

The community discussed in great detail both the need to allow ZWNJ and the issues that may arise from using it.  The following arguments were given in the discussion:

a.	Arguments in favor of allowing ZWNJ for TLDs
It is needed to correctly render a label in which a dual-joining letter that is not in final position must appear in its final form (making the word appear disjoint where it would otherwise be joined).
i)	Linguistically, this can occur in multiple cases:
a prefix which is disjoint from the root; a root which is disjoint from the suffix; an arbitrary disjoint word due to orthographic convention (this often occurs for words borrowed from English and other languages); transliteration of abbreviations like UN into at least some Arabic-script-based languages; one word followed by another word, where a space is not allowed to separate these words (e.g. in labels).
ii)	It is in use by some language communities, e.g. Kurdish.
iii)	Various language communities are proposing that it be used, and added to their keyboards, to write IDN labels, similar to adding the ‘@’ sign to write email addresses.  For example, the national consensus and policy for the .پاکستان IDN ccTLD requires ZWNJ to be added and available for use in labels.  It is already available on some other keyboards.
iv)	It is required to render brand names properly, e.g. پیپسی کولا (“Pepsi Cola”).

The effect of the ZWNJ is visible in all these cases because of the significant change in letter shaping, and the character is needed for many languages, including Farsi, Urdu and Kurdish.

b.	Arguments in favor of not allowing ZWNJ for TLDs
i)	Labels need not capture linguistic conventions and may be treated as strings rather than linguistic words, which makes the word-based arguments redundant.
ii)	The character itself is not visible, even though it changes the shape of the surrounding letters, and thus may be a security risk.
iii)	Its general category (Cf) is not one of { Ll, Lo, Lm, Mn }, as required by the gTLD Applicant Guidebook (v 2011-09-19, Module 2, page 2-13, Section 2.2.1.3.2, Part II, Item 2.1.3.).
iv)	It is not generally familiar to, or in use by, end users, who may therefore type <space> in an attempt to enter the character; the resulting label will not resolve (or may resolve to a different result if the sub-string before the ZWNJ is registered as a separate TLD).
v)	It is not available on most keyboards.
vi)	Root policy should be more conservative than policy for labels at other levels.
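
The general-category argument in the list above can be checked mechanically. The following is a minimal sketch using Python's standard unicodedata module; the function name disallowed_code_points is my own, and the check covers only the { Ll, Lo, Lm, Mn } requirement, not the rest of the Guidebook's string requirements.

```python
import unicodedata

# General categories permitted in a candidate TLD label, as cited above.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn"}


def disallowed_code_points(label):
    """Return (code point, general category) for each character that fails the check."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


if __name__ == "__main__":
    label_with_zwnj = "\u0641\u06cc\u0636\u200c\u0622\u0628\u0627\u062f"
    print(disallowed_code_points(label_with_zwnj))  # [('U+200C', 'Cf')]
```

A <space> typed in place of the ZWNJ fails the same check (its category is Zs), which is consistent with the concern that such a label would not resolve.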


c.	Possible solutions  
The solutions can range from very conservative to very liberal.  The right balance is needed, allowing maximal variety in the Label Generation Policy without compromising security.  The community needs to discuss this issue in greater detail to finalize the solution.  The possible options include (from liberal to conservative):

i)	ZWNJ is allowed; however, the string with it is considered a variant of the string without it.  This addresses the keyboard (KB), confusability and security issues (while giving users choice and flexibility based on their language).

ii)	ZWNJ is allowed; however, if a label with ZWNJ is allocated, then the variant without it must also be allocated.

iii)	ZWNJ is allowed; however, if a label with ZWNJ is allocated, then the variant without it must also be allocated.  Additionally, the label with ZWNJ cannot be a fundamental label, but can only be a variant.

iv)	ZWNJ is not allowed in a TLD label at this time
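
To make the differences between the four options concrete, here is a rough sketch of how an application might be checked under each one. Everything in it is an assumption made for illustration: the names strip_zwnj and application_ok, the idea of an application as one primary ("fundamental") label plus a list of requested variants, and the reading of "variant" as "identical once ZWNJ is removed". It is not a proposed procedure.

```python
ZWNJ = "\u200c"


def strip_zwnj(label):
    """Form of the label with every ZWNJ removed."""
    return label.replace(ZWNJ, "")


def application_ok(primary, variants, option):
    """Check a primary label and its requested variants against options 'i'..'iv'."""
    labels = [primary, *variants]
    base = strip_zwnj(primary)
    # In this sketch, a variant may differ from the primary only by ZWNJ.
    if any(strip_zwnj(v) != base for v in variants):
        return False
    uses_zwnj = any(ZWNJ in label for label in labels)
    if option == "i":
        # ZWNJ allowed; the forms with and without it are simply variants.
        return True
    if option == "ii":
        # If a ZWNJ form is allocated, the form without it must be too.
        return (not uses_zwnj) or base in labels
    if option == "iii":
        # As (ii), and the ZWNJ form may only be a variant. A primary
        # without ZWNJ is itself the base form, so (ii) then holds.
        return ZWNJ not in primary
    if option == "iv":
        # ZWNJ not allowed in a TLD label at all.
        return not uses_zwnj
    raise ValueError(f"unknown option: {option!r}")
```

For example, a label containing ZWNJ applied for on its own passes under (i) only; applied for together with its ZWNJ-free form it also passes under (ii); and under (iii) it passes only when the ZWNJ-free form is the primary label.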


d.	Additional restrictions

Though there is a defined rule which allows ZWNJ only in contexts where its effect is visible, there are a few contexts in which ZWNJ may still have no visible impact.
i)	These include the characters U+0637, U+0638 and U+069F.  This is illustrated by the following pair of sequences, which differ only in the presence of the ZWNJ:  طب ط‌ب.  The ZWNJ should not be permitted following these three characters, in addition to the constraint already put on its use by the IDNA 2008 protocol (see RFC 5892).
ii)	Similarly, ZWNJ should not be allowed after the Heh Group (see Appendix A.1), as the shape change due to its occurrence is not visible.
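
The additional restriction above could be expressed as a simple check on the character preceding each ZWNJ. This is a sketch under stated assumptions: the function name and the extra_blocked parameter are mine, the members of the Heh Group are not listed in this message and so must be supplied by the caller, and the IDNA 2008 contextual rule itself is not implemented here.

```python
ZWNJ = "\u200c"

# Characters named above after which ZWNJ has no visible effect.
NO_ZWNJ_AFTER = frozenset({"\u0637", "\u0638", "\u069f"})


def zwnj_restriction_violations(label, extra_blocked=frozenset()):
    """Return the index of each ZWNJ that directly follows a blocked character.

    extra_blocked is a hypothetical hook for the Heh Group, whose
    members are defined in the report's appendix and not in this message.
    """
    blocked = NO_ZWNJ_AFTER | frozenset(extra_blocked)
    return [
        i
        for i, ch in enumerate(label)
        if ch == ZWNJ and i > 0 and label[i - 1] in blocked
    ]
```

An empty result means the label passes this particular restriction only; a label would still need to satisfy the IDNA 2008 context rule separately.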


>>-----Original Message-----
>>From: Sarmad Hussain [mailto:sarmad.hussain at kics.edu.pk]
>>Sent: Thursday, September 29, 2011 3:13 AM
>>To: 'arabic-vip at icann.org'
>>Subject: RE: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>used in Devanagari team report)
>>
>>Slightly revised:
>>
>>----------------
>>
>>1.	200C (ZWNJ): Zero-Width-Non-Joiner (ZWNJ) is character used in
>>Arabic script to allow for final shape of a dual-joining letter to
>>appear in the middle of a word for correct orthographic representation
>>of some words in some languages, e.g. when, in languages like Farsi and
>>Urdu, the prefix or a suffix of a word does not join with the root as
>>in فیض آباد instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such forms
>>are also used in writing borrowed words from English and other
>>languages.
>>
>>The community discussed both the need to allow ZWNJ and issues that may
>>arise by using it in great detail.  Following arguments were given in
>>the discussion:
>>
>>a.	Arguments in favor of allowing ZWNJ for TLDs
>>It is needed to correctly render label which needs a non-final dual-
>>joining letter to occur in the final form (making the word to appear
>>disjoint, where it would be joined otherwise).
>>i)	Linguistically this can occur in multiple cases:
>>a prefix which is disjoint with the root; a root which is disjoint with
>>the suffix; arbitrary disjoint word due to orthographic convention
>>(many times this occurs for borrowed words from English and other
>>languages); one word followed by another word, where space is not
>>allowed to separate these words (e.g. in labels)
>>ii)	It is in use by some language communities, e.g. Kurdish .
>>iii)	It is being proposed by various language communities to be used
>>and added to their keyboards to write IDN labels, similar to adding ‘@’
>>sign to write email addresses.  For e.g. national consensus and policy
>>for .پاکستان IDN ccTLD requires ZWNJ to be added and available for use in
>>the labels .
>>iv)	It is required to render the brand names properly
>>
>>The ZWNJ is visible in all these cases due to the significant change in
>>letter shaping and is needed for many languages, including Farsi, Urdu,
>>Kurdish, etc.
>>
>>b.	Arguments in favor of not allowing ZWNJ for TLDs
>>i)	The character is not visible, even though it changes the shape of
>>the letters around and thus may be a security risk
>>ii)	It is not one with the general category of { Ll, Lo, Lm, Mn }, as
>>per the requirement defined by the gTLD Applicant Guidebook (v 2011-09-
>>19, Module 2, page 2-13, Section 2.2.1.3.2, Part II, Item 2.1.3.)
>>iii)	It is not generally familiar to and in use by end users and thus
>>users may type <space> to try to type the character which will not
>>resolve (or alternately resolve to a different result if the sub-string
>>before the ZWNJ is registered as a separate TLD)
>>iv)	It is not available on most keyboards
>>
>>
>>c.	Possible solutions
>>The solutions can be very conservative or very liberal.  The right
>>balance is needed, where maximal variety in Label Generation Policy is
>>needed without compromising any security.  The community needs to
>>discuss this issue in greater detail to finalize the solution.  A
>>possible set of possibilities include (liberal to conservative):
>>
>>i)	ZWNJ is allowed, however the string with it is considered a
>>variant of the string without it.  This addresses KB, confusability and
>>security issues (but gives the users the choice and flexibility based
>>on their language)
>>
>>ii)	ZWNJ is allowed, however, if ZWNJ is allocated, then the variant
>>without it must also be allocated
>>
>>iii)	ZWNJ is allowed, however, if ZWNJ is allocated, then the variant
>>without it must also be allocated.  Additionally, the label with ZWNJ
>>cannot be a fundamental label, but can only be a variant
>>
>>iv)	ZWNJ is not allowed in a TLD label at this time
>>
>>
>>
>>
>>>>-----Original Message-----
>>>>From: Sarmad Hussain [mailto:sarmad.hussain at kics.edu.pk]
>>>>Sent: Thursday, September 29, 2011 3:04 AM
>>>>To: 'arabic-vip at icann.org'
>>>>Subject: RE: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>>>used in Devanagari team report)
>>>>
>>>>Dear All,
>>>>
>>>>Here is the draft of the text I am putting in the revised version.
>>>>Again, I am aiming to capture the discussion, without suggesting a
>>>>solutions.  However, the intention is to capture all the knowledge
>>>>generated in our discussions for future reference (in the "solution"
>>>>phase of the IDN VIP).  Please let me know if this is appropriate and
>>>>feel free to suggest any changes.
>>>>
>>>>Regards,
>>>>Sarmad
>>>>
>>>>=================
>>>>
>>>>
>>>>1.	200C (ZWNJ): Zero-Width-Non-Joiner (ZWNJ) is character used in
>>>>Arabic script to allow for final shape of a dual-joining letter to
>>>>appear in the middle of a word for correct orthographic representation
>>>>of some words in some languages, e.g. when, in languages like Farsi and
>>>>Urdu, the prefix or a suffix of a word does not join with the root as
>>>>in فیض آباد instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such forms
>>>>are also used in writing borrowed words from English and other
>>>>languages.
>>>>
>>>>The community discussed both the need to allow ZWNJ and issues that may
>>>>arise by using it in great detail.  Following arguments were given in
>>>>the discussion:
>>>>
>>>>a.	Arguments in favor of allowing ZWNJ for TLDs
>>>>It is needed to correctly render label which needs a non-final dual-
>>>>joining letter to occur in the final form (making the word to appear
>>>>disjoint, where it would be joined otherwise).
>>>>i)	Linguistically this can occur in multiple cases:
>>>>a prefix which is disjoint with the root; a root which is disjoint with
>>>>the suffix; arbitrary disjoint word due to orthographic convention
>>>>(many times this occurs for borrowed words from English and other
>>>>languages); one word followed by another word, where space is not
>>>>allowed to separate these words (e.g. in labels)
>>>>ii)	It is in use by some language communities, e.g. Kurdish .
>>>>iii)	It is being proposed by various language communities to be used
>>>>and added to their keyboards to write IDN labels, similar to adding ‘@’
>>>>sign to write email addresses.  For e.g. national consensus and policy
>>>>for .پاکستان IDN ccTLD requires ZWNJ to be added and available for use in
>>>>the labels .
>>>>iv)	It is required to render the brand names properly
>>>>
>>>>The ZWNJ is visible in all these cases due to the significant change in
>>>>letter shaping and is needed for many languages, including Farsi, Urdu,
>>>>Kurdish, etc.
>>>>
>>>>b.	Arguments in favor of not allowing ZWNJ for TLDs
>>>>i)	The character is not visible, even though it changes the shape of
>>>>the letters around and thus may be a security risk
>>>>ii)	It is not one with the general category of { Ll, Lo, Lm, Mn }, as
>>>>per the requirement defined by the gTLD Applicant Guidebook (v 2011-09-19,
>>>>Module 2, page 2-13, Section 2.2.1.3.2, Part II, Item 2.1.3.)
>>>>iii)	It is not generally familiar to and in use by end users
>>>>iv)	It is not available on most keyboards
>>>>
>>>>
>>>>c.	Possible solutions
>>>>The solutions can be very conservative or very liberal.  The right
>>>>balance is needed, where maximal variety in Label Generation Policy is
>>>>needed without compromising any security.  The community needs to
>>>>discuss this issue in greater detail to finalize.  A possible set of
>>>>possibilities include:
>>>>
>>>>i)	ZWNJ is allowed, however the string with it is considered a
>>>>variant of the string without it.  This addresses KB, confusability and
>>>>security issues (but gives the users the choice and flexibility based
>>>>on their language)
>>>>
>>>>ii)	If we want to be more conservative, we can suggest that if ZWNJ
>>>>is allocated, then the variant without it must also be allocated
>>>>
>>>>iii)	If we want to be even more conservative, we can also suggest that
>>>>the label with ZWNJ cannot be a fundamental label, but can be a variant
>>>>
>>>>iv)	Finally, the most conservative formulation is not to allow ZWNJ
>>>>even in the variant at this time
>>>>
>>>>d.	Additional restrictions
>>>>
>>>>Though there is a defined rule which allows ZWNJ only in contexts where
>>>>its effect is visible, there are few contexts which ZWNJ may still not
>>>>have a visible impact.
>>>>i)	This includes characters U+0637, U+0638 and U+069F.  This is
>>>>indicated by the two sequences, one with and one without the ZWNJ: طب
>>>>ط‌ب.  The ZWNJ should not be permitted following these three characters,
>>>>in addition to the constraint already put on its use by the IDNA 2008
>>>>protocol (see RFC 5893)
>>>>ii)	Similarly, ZWNJ should not be allowed after Heh Group (see
>>>>Appendix A.1) as the shape change due to its occurrence is not visible
>>>>
>>>>
>>>>
>>>>
>>>>>>-----Original Message-----
>>>>>>From: arabic-vip-bounces at icann.org [mailto:arabic-vip-bounces at icann.org] On Behalf Of Andrew Sullivan
>>>>>>Sent: Wednesday, September 28, 2011 9:58 PM
>>>>>>To: arabic-vip at icann.org
>>>>>>Subject: Re: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>>>>>used in Devanagari team report)
>>>>>>
>>>>>>On Wed, Sep 28, 2011 at 09:45:49PM +0500, Sarmad Hussain wrote:
>>>>>>> Dear Andrew and All,
>>>>>>>
>>>>>>>
>>>>>>> >>The difficulty is that the current ICANN policy for accepting a
>>>>>>> >>label into the root (and the proposal in the Internet Draft as well)
>>>>>>> >>requires that all code points in a candidate label have general
>>>>>>> >>category of (one of) { Ll, Lo, Lm, Mn }.  Since ZWNJ is not one of
>>>>>>> >>those, it would not be permitted.  So if this really is a requirement,
>>>>>>> >>it needs to be stated quite strongly.
>>>>>>>
>>>>>>>
>>>>>>> Could somebody email a reference to this ICANN Policy, for purposes of
>>>>>>> documentation.
>>>>>>
>>>>>>It's in the _gTLD Applicant Guidebook_, v 2011-09-19, Module 2, p2-13,
>>>>>>section 2.2.1.3.2, Part II, item 2.1.3.
>>>>>>
>>>>>>Best,
>>>>>>
>>>>>>A
>>>>>>
>>>>>>--
>>>>>>Andrew Sullivan
>>>>>>ajs at anvilwalrusden.com



