[arabic-vip] ZWNJ (earlier: [vip] Overarching principles used in Devanagari team report)

Siavash Shahshahani shahshah at irnic.ir
Thu Sep 29 10:45:58 UTC 2011


Dear Sarmad,
I am in substantive agreement with your document. There are a couple of
things I wish to call your careful attention to. I have indicated these in
redline in the attached document.
Thank you,
Siavash

On Thu, 29 Sep 2011 03:38:32 +0500, "Sarmad Hussain"
<sarmad.hussain at kics.edu.pk> wrote:
> Dear All,
> 
> Apologies for cluttering your mailbox, but I still seem to have missed a
> few details.  Here is yet another revised version.  Please review this
> version.
> 
> Regards,
> Sarmad
> 
> ------
> 
> 1.	U+200C (ZWNJ): The Zero-Width Non-Joiner (ZWNJ) is a character used in
> Arabic script to allow the final shape of a dual-joining letter to appear
> in the middle of a word, for correct orthographic representation of some
> words in some languages.  For example, in languages like Farsi and Urdu,
> a prefix or a suffix of a word may not join with the root, as in فیض آباد
> instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such forms are also
> used in writing words borrowed from English and other languages.
> 
> The community discussed in great detail both the need to allow ZWNJ and
> the issues that may arise from using it.  The following arguments were
> given in the discussion:
> 
> a.	Arguments in favor of allowing ZWNJ for TLDs
> It is needed to correctly render a label in which a non-final
> dual-joining letter must occur in its final form (making the word appear
> disjoint where it would otherwise be joined).
> i)	Linguistically this can occur in multiple cases:
> a prefix which is disjoint from the root; a root which is disjoint from
> the suffix; an arbitrarily disjoint word due to orthographic convention
> (this often occurs for words borrowed from English and other languages);
> transliteration of abbreviations like UN into at least some Arabic-script
> based languages; one word followed by another word, where a space is not
> allowed to separate the words (e.g. in labels)
> ii)	It is in use by some language communities, e.g. Kurdish.
> iii)	Various language communities are proposing that it be used and
> added to their keyboards for writing IDN labels, similar to adding the ‘@’
> sign for writing email addresses.  For example, the national consensus and
> policy for the .پاکستان IDN ccTLD requires ZWNJ to be added and available
> for use in labels.  It is already available on some other keyboards.
> iv)	It is required to render brand names properly, e.g. پیپسی کولا
> (“Pepsi Cola”)
> 
> The effect of the ZWNJ is visible in all these cases, due to the
> significant change in letter shaping, and the character is needed for many
> languages, including Farsi, Urdu, Kurdish, etc.
> 
> b.	Arguments in favor of not allowing ZWNJ for TLDs
> i)	Labels need not capture linguistic conventions and may be treated as
> strings rather than linguistic words, which makes word-based arguments
> redundant
> ii)	The character itself is not visible, even though it changes the shape
> of the surrounding letters, and thus may be a security risk
> iii)	Its general category is not one of { Ll, Lo, Lm, Mn }, as required
> by the gTLD Applicant Guidebook (v 2011-09-19, Module 2, page 2-13,
> Section 2.2.1.3.2, Part II, Item 2.1.3.)
> iv)	It is not generally familiar to or in use by end users, so users
> may type <space> in place of the character, which will not resolve (or
> will alternatively resolve to a different result if the sub-string before
> the ZWNJ is registered as a separate TLD)
> v)	It is not available on most keyboards
> vi)	Root policy should be more conservative than policy for labels at
> other levels
> 
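The general-category argument in b.iii can be checked mechanically.  The sketch below is only an illustration, not part of the draft: it uses Python's standard unicodedata module, and the helper name disallowed_code_points is hypothetical.  It reports every code point in a candidate label whose general category falls outside { Ll, Lo, Lm, Mn }.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Categories named in the Applicant Guidebook requirement quoted above.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn"}


def disallowed_code_points(label):
    """Return (code point, category) pairs that fall outside the allowed set.

    Hypothetical helper for illustration only; it checks the category rule
    and nothing else (no IDNA 2008 contextual rules, no script checks).
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in label
        if unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]


# "Faiz" + ZWNJ + "Abad", written with explicit escapes so the ZWNJ is visible
label = "\u0641\u06cc\u0636" + ZWNJ + "\u0622\u0628\u0627\u062f"
print(disallowed_code_points(label))
```

The Arabic letters in the label are category Lo, while ZWNJ is a format character (Cf), so it is the only code point reported.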
> 
> c.	Possible solutions
> The solutions can range from very conservative to very liberal.  The right
> balance is needed, allowing maximal variety in the Label Generation Policy
> without compromising security.  The community needs to discuss this
> issue in greater detail to finalize the solution.  Possible options
> include (from liberal to conservative):
> 
> i)	ZWNJ is allowed; however, the string with it is considered a variant
> of the string without it.  This addresses KB, confusability and security
> issues (while giving users choice and flexibility based on their
> language)
> 
> ii)	ZWNJ is allowed; however, if a label with ZWNJ is allocated, then the
> variant without it must also be allocated
> 
> iii)	ZWNJ is allowed; however, if a label with ZWNJ is allocated, then the
> variant without it must also be allocated.  Additionally, the label with
> ZWNJ cannot be a fundamental label, but can only be a variant
> 
> iv)	ZWNJ is not allowed in a TLD label at this time
> 
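As a rough illustration of options i and ii above, the variant relation can be modelled by deleting ZWNJ from a label.  This is a minimal sketch under that assumption only; the function names are hypothetical and it does not represent any registry's actual variant procedure.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def without_zwnj(label):
    """Return the label with every ZWNJ removed."""
    return label.replace(ZWNJ, "")


def are_zwnj_variants(a, b):
    """Option i (sketch): two distinct labels are variants of each other
    if they differ only in the presence or position of ZWNJ."""
    return a != b and without_zwnj(a) == without_zwnj(b)


def labels_to_allocate(label):
    """Option ii (sketch): a label with ZWNJ is allocated together with
    its ZWNJ-free variant.  A label without ZWNJ yields just itself."""
    return {label, without_zwnj(label)}


joined = "\u0641\u06cc\u0636\u0622\u0628\u0627\u062f"             # no ZWNJ
disjoint = "\u0641\u06cc\u0636" + ZWNJ + "\u0622\u0628\u0627\u062f"  # with ZWNJ

print(are_zwnj_variants(joined, disjoint))
print(len(labels_to_allocate(disjoint)))
```

Under option iii the same relation would apply, with the extra condition that the ZWNJ-free string is the one treated as the fundamental label.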
> 
> d.	Additional restrictions
> 
> Though there is a defined rule which allows ZWNJ only in contexts where
> its effect is visible, there are a few contexts in which ZWNJ may still
> not have a visible impact.
> i)	This includes the characters U+0637, U+0638 and U+069F.  This is
> indicated by the two sequences, one with and one without the ZWNJ:  طب ط‌ب.
> The ZWNJ should not be permitted following these three characters, in
> addition to the constraint already put on its use by the IDNA 2008
> protocol (see RFC 5892)
> ii)	Similarly, ZWNJ should not be allowed after the Heh Group (see Appendix
> A.1), as the shape change due to its occurrence is not visible
> 
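The restriction in d.i could be expressed as a simple check, sketched below.  The function name and the blocked-set parameter are hypothetical; the Heh Group of d.ii is not enumerated in this message, so it is left out and would have to be supplied by the caller.  The sketch does not implement the IDNA 2008 contextual rule itself, only the additional restriction.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# The three characters named in d.i (U+0637, U+0638, U+069F).
NO_ZWNJ_AFTER = {"\u0637", "\u0638", "\u069f"}


def zwnj_violations(label, blocked=NO_ZWNJ_AFTER):
    """Return the indexes of ZWNJ occurrences that directly follow a
    character in `blocked`.  Pass a larger set to cover further groups."""
    return [
        i
        for i in range(1, len(label))
        if label[i] == ZWNJ and label[i - 1] in blocked
    ]


with_zwnj = "\u0637" + ZWNJ + "\u0628"   # the second sequence in d.i
without = "\u0637\u0628"                  # the first sequence in d.i

print(zwnj_violations(with_zwnj))
print(zwnj_violations(without))
```

A label would pass this additional restriction when the returned list is empty.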
> 
> 
> 
> 
> 
> 
> 
>>>-----Original Message-----
>>>From: Sarmad Hussain [mailto:sarmad.hussain at kics.edu.pk]
>>>Sent: Thursday, September 29, 2011 3:13 AM
>>>To: 'arabic-vip at icann.org'
>>>Subject: RE: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>>used in Devanagari team report)
>>>
>>>Slightly revised:
>>>
>>>----------------
>>>
>>>1.	200C (ZWNJ): Zero-Width-Non-Joiner (ZWNJ) is character used in
>>>Arabic script to allow for final shape of a dual-joining letter to
>>>appear in the middle of a word for correct orthographic representation
>>>of some words in some languages, e.g. when, in languages like Farsi and
>>>Urdu, the prefix or a suffix of a word does not join with the root as
>>>in فیض آباد instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such
>>>forms are also used in writing borrowed words from English and other
>>>languages.
>>>
>>>The community discussed both the need to allow ZWNJ and issues that may
>>>arise by using it in great detail.  Following arguments were given in
>>>the discussion:
>>>
>>>a.	Arguments in favor of allowing ZWNJ for TLDs
>>>It is needed to correctly render label which needs a non-final dual-
>>>joining letter to occur in the final form (making the word to appear
>>>disjoint, where it would be joined otherwise).
>>>i)	Linguistically this can occur in multiple cases:
>>>a prefix which is disjoint with the root; a root which is disjoint with
>>>the suffix; arbitrary disjoint word due to orthographic convention
>>>(many times this occurs for borrowed words from English and other
>>>languages); one word followed by another word, where space is not
>>>allowed to separate these words (e.g. in labels)
>>>ii)	It is in use by some language communities, e.g. Kurdish .
>>>iii)	It is being proposed by various language communities to be used
>>>and added to their keyboards to write IDN labels, similar to adding ‘@’
>>>sign to write email addresses.  For e.g. national consensus and policy
>>>for .پاکستان IDN ccTLD requires ZWNJ to be added and available for use
>>>in the labels.
>>>iv)	It is required to render the brand names properly
>>>
>>>The ZWNJ is visible in all these cases due to the significant change in
>>>letter shaping and is needed for many languages, including Farsi, Urdu,
>>>Kurdish, etc.
>>>
>>>b.	Arguments in favor of not allowing ZWNJ for TLDs
>>>i)	The character is not visible, even though it changes the shape of
>>>the letters around and thus may be a security risk
>>>ii)	It is not one with the general category of { Ll, Lo, Lm, Mn }, as
>>>per the requirement defined by the gTLD Applicant Guidebook (v 2011-09-19,
>>>Module 2, page 2-13, Section 2.2.1.3.2, Part II, Item 2.1.3.)
>>>iii)	It is not generally familiar to and in use by end users and thus
>>>users may type <space> to try to type the character which will not
>>>resolve (or alternately resolve to a different result if the sub-string
>>>before the ZWNJ is registered as a separate TLD)
>>>iv)	It is not available on most keyboards
>>>
>>>
>>>c.	Possible solutions
>>>The solutions can be very conservative or very liberal.  The right
>>>balance is needed, where maximal variety in Label Generation Policy is
>>>needed without compromising any security.  The community needs to
>>>discuss this issue in greater detail to finalize the solution.  A
>>>possible set of possibilities include (liberal to conservative):
>>>
>>>i)	ZWNJ is allowed, however the string with it is considered a
>>>variant of the string without it.  This addresses KB, confusability and
>>>security issues (but gives the users the choice and flexibility based
>>>on their language)
>>>
>>>ii)	ZWNJ is allowed, however, if ZWNJ is allocated, then the variant
>>>without it must also be allocated
>>>
>>>iii)	ZWNJ is allowed, however, if ZWNJ is allocated, then the variant
>>>without it must also be allocated.  Additionally, the label with ZWNJ
>>>cannot be a fundamental label, but can only be a variant
>>>
>>>iv)	ZWNJ is not allowed in a TLD label at this time
>>>
>>>
>>>
>>>
>>>>>-----Original Message-----
>>>>>From: Sarmad Hussain [mailto:sarmad.hussain at kics.edu.pk]
>>>>>Sent: Thursday, September 29, 2011 3:04 AM
>>>>>To: 'arabic-vip at icann.org'
>>>>>Subject: RE: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>>>>used in Devanagari team report)
>>>>>
>>>>>Dear All,
>>>>>
>>>>>Here is the draft of the text I am putting in the revised version.
>>>>>Again, I am aiming to capture the discussion, without suggesting a
>>>>>solutions.  However, the intention is to capture all the knowledge
>>>>>generated in our discussions for future reference (in the "solution"
>>>>>phase of the IDN VIP).  Please let me know if this is appropriate and
>>>>>feel free to suggest any changes.
>>>>>
>>>>>Regards,
>>>>>Sarmad
>>>>>
>>>>>=================
>>>>>
>>>>>
>>>>>1.	200C (ZWNJ): Zero-Width-Non-Joiner (ZWNJ) is character used in
>>>>>Arabic script to allow for final shape of a dual-joining letter to
>>>>>appear in the middle of a word for correct orthographic representation
>>>>>of some words in some languages, e.g. when, in languages like Farsi and
>>>>>Urdu, the prefix or a suffix of a word does not join with the root as
>>>>>in فیض آباد instead of فیضآباد (“Faiz-Abad” in Urdu and Farsi).  Such
>>>>>forms are also used in writing borrowed words from English and other
>>>>>languages.
>>>>>
>>>>>The community discussed both the need to allow ZWNJ and issues that may
>>>>>arise by using it in great detail.  Following arguments were given in
>>>>>the discussion:
>>>>>
>>>>>a.	Arguments in favor of allowing ZWNJ for TLDs
>>>>>It is needed to correctly render label which needs a non-final dual-
>>>>>joining letter to occur in the final form (making the word to appear
>>>>>disjoint, where it would be joined otherwise).
>>>>>i)	Linguistically this can occur in multiple cases:
>>>>>a prefix which is disjoint with the root; a root which is disjoint with
>>>>>the suffix; arbitrary disjoint word due to orthographic convention
>>>>>(many times this occurs for borrowed words from English and other
>>>>>languages); one word followed by another word, where space is not
>>>>>allowed to separate these words (e.g. in labels)
>>>>>ii)	It is in use by some language communities, e.g. Kurdish .
>>>>>iii)	It is being proposed by various language communities to be used
>>>>>and added to their keyboards to write IDN labels, similar to adding ‘@’
>>>>>sign to write email addresses.  For e.g. national consensus and policy
>>>>>for .پاکستان IDN ccTLD requires ZWNJ to be added and available for use
>>>>>in the labels.
>>>>>iv)	It is required to render the brand names properly
>>>>>
>>>>>The ZWNJ is visible in all these cases due to the significant change in
>>>>>letter shaping and is needed for many languages, including Farsi, Urdu,
>>>>>Kurdish, etc.
>>>>>
>>>>>b.	Arguments in favor of not allowing ZWNJ for TLDs
>>>>>i)	The character is not visible, even though it changes the shape of
>>>>>the letters around and thus may be a security risk
>>>>>ii)	It is not one with the general category of { Ll, Lo, Lm, Mn }, as
>>>>>per the requirement defined by the gTLD Applicant Guidebook (v 2011-09-19,
>>>>>Module 2, page 2-13, Section 2.2.1.3.2, Part II, Item 2.1.3.)
>>>>>iii)	It is not generally familiar to and in use by end users
>>>>>iv)	It is not available on most keyboards
>>>>>
>>>>>
>>>>>c.	Possible solutions
>>>>>The solutions can be very conservative or very liberal.  The right
>>>>>balance is needed, where maximal variety in Label Generation Policy is
>>>>>needed without compromising any security.  The community needs to
>>>>>discuss this issue in greater detail to finalize.  A possible set of
>>>>>possibilities include:
>>>>>
>>>>>i)	ZWNJ is allowed, however the string with it is considered a
>>>>>variant of the string without it.  This addresses KB, confusability and
>>>>>security issues (but gives the users the choice and flexibility based
>>>>>on their language)
>>>>>
>>>>>ii)	If we want to be more conservative, we can suggest that if ZWNJ
>>>>>is allocated, then the variant without it must also be allocated
>>>>>
>>>>>iii)	If we want to be even more conservative, we can also suggest that
>>>>>the label with ZWNJ cannot be a fundamental label, but can be a variant
>>>>>
>>>>>iv)	Finally, the most conservative formulation is not to allow ZWNJ
>>>>>even in the variant at this time
>>>>>
>>>>>d.	Additional restrictions
>>>>>
>>>>>Though there is a defined rule which allows ZWNJ only in contexts where
>>>>>its effect is visible, there are few contexts which ZWNJ may still not
>>>>>have a visible impact.
>>>>>i)	This includes characters U+0637, U+0638 and U+069F.  This is
>>>>>indicated by the two sequences, one with and one without the ZWNJ: طب
>>>>>ط‌ب.  The ZWNJ should not be permitted following these three characters,
>>>>>in addition to the constraint already put on its use by the IDNA 2008
>>>>>protocol (see RFC 5892)
>>>>>ii)	Similarly, ZWNJ should not be allowed after Heh Group (see
>>>>>Appendix A.1) as the shape change due to its occurrence is not visible
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>-----Original Message-----
>>>>>>>From: arabic-vip-bounces at icann.org
>>>>>>>[mailto:arabic-vip-bounces at icann.org] On Behalf Of Andrew Sullivan
>>>>>>>Sent: Wednesday, September 28, 2011 9:58 PM
>>>>>>>To: arabic-vip at icann.org
>>>>>>>Subject: Re: [arabic-vip] ZWNJ (earlier: [vip] Overarching principles
>>>>>>>used in Devanagari team report)
>>>>>>>
>>>>>>>On Wed, Sep 28, 2011 at 09:45:49PM +0500, Sarmad Hussain wrote:
>>>>>>>> Dear Andrew and All,
>>>>>>>>
>>>>>>>>
>>>>>>>> >>The difficulty is that the current ICANN policy for accepting a
>>>>>>>> >>label into the root (and the proposal in the Internet Draft as well)
>>>>>>>> >>requires that all code points in a candidate label have general
>>>>>>>> >>category of (one of) { Ll, Lo, Lm, Mn }.  Since ZWNJ is not one of
>>>>>>>> >>those, it would not be permitted.  So if this really is a
>>>>>>>> >>requirement, it needs to be stated quite strongly.
>>>>>>>>
>>>>>>>>
>>>>>>>> Could somebody email a reference to this ICANN Policy, for purposes
>>>>>>>> of documentation.
>>>>>>>
>>>>>>>It's in the _gTLD Applicant Guidebook_, v 2011-09-19, Module 2, p2-13,
>>>>>>>section 2.2.1.3.2, Part II, item 2.1.3.
>>>>>>>
>>>>>>>Best,
>>>>>>>
>>>>>>>A
>>>>>>>
>>>>>>>--
>>>>>>>Andrew Sullivan
>>>>>>>ajs at anvilwalrusden.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sarmad on ZWNJ.docx
Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size: 18949 bytes
Desc: not available
Url : http://mm.icann.org/pipermail/arabic-vip/attachments/20110929/f86f6bb5/SarmadonZWNJ-0001.docx 

