[vip] The "Invisible Separator Characters" Issue

Nicholas Ostler nicholas at ostler.net
Thu Jul 28 16:06:15 UTC 2011


First of all, I apologize if I am using non-standard terminology. What I 
mean specifically is this: Unicode code points that have no visible 
rendering of their own, but may affect the rendering of neighbouring 
code points in the string. Specifically, we have been talking about
U+200D ZWJ zero-width joiner
and
U+200C ZWNJ zero-width non-joiner
both of which are actively used for some languages that use the Arabic 
and the Devanagari scripts. 
http://en.wikipedia.org/wiki/Zero-width_non-joiner shows that they can 
also make a difference in German spelling, and in Hebrew.

Unicode calls these "special characters", which is not very helpful.

People have been talking explicitly about their necessity for Persian 
and some other Indo-Iranian languages written in the Arabic script,
e.g. behnam at esfahbod.info - [arabic-vip] Typographical Complexity of 
Arabic Script - 19/07/2011
and for Nepali, among languages written in Devanagari,
e.g. bkbal at ltk.org.np - Re: [Devanagari-vip] Document for Hindi Language 
based on Policy for ccTLD .bharat - 03/07/2011

By their very nature, these characters, when placed where they have no 
effect on rendering, allow identical renderings to be associated with 
distinct strings of abstract characters. In this they go beyond other 
"combining" code points such as accents and cedillas, which, taken with 
neighbouring code points, also sometimes result in a glyph 
indistinguishable from another code point: ZWJ and ZWNJ can have this 
effect on any string of code points. So U+0301 [´] COMBINING ACUTE 
ACCENT combined with U+0065 [e] LATIN SMALL LETTER E can result in a 
glyph indistinguishable from U+00E9 [é] LATIN SMALL LETTER E WITH ACUTE; 
but ZWNJ or ZWJ placed anywhere in a string (as long as its neighbours 
are not combining characters) will always give a string outwardly 
indistinguishable from one without it.
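
To make the contrast concrete, here is a minimal Python sketch. It 
assumes NFC normalization as the comparison step, which is my own choice 
for illustration and not something the discussion so far has specified; 
the function name is hypothetical. Normalization unifies the accent 
case, but leaves a stray ZWNJ in place:

```python
import unicodedata

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# e + COMBINING ACUTE ACCENT versus precomposed e-acute:
# different code point sequences, but NFC maps them to the same string.
accent_pair_equal = same_after_nfc("e\u0301", "\u00e9")

# A Latin label with and without a ZWNJ in the middle:
# NFC leaves the ZWNJ in place, so the two strings stay distinct
# even though they would normally render identically.
joiner_pair_equal = same_after_nfc("label", "la" + ZWNJ + "bel")
```

So the accent case can be handled by normalization alone, whereas the 
joiner case needs an explicit policy.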

Seeing the danger of their use in URL identifiers (essentially an open 
invitation to spoofing), the Indian approach has been simply to outlaw 
them:
akshatj at cdac.in - Document for Hindi Language based on Policy for ccTLD 
.bharat - 01/07/2011
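
A blanket ban of this kind is simple to check mechanically. A minimal 
Python sketch follows; the function name is hypothetical and nothing 
here is taken from the .bharat policy document itself:

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # U+200D ZERO WIDTH JOINER


def violates_blanket_ban(label: str) -> bool:
    """Return True if the label contains ZWJ or ZWNJ anywhere."""
    return ZWNJ in label or ZWJ in label
```

The cost is that labels which genuinely need a joiner to be spelled 
correctly are rejected along with the spoofing attempts.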

However, a less radical, but more complicated, approach, which allows 
ZWJ and ZWNJ to be used where necessary, is laid out by Unicode at
http://unicode.org/review/pr-96.html
Public Review Issue #96 - Allowing Special Characters in Identifiers - 
Revision 3 - 04-19-2007
(It generalizes the issue slightly, to bring in Mongolian separators.)

This has involved looking at the use of the characters in a wide variety 
of languages (going beyond our 5 case-studies) and trying to 
characterize objectively the environments where ZWJ or ZWNJ could make a 
difference to rendering, and allow them in identifiers only in these 
environments.
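
As a rough illustration of what a context-restricted rule might look 
like, here is a Python sketch that accepts ZWJ or ZWNJ only when it 
immediately follows a virama (a character with canonical combining 
class 9, such as U+094D DEVANAGARI SIGN VIRAMA). This is a deliberate 
simplification and not the rule set of PR-96: in particular it omits any 
Arabic-script joining context, since Python's unicodedata module does 
not expose the Joining_Type property. The function name is hypothetical.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
VIRAMA_CCC = 9                  # canonical combining class of a virama


def joiners_in_allowed_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ in the label directly follows a virama.

    Simplified assumption: the only permitted context is "after virama".
    """
    for i, ch in enumerate(label):
        if ch in JOINERS:
            # A joiner at the start of the label has no preceding character.
            if i == 0:
                return False
            if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# KA + VIRAMA + ZWJ + SSA: the joiner follows a virama, so it is accepted.
devanagari_ok = joiners_in_allowed_context("\u0915\u094d\u200d\u0937")

# ZWNJ between two Latin letters: no virama before it, so it is rejected.
latin_rejected = not joiners_in_allowed_context("la\u200cbel")
```

A check this narrow would still reject legitimate Persian uses of ZWNJ, 
which is why a usable rule also has to describe the Arabic-script 
environments.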

Those concerned about this issue for their languages (notably Nepali, 
Persian etc.) may wish to consider this approach as a concrete option.

-- 
Nicholas Ostler

nicholas at ostler.net
+44 (0)1225-852865, (0)7720-889319

Chairman: Foundation for Endangered Languages
www.ogmios.org

Author: Empires of the Word (2005),
Ad Infinitum (2007), The Last Lingua Franca (2010)
www.nicholasostler.com




