[vip] The "Invisible Separator Characters" Issue
Nicholas Ostler
nicholas at ostler.net
Thu Jul 28 16:06:15 UTC 2011
First of all, I apologize if I am using non-standard terminology. What I
mean specifically is this: Unicode code-points that have no specific
rendering of their own, but may affect the rendering of neighbouring
codepoints in the string. Specifically we have been talking about
U+200D ZWJ zero-width joiner
and
U+200C ZWNJ zero-width non-joiner
both of which are actively used for some languages that use the Arabic
and the Devanagari scripts.
http://en.wikipedia.org/wiki/Zero-width_non-joiner shows that they can
also make a difference in German spelling, and in Hebrew.
Unicode calls these "special characters", which is not very helpful.
Contributors to these lists have argued explicitly that they are
necessary for Persian and some other Indo-Iranian languages written in
the Arabic script,
e.g. behnam at esfahbod.info - [arabic-vip] Typographical Complexity of
Arabic Script - 19/07/2011
and for Nepali, among languages written in Devanagari,
e.g. bkbal at ltk.org.np - Re: [Devanagari-vip] Document for Hindi Language
based on Policy for ccTLD .bharat - 03/07/2011
By their very nature, these characters, when placed where they have no
effect on rendering, allow identical renderings to be associated with
distinct strings of abstract characters. In this they go beyond other
"combining" codepoints such as accents and cedillas, which, taken with
neighbouring codepoints, also sometimes yield a glyph indistinguishable
from another codepoint: the joiners can have this effect on any string
of codepoints. So U+0065 [e] LATIN SMALL LETTER E followed by U+0301 [´]
COMBINING ACUTE ACCENT can result in a glyph indistinguishable from
U+00E9 [é] LATIN SMALL LETTER E WITH ACUTE; but ZWNJ or ZWJ placed
anywhere in a string (as long as its neighbours are not combining
characters, and it is not in one of the positions where it does affect
rendering) will give a string outwardly indistinguishable from one
without it.
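
The contrast can be checked mechanically. What follows is only a
minimal Python sketch; the sample strings and the helper name
strip_joiners are illustrative assumptions, not taken from any policy
document. It shows that Unicode normalization (NFC) folds the
combining-accent pair into the precomposed letter, but leaves ZWNJ in
place, so the two look-alike strings remain distinct.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Combining case: e + COMBINING ACUTE ACCENT versus precomposed e-acute.
decomposed = "e\u0301"
precomposed = "\u00e9"
same_raw = decomposed == precomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Invisible-separator case: NFC does not remove ZWNJ or ZWJ.
plain = "example"
with_zwnj = "exam" + ZWNJ + "ple"
same_after_nfc = unicodedata.normalize("NFC", with_zwnj) == plain


def strip_joiners(s: str) -> str:
    """Hypothetical helper: drop ZWJ and ZWNJ before comparing strings."""
    return s.replace(ZWNJ, "").replace(ZWJ, "")


print(same_raw, same_nfc)                  # combining case
print(same_after_nfc)                      # joiner survives normalization
print(strip_joiners(with_zwnj) == plain)   # equal once joiners are dropped
```

Stripping the joiners is shown only to make the comparison visible; it
is not offered as a policy, since it would also remove them where they
are needed.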
Seeing the danger of their use in URL identifiers (essentially an open
invitation to spoofing), the Indian approach has been simply to outlaw
them:
akshatj at cdac.in - Document for Hindi Language based on Policy for ccTLD
.bharat - 01/07/2011
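
A blanket ban of this kind is trivial to state in code. The Python
sketch below is an illustration under my own assumptions (the function
name label_allowed_strict and the sample labels are hypothetical); it
is not the text or the implementation of the .bharat policy.

```python
# ZWNJ and ZWJ, the two characters excluded by a blanket ban.
FORBIDDEN = {"\u200c", "\u200d"}


def label_allowed_strict(label: str) -> bool:
    """Hypothetical check: reject any label containing ZWJ or ZWNJ."""
    return not any(ch in FORBIDDEN for ch in label)


# A plain label passes; the same label with a hidden ZWNJ is rejected.
print(label_allowed_strict("example"))
print(label_allowed_strict("exam\u200cple"))

# An Arabic-script sample string containing ZWNJ is rejected as well,
# which is the cost of this approach for languages that rely on it.
print(label_allowed_strict("\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"))
```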
However, a less radical, but more complicated, approach, which allows
ZWJ and ZWNJ to be used where necessary, is laid out by Unicode at
http://unicode.org/review/pr-96.html
Public Review Issue #96 - Allowing Special Characters in Identifiers -
Revision 3 - 04-19-2007
(It generalizes the issue slightly, to bring in Mongolian separators.)
This has involved looking at the use of the characters in a wide variety
of languages (going beyond our 5 case-studies) and trying to
characterize objectively the environments where ZWJ or ZWNJ could make a
difference to rendering, and allow them in identifiers only in these
environments.
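
To make the idea of "allowed only in certain environments" concrete,
here is a deliberately simplified Python sketch. It checks a single
environment, namely a joiner directly following a virama (a character
whose canonical combining class is 9), as an example of the
Devanagari-type case. The function name joiners_in_context is
hypothetical, and the Unicode proposal describes further environments
(for instance between cursively joining Arabic letters) that need
joining-type data not available in Python's standard unicodedata
module, so this should not be read as an implementation of the
proposal.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
VIRAMA_CCC = 9  # canonical combining class shared by virama-type signs


def joiners_in_context(label: str) -> bool:
    """Return True if every ZWJ/ZWNJ directly follows a virama.

    Simplifying assumption: only the 'after virama' environment is
    checked; all other uses of the joiners are rejected.
    """
    for i, ch in enumerate(label):
        if ch in JOINERS:
            if i == 0:
                return False
            if unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True


# KA + VIRAMA + ZWNJ + SSA: the joiner follows a virama, so it is accepted.
print(joiners_in_context("\u0915\u094d\u200c\u0937"))
# ZWNJ between Latin letters has no rendering effect, so it is rejected.
print(joiners_in_context("ab\u200ccd"))
```

Because the Arabic-script environment is left out, this sketch would
still reject the ZWNJ that Persian needs; a fuller version would have
to add that case from the Unicode joining-type data.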
Those concerned about this issue for their languages (notably Nepali,
Persian etc.) may wish to consider this approach as a concrete option.
--
Nicholas Ostler
nicholas at ostler.net
+44 (0)1225-852865, (0)7720-889319
Chairman: Foundation for Endangered Languages
www.ogmios.org
Author: Empires of the Word (2005),
Ad Infinitum (2007), The Last Lingua Franca (2010)
www.nicholasostler.com