[arabic-vip] Typographical Complexity of Arabic Script

Behnam Esfahbod behnam at esfahbod.info
Tue Jul 19 08:13:01 UTC 2011


Andrew,
John,

Thanks for the extensive replies to my initial comment on the TLD
DNS-Labels I-D.  I would like to explain the crucial necessity of ZWNJ
in Arabic script, and some general difficulties of this script, in a
little more detail here for you and others who are not intimately
familiar with the variety of languages using the Arabic script.

Arabic script, in its most basic form, is best tailored to the Arabic
language. A salient feature of the Arabic language is its heavy
dependence on declension and inflection ('sarf' in Arabic), rather
than on compound words. This is in stark contrast with most other
languages in the Middle East and South Asia that use the Arabic
script. These languages (Persian, Urdu, Kurdish, Pashto, ...) are
mostly Indo-Iranian and rely heavily on forming compound nouns, very
much like German. The abundance of long nouns in these languages has
made it necessary to relax the letter-joining tradition of Arabic
writing. Thus a character in Arabic script which is ALWAYS joined to
its following character in Arabic words *may or may not* be joined to
its following character in the above languages, sometimes producing
two different legitimate words in the language. This is where ZWNJ
comes in, to distinguish between the two words.
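
To make the point concrete, here is a minimal Python sketch of what
happens to such a word when ZWNJ is dropped. The spelling used for
"houses" (stem plus plural suffix, separated by ZWNJ) is my own
illustrative assumption, and strip_zwnj is a hypothetical helper, not
part of any registry software or of the I-D.

```python
import unicodedata

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

# Assumed illustrative spelling of Persian "houses": stem + plural suffix.
STEM = "\u062e\u0627\u0646\u0647"  # khaneh ("house"), ends in a joining heh
SUFFIX = "\u0647\u0627"            # -ha (plural suffix), starts with heh

with_zwnj = STEM + ZWNJ + SUFFIX   # stem-final heh stays unjoined
without_zwnj = STEM + SUFFIX       # stem-final heh joins the suffix


def strip_zwnj(label: str) -> str:
    """Hypothetical helper: what remains if ZWNJ is not allowed in a label."""
    return label.replace(ZWNJ, "")


# ZWNJ is an invisible format character, but it is a real code point.
print(unicodedata.name(ZWNJ), unicodedata.category(ZWNJ))

# The two spellings are different strings of different lengths ...
print(len(with_zwnj), len(without_zwnj), with_zwnj == without_zwnj)

# ... and they collapse into one string as soon as ZWNJ is removed.
print(strip_zwnj(with_zwnj) == without_zwnj)
```

The only difference between the two strings is the one invisible code
point, so a label policy that drops or forbids it cannot represent the
unjoined spelling at all.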

In your replies I was given examples of English names that are not
supported in DNS-Labels. Let me first note that, in my opinion, the
fact that "DNS-Labels are broken for English at level X" does not
excuse having them broken for all other languages at a minimum level
of X. We should collaborate to make DNS-Labels usable and practical
for as many languages as possible. Second, I did not talk about proper
names, but about common nouns like "houses", "mountains", and "ruler".
These compound nouns are among the most basic words of any language,
including Persian.

Now, let's look at a few PVALID characters which can be considered
"confusing" in 99.99% of cases. Please look at U+0618 ARABIC SMALL
FATHA and U+064E ARABIC FATHA. These are both PVALID characters, but
even native Arabic-script users would confuse them with each other.
Or, let's look at U+0649 ARABIC LETTER ALEF MAKSURA and U+06CC ARABIC
LETTER FARSI YEH, which (as defined by the Unicode Standard) look
exactly the same in two joining forms, Final and Isolated. We have a
long list of these issues, and even worse ones that I don't want to
bring up in this discussion.
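
As a rough illustration, the following Python sketch prints the
character data for the two pairs named above and checks whether
Unicode normalization folds either pair together. The helper names
describe and still_distinct are hypothetical, and the check relies on
whatever Unicode version ships with the Python interpreter in use.

```python
import unicodedata

# The two look-alike pairs mentioned in the text above.
PAIRS = [
    ("\u0618", "\u064e"),  # ARABIC SMALL FATHA vs. ARABIC FATHA
    ("\u0649", "\u06cc"),  # ALEF MAKSURA vs. FARSI YEH
]


def describe(ch: str) -> str:
    """Hypothetical helper: code point, name and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


def still_distinct(a: str, b: str, form: str = "NFKC") -> bool:
    """True if the given normalization form keeps a and b apart."""
    return unicodedata.normalize(form, a) != unicodedata.normalize(form, b)


for a, b in PAIRS:
    print(describe(a))
    print(describe(b))
    print("distinct after NFKC:", still_distinct(a, b))
```

In other words, normalization alone does not merge these characters,
so two labels that differ only in one of these pairs remain different
labels even though a reader may not be able to tell them apart.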

Anyway, for most of these issues, it's impossible to make a general
rule and have some characters "disallowed" at some level. So, the
question is: how is ZWNJ different from any of these cases? How do you
conclude that an ICANN staff member or IANA Root-Zone administrator
may be confused by ZWNJ, but all the other cases are just "fine" for
them?

But let's not let this complexity worry you too much. With a good
typographical model (hopefully from Unicode), most of these problems
can be well defined and resolved appropriately. This has been my
concern for the past few years and I think I have made very good
progress, which I am going to share with the Unicode Consortium at
IUC35 in October [1].

I hope you find this helpful, and please let me know if you would like
any of these cases explained further.

Best regards,
-Behnam

1: http://www.unicodeconference.org/conference-at-a-glance.htm


-- 
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '      http://behnam.esfahbod.info
  *  ..   http://zwnj.org/
 *  `  *  http://persian-computing.ir
  * o *   3E7F B4B6 6F4C A8AB 9BB9 7520 5701 CA40 259E 0F8B


