[arabic-vip] ZWNJ Possible Risks/Issues

Andrew Sullivan ajs at anvilwalrusden.com
Mon Sep 26 15:22:33 UTC 2011


Dear colleagues,

I've given quite a lot of thought to ZWNJ since our meeting.  My
previous remarks in this thread have been an attempt to urge the team
to a clear statement on this matter, particularly now that the new
final Guidelines document is published.  I'll review the new proposed
text after I send this note.  But first, I want to offer some expanded
remarks that reflect my personal views, developed after our meeting.
You may take this as advice from me in my capacity as subject matter
expert.

This message is a little long, but I'm trying to be clear.  I'm not
picking on Alireza; but his message exposed perfectly two of the lines
of thinking I've been pursuing, so I'm going to use his message as a
"hook" to hang this on.

On Mon, Sep 26, 2011 at 05:45:03PM +0330, Alireza Saleh wrote:

> However there are also many other examples that show the necessity
> of using this character such as those sent earlier by others

The problem with these examples, as I've noted before, is that they
start from the assumption, "Here is a word that is common in some
language; therefore, it needs to be permitted as a TLD label."  That
premise has not yet, as near as I can see, been supported by much of
an argument.  Moreover, it is not a premise that I (or, I suspect, any
other DNS protocol expert) will grant.

To begin with, a large number of the root labels are not words.  None
of the country code TLDs are words -- or, at least, when they are
words, they are not words that mean "that country".  Of the
non-ccTLDs, the following are also not words:

    aero
    arpa
    biz
    cat
    com
    coop
    edu
    gov
    info
    int
    mil
    mobi
    org
    pro
    tel
    xxx

Some of those are common abbreviations, like info and pro.  The rest
might be read as abbreviations.  They are certainly _meaningful_: they
are intended to convey some sense of the purpose of the domain.  But
that is not the same as being words.  And one can argue pretty
strongly that at least one of them -- coop -- is misspelled, since to
communicate what that domain is intended to mean, one needs a hyphen
(co-op).  "Coop" means the place where you keep chickens (a chicken
coop).

Now, there is a problem that confronts us in some contexts: whereas in
English we have a tradition of abbreviation, some languages don't have
that tradition.  It is therefore awkward to make analogies between
English and, say, Hindi.  But we are altering the policy for
registration in the root zone, and to minimize the negative effects of
such a change one needs to do the best one can with analogies.  For
practical purposes, this means (I think) three things:

    1.  Short, potentially meaningful labels are to be preferred.

    2.  Labels do not need actually to be words.

    3.  Non-words can be close to meaningful without really being
    meaningful on their own.

(1) comes by analogy from the bulk of existing labels (travel and
museum are outliers); (2) is just entailed by the fact that many
labels aren't; and (3) comes from the analysis of how (2) plays out in
fact.

From this, I think it follows that the test is not merely whether many
words in some language use ZWNJ; nor even whether some of those words
are short.  Instead, the test should be, at least at the beginning,
whether a restriction on ZWNJ is so restrictive as to make it very
difficult to register useful mnemonics for some language community.
So far, I have not seen such an argument.

Note that a restriction on ZWNJ (and ZWJ, for that matter) need not be
the outright ban currently in the Guidelines.  For instance, one could
have a restriction that said, "Not allowed, unless you come up with a
very strong argument for why nothing else will work.  This will be
subject to review by experts in the language."  I am neither
advocating nor opposing such a restriction; I'm merely observing that
one could have different restrictions than are contemplated by the
currently-published policy.

> and most are risk-free.

The other issue that is critical in the root is this matter of risk.
The problem with a zero-width character is that by its very nature, it
is not itself visible to the user.  The result is that a user
attempting to deal with the string has to have a theory about how it
is represented: unlike every other character, the user has to know
that there is this invisible character there, and has to know how it
interacts with the other characters around it.
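
To make the point concrete, here is a minimal Python sketch.  The
Persian-style string is my own illustration, not one of the examples
from the thread; it is written with escapes so that the invisible
character is explicit in the source.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Illustrative string (an assumption, not taken from the thread):
# MEEM, FARSI YEH, ZWNJ, KHAH, WAW, ALEF, HEH, MEEM
with_zwnj = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
# The same letters with the ZWNJ left out.
without_zwnj = "\u0645\u06cc\u062e\u0648\u0627\u0647\u0645"


def code_points(label):
    """Return the code points of a label in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in label]


# ZWNJ is a format character (category Cf): it has no glyph of its own.
print(unicodedata.name(ZWNJ), unicodedata.category(ZWNJ))

# The two labels are different strings to software, whatever the user sees.
print(code_points(with_zwnj))
print(code_points(without_zwnj))
print(with_zwnj == without_zwnj)
```

The only way to tell the two labels apart reliably is to inspect the
code points, which is exactly what an ordinary user cannot do.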

In the root, the risk is not, "Will this work for some set of users?"
nor, "Will this fail to work in some contexts?"  It's instead, "Will
this sometimes cause users to be confused such that they end up going
to the wrong place?"  All of the examples so far have been examples of
how some users will understand the string and nobody else will be able
to use it.  In order to be convinced that ZWNJ is risk-free, however,
we'd need a convincing argument that a string that could somehow be
confused with the ZWNJ case could not be registered.
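
One very narrow version of such a check can be sketched as follows.
The function names and the comparison rule are hypothetical, and the
check covers only confusion that arises from the presence or absence
of a joiner control; it says nothing about fonts or writing styles.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def skeleton(label):
    """Hypothetical comparison key: the label with joiner controls removed."""
    return label.replace(ZWNJ, "").replace(ZWJ, "")


def conflicting_labels(candidate, registered):
    """Return registered labels that differ from candidate only by ZWNJ/ZWJ."""
    key = skeleton(candidate)
    return sorted(r for r in registered
                  if r != candidate and skeleton(r) == key)


# Illustrative data (an assumption, not a real registration).
registered = {"\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"}
candidate = "\u0645\u06cc\u062e\u0648\u0627\u0647\u0645"

# The candidate collides with the registered ZWNJ form.
print(conflicting_labels(candidate, registered))
```

A rule like this would block the simplest confusable pair, but it is
only one piece of the broader argument that would be needed.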

Moreover, the team has already come up with cases where the CONTEXTJ
rule in IDNA2008 is met, but the ZWNJ can't be seen anyway.  Given that
the report points to font issues and ways that a font can break the
user's expectations (or can be incomprehensible to a user if the
writing style is not what the user is used to), it seems to me there
is plenty of reason to believe that these examples are not risk-free.
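
For reference, the CONTEXTJ rule for ZWNJ (RFC 5892, Appendix A.1) can
be sketched in Python as below.  The joining-type table is a tiny
hand-copied subset that I am assuming for illustration; a real
implementation would load the full Joining_Type data from the Unicode
Character Database, and treating unlisted characters as non-joining is
a simplification of this sketch.

```python
import unicodedata

ZWNJ = "\u200c"
VIRAMA_CCC = 9  # canonical combining class "Virama"

# Illustrative subset of Joining_Type values (assumption: not complete).
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u062e": "D",  # ARABIC LETTER KHAH
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u0648": "R",  # ARABIC LETTER WAW
    "\u06cc": "D",  # ARABIC LETTER FARSI YEH
    "\u064e": "T",  # ARABIC FATHA (transparent)
}


def joining_type(ch):
    """Look up the joining type; unlisted characters count as non-joining."""
    return JOINING_TYPE.get(ch, "U")


def zwnj_context_ok(label, i):
    """Apply the RFC 5892 A.1 rule to the ZWNJ at index i of label."""
    if label[i] != ZWNJ:
        raise ValueError("no ZWNJ at that index")
    # Rule 1: ZWNJ immediately after a virama is allowed.
    if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
        return True
    # Rule 2: (L|D) T* ZWNJ T* (R|D)
    j = i - 1
    while j >= 0 and joining_type(label[j]) == "T":
        j -= 1
    if j < 0 or joining_type(label[j]) not in ("L", "D"):
        return False
    k = i + 1
    while k < len(label) and joining_type(label[k]) == "T":
        k += 1
    return k < len(label) and joining_type(label[k]) in ("R", "D")


# Between two dual-joining letters: the rule is satisfied.
print(zwnj_context_ok("\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645", 2))
# After WAW, which does not join to the following letter: rejected.
print(zwnj_context_ok("\u0648\u200c\u0645", 1))
```

Note that the rule looks only at code points.  A label that passes it
may still be rendered by some font so that the ZWNJ makes no visible
difference, which is the situation described above.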

The proposals for ZWNJ-free variants might get us there, but I'd sure
like much broader arguments (to the effect that the group is sure
there are no other corner cases) before saying ZWNJ is a good idea.  

One proposal might be to plan a study of actual ZWNJ use in some gTLD
aimed at (say) a pan-Arabic-script community and see whether there are
negative effects, as a precondition for beginning use in the root.  I don't
know how realistic such a plan would be, nor whether we'd get usable
results (I'm not actually sure how I'd design such a study); but it
might be better than starting with the root zone, where removing a
label in an effort to fix a mistake will be all but impossible.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com


