[arabic-vip] Variants, spelling, words, and the top level

Tue Jul 5 09:39:13 UTC 2011

Thanks Andrew ..
A few comments inline below ..

Kind Regards

--Manal

________________________________

From: arabic-vip-bounces at icann.org on behalf of Andrew Sullivan
Sent: Monday 04/07/2011 10:50 PM
To: arabic-vip at icann.org
Subject: [arabic-vip] Variants, spelling, words, and the top level (was: Next team call: Monday 4 July 6:00AM UTC)

Dear colleagues,

Dr Hussain's email from this morning gives me an opportunity to
reflect on something I think we need to keep before our minds in
exploring these issues.

On Mon, Jul 04, 2011 at 11:23:06AM -0700, Sarmad Hussain wrote:

>
> A1.3.1 conflating level 2 marks (less significance, higher optionality)
> (e.g. fatha)
>
> A1.3.2 conflating level 1 marks  (higher significance, lower optionality)
> (e.g. hamza above/below)
>
> A1.3.3 conflating level 0(?) marks  (highest significance, no optionality)
> (e.g. double-fatha) (could this be A 2.1?)

I heard quite a lot of discussion around these issues on this
morning's call.  In particular, there was considerable discussion of
how users are likely to use these and other marks, and how users will
react to their presence or absence.

Some of the examples referred explicitly to the presence or absence of
the marks changing meaning considerably.  This thread of the
conversation reminded me of something that came up recently on the
IETF apps-discuss list, in a thread that starts here:

http://www.ietf.org/mail-archive/web/apps-discuss/current/msg02873.html

The issues are completely different -- one having to do with the marks
and the other having to do with characters that are only sometimes
valid in IDNA2008 -- but there is a key similarity.  That similarity
is the issue of the meaning of a label.

Agree .. but the issue is not only similarity in meaning .. I believe what's more of an issue is visual confusion, sometimes between 2 strings with completely different meanings .. 

I think one may look at variants as strings bundled together due to: visual similarity, semantic similarity, phonetic similarity, or even an obligation towards a certain community where more than one script/language is being used, and maybe other reasons .. so I think we should be careful about which of those fall within our scope as far as the Arabic script communities are concerned ..

Another question I have: do variants include, overlap with, or have nothing to do with defensive registrations .. and if they overlap, where do we draw the line?

I think it is critical that we keep in our minds an important
principle: the DNS does not have any words in it.  It has labels.
Those labels sometimes, to some eyes, look like words.  But they are
not words, and rules that govern the use of words in a language are
simply not applicable to the DNS.  Indeed, the very reason we are able
to have IDNA is exactly that xn--6da00oz8aks does not need to be a
word, even though it is a valid A-label.  (It corresponds to the
U-label ĉєשډ, which is a U-label that nobody should ever permit to be
registered as a matter of policy, but which is, I think, still legal
under IDNA2008 if I've done my bidi checks correctly.  If it's wrong,
no matter: we could generate another example.)
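
For readers who want to check the A-label/U-label correspondence
mentioned above, here is a minimal sketch in Python.  It relies only on
the standard library's "punycode" codec and the "xn--" prefix seen in
the message.  The helper name a_label_to_u_label is hypothetical, and
the sketch performs none of the IDNA2008 validity or bidi checks, so it
says nothing about whether a label could actually be registered.

```python
import unicodedata

ACE_PREFIX = "xn--"  # prefix that marks an A-label


def a_label_to_u_label(a_label: str) -> str:
    """Decode the Punycode part of an A-label.

    Hypothetical helper for illustration only: it does not apply any
    IDNA2008 validity, contextual, or bidi rules.
    """
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an A-label: " + a_label)
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


if __name__ == "__main__":
    # List the code points behind the example label from the message.
    for ch in a_label_to_u_label("xn--6da00oz8aks"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

What comes back is a sequence of code points rather than anything a
reader would take for a word, which is the point being made.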

I fully agree .. but from a user perspective it may be looked at this way:

- that DNS cannot support all dictionary words of a certain language and

- that DNS can support words that don't necessarily have a meaning .. 

I just think that the same things could be expressed differently depending on whether we're talking to technical people or to the language community (registrants & end-users). I fully agree that it should be clear that we sometimes have to make certain compromises, and this is particularly obvious when we talk about variants .. It should be understood that certain strings/labels/words, if registered, would automatically block other variant strings/labels/words from being registered.
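
The blocking behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the set of marks treated as conflatable (fatha, damma, kasra) is chosen for the example and is not a rule agreed by the group, and variant_key and VariantRegistry are hypothetical names.

```python
# Marks treated as optional for this example only: fatha, damma, kasra.
# Which marks (if any) should be conflated is exactly the open policy question.
OPTIONAL_MARKS = {"\u064E", "\u064F", "\u0650"}


def variant_key(label: str) -> str:
    """Collapse a label to the key shared by all labels in its variant bundle."""
    return "".join(ch for ch in label if ch not in OPTIONAL_MARKS)


class VariantRegistry:
    """Toy registry: one registration blocks every other label with the same key."""

    def __init__(self):
        self._holder_by_key = {}

    def register(self, label: str) -> bool:
        key = variant_key(label)
        if key in self._holder_by_key:
            return False  # blocked by an earlier registration in the same bundle
        self._holder_by_key[key] = label
        return True


if __name__ == "__main__":
    registry = VariantRegistry()
    plain = "\u0643\u062A\u0628"                    # kaf, teh, beh without marks
    marked = "\u0643\u064E\u062A\u064E\u0628\u064E"  # same letters, each with fatha
    print(registry.register(plain))
    print(registry.register(marked))
```

Under this sketch the first label registered in a bundle wins and the rest are blocked; a real policy might instead allocate the whole bundle to the same registrant.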

Another effect of the "labels are not words" principle is that things
that are words need not be acceptable in the DNS as labels.  The DNS
itself allows labels to be any arbitrary octets you like.  But we have
had policy restrictions historically that prevent lots of "words" from
being labels.  In English, for instance, we don't allow registration
of labels with the apostrophe, even though many people's names are
spelled that way.  This is all in order to maximize the utility of the
DNS while minimizing the chances that some part of the creaky name
infrastructure we use on the Internet will break.
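
To make the apostrophe example concrete, here is a small sketch of a
label check, assuming the restriction meant here is the conventional
"letters, digits, hyphen" (LDH) form with a 63-octet limit.  The name
is_ldh_label is hypothetical, and individual registries may apply
further rules that this sketch does not cover.

```python
import re

# Letters, digits, hyphen; no leading or trailing hyphen; 1 to 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the LDH form assumed above."""
    return LDH_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ("osullivan-carpentry", "o'sullivan-carpentry"):
        print(candidate, is_ldh_label(candidate))
```

The DNS protocol itself would carry either label; it is the policy
check that turns the correctly spelled one away.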

Agree .. and I think this is exactly our role: knowing the script community's needs and the technical limitations we have, we should try to find a middle ground and hence come up with policies that are on the one hand technically implementable and on the other hand satisfy the script community's needs as far as possible ..

Now, I believe that for the purposes of understanding the issues, we
should uncover as much as is practical the cases where various
conventions of local usage and spelling cause trouble.  Issues around
things like important marks in normal spelling need to be considered
and brought forward, for it is only in understanding the actual use
people make of the script that we will understand the trade-offs we
must make. 

Fully agree .. 

But we must also identify, I think, cases where, for the benefit of
interoperability with everyone else on the Internet, it is better to
introduce restrictions on what is to be registered than to permit
possibly confusing registrations or extremely complicated validation
rules.  This is ever more important the closer to the root and generic
space one gets: for a label to go into the root zone (i.e. to be used
as a TLD) it needs to be the most broadly accessible.  That probably
means identifying ways in which users will have to bend their
expectations to the technology as well as the technology bending to
meet user expectations.  Perhaps in the case of some of these marks,
for instance, it would be a better trade-off to expect users to learn
that "those marks don't work sometimes" than to expect software
applications to have the correct sensitivity to context.  (Again, to
draw an imperfect analogy, people who speak English learned long ago
to try to look up osullivan-carpentry.com instead of
o'sullivan-carpentry.com, even though the former is a preposterous
misspelling of "O'Sullivan".  French speakers learned to use
dentremont.qc.ca instead of d'entremont.qc.ca, even though nobody who
speaks French would ever misspell d'Entremont as dentremont.)

Fully agree .. we have to be particularly cautious at the root level .. we have to acknowledge the need for, and find, the right compromise between what technology allows & disallows and the community's needs and expectations .. we also have to accept a certain level of compromise among the different Arabic-script-based language communities for interoperability and security purposes at the script/technical level ..

I hope this note is clear as to my aim, which is not to say what we
should do in the end, but that we should keep in mind the practical
limitations of the primitive technology that is the DNS.  Please feel
free to tell me the ways in which I am full of rubbish!

Not at all .. to me everything you mentioned makes a lot of sense .. 

Best,

A

--
Andrew Sullivan
ajs at anvilwalrusden.com