[Cyrillic-vip] On U+02BC

Alexey Mykhaylov alexey at mobiry.com
Fri Jul 8 04:07:49 UTC 2011


Hello Andrew, 

Thank you for the valuable clarification.

Considering the difference between the use of the apostrophe in English and in Ukrainian, which is essentially punctuation mark (in English) vs. letter (in Ukrainian), there might, however, be a stronger non-technical argument for the use of U+02BC than there is for U+0027. Would it be accurate to summarize that, even though U+02BC is not a part of the Cyrillic Script Table, it is a part of the Language Character Repertoire for the Ukrainian language, and that it is a matter of registry policy whether or not to permit this code point for registration?

Best Regards,
Alexey

-----Original Message-----
From: cyrillic-vip-bounces at icann.org [mailto:cyrillic-vip-bounces at icann.org] On Behalf Of Andrew Sullivan
Sent: Thursday, July 07, 2011 1:41 PM
To: cyrillic-vip at icann.org
Subject: [Cyrillic-vip] On U+02BC

Dear colleagues,

On our call today, the issue of the character U+02BC, MODIFIER LETTER APOSTROPHE, came up.  The character looks like this: ʼ .  I said I'd follow up on it.  Here's what I learned.  The background section below might be familiar to many of you, and if so you can feel free to skip it.

1.  Background

    1.1  Policy in the traditional DNS

It is important to recall that RFCs 1034 and 1035 came along with implicit policies.  There is nothing about the DNS that excludes apostrophes (') and quotation marks (") from appearing in zones.  On the contrary, DNS labels may contain any octets at all.  But the DNS standard says that, for maximum compatibility, it would be better to stick to the "hostname syntax" -- i.e. to use the "LDH rule" for domains.  In the interests of security and stability, that rule has persisted.
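
As a rough illustration of the LDH rule described above, here is a minimal sketch in Python.  The function name is_ldh_label is made up for this example, and the pattern assumes the usual reading of the hostname syntax: letters, digits and hyphen only, no leading or trailing hyphen, and at most 63 octets per label.

```python
import re

# Traditional "hostname syntax" for one label (assumed reading of the LDH rule):
# ASCII letters, digits and hyphen; no hyphen at either end; 1 to 63 octets.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label satisfies the LDH rule (hypothetical helper)."""
    return LDH_LABEL.fullmatch(label) is not None


print(is_ldh_label("cant"))   # True
print(is_ldh_label("can't"))  # False: U+0027 is outside the LDH set
```

The DNS itself would carry either label; the check only models the policy layered on top of the protocol.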

    1.2  What's allowed by RFC 5892

IDNA2008 has a tricky mechanism for deciding whether a Code Point is included in the protocol.  The rule is in RFC 5892.  Basically, you calculate a classification for the code point according to some derived properties.  The properties entail a certain status; anything not covered is automatically DISALLOWED.  The algorithm is written this way in Section 3 of RFC 5892:


   If .cp. .in.  Exceptions Then Exceptions(cp);
   Else If .cp. .in.  BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in.  Unassigned Then UNASSIGNED;
   Else If .cp. .in.  LDH Then PVALID;
   Else If .cp. .in.  JoinControl Then CONTEXTJ;
   Else If .cp. .in.  Unstable Then DISALLOWED;
   Else If .cp. .in.  IgnorableProperties Then DISALLOWED;
   Else If .cp. .in.  IgnorableBlocks Then DISALLOWED;
   Else If .cp. .in.  OldHangulJamo Then DISALLOWED;
   Else If .cp. .in.  LetterDigits Then PVALID;
   Else DISALLOWED;

".cp." stands for "code point", i.e. "this code point under consideration".  As you can see, there are two cases where a class automatically results in PVALID: LDH and LetterDigits.  The order of these steps is significant.

LDH is the set of everything in the traditional LDH label.  

LetterDigits is defined by the property of the Code Point established by Unicode.  Anything with a property of any one of {Ll, Lu, Lo, Nd, Lm, Mn, Mc} will be PVALID _as long as_ it does not fall into any of the categories that go before it.  For instance, capital letters would qualify as LetterDigits, except that they're captured by the Unstable rule first (because they're not stable under NFKC and case folding).  
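
To make the ordering concrete, here is a deliberately reduced sketch in Python that keeps only three of the steps: LDH, Unstable and LetterDigits.  It is not the full RFC 5892 algorithm: the Exceptions, BackwardCompatible, Unassigned, JoinControl, IgnorableProperties, IgnorableBlocks and OldHangulJamo steps are left out, so it will give wrong answers for code points those steps cover.  The function names are invented for this example, and the results depend on the Unicode version of the Python build.

```python
import unicodedata

# General_Category values that make up LetterDigits.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_ldh(cp: str) -> bool:
    # LDH here means hyphen, ASCII digits and lowercase ASCII letters.
    return cp == "-" or "0" <= cp <= "9" or "a" <= cp <= "z"


def is_unstable(cp: str) -> bool:
    # Unstable: the code point changes under NFKC(casefold(NFKC(cp))).
    def nfkc(s: str) -> str:
        return unicodedata.normalize("NFKC", s)
    return nfkc(nfkc(cp).casefold()) != cp


def derive_simplified(cp: str) -> str:
    """Reduced derivation (assumption: the omitted steps do not apply to cp)."""
    if is_ldh(cp):
        return "PVALID"
    if is_unstable(cp):        # checked before LetterDigits, so "A" stops here
        return "DISALLOWED"
    if unicodedata.category(cp) in LETTER_DIGIT_CATEGORIES:
        return "PVALID"
    return "DISALLOWED"


for ch in ("a", "A", "\u0456", "\u0406", "\u02bc", "'"):
    print(f"U+{ord(ch):04X}", derive_simplified(ch))
```

With these simplifications, "a", U+0456 and U+02BC come out PVALID, while "A", U+0406 and U+0027 come out DISALLOWED: the capitals are caught by the Unstable step, and U+0027 fails LetterDigits because it is punctuation.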

2.  The status of U+02BC

It turns out, perhaps surprisingly, that U+02BC, MODIFIER LETTER APOSTROPHE, has property Lm:

02B0..02C1    ; Diacritic # Lm  [18] MODIFIER LETTER SMALL H..MODIFIER LETTER REVERSED GLOTTAL STOP

It does not fall under any of the other categories, so it is PVALID.  So it is legal under the protocol to include it in an (internationalized) domain name.
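
The property value is easy to check independently.  The sketch below uses Python's standard unicodedata module to compare U+02BC with U+0027; the exact data comes from whichever Unicode version the Python build ships, and the dictionary name is only for illustration.

```python
import unicodedata

# General_Category of the two apostrophe-like code points discussed here.
categories = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
    for ch in ("\u02bc", "\u0027")
}

for code_point, (name, category) in categories.items():
    print(code_point, name, category)
# U+02BC MODIFIER LETTER APOSTROPHE Lm
# U+0027 APOSTROPHE Po
```

Lm is one of the LetterDigits categories; Po (other punctuation) is not.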

Now, it is important to recognize that, just because a Code Point is legal under IDNA2008, that is not a good reason to accept registrations of Unicode labels containing that Code Point.  As I note above, U+0027 APOSTROPHE (') is not allowed by the LDH rule, even though it's a legitimate octet under the protocol.  This makes lots of things impossible to spell in English, and can result in some funny examples.[1]  There is a big difference between "can't" and "cant".

It is my own, personal view that the Code Point range U+02B9..U+02BD at least (and maybe all the way through U+02C1) should not be permitted to be registered (especially anywhere near the root), as a matter of registration policy.  I recognize that this makes certain words in certain languages impossible to use as U-labels.  But there is no promise at all that all the words in a language are going to make good labels, and to me reliable interoperation is more valuable than the ability to write any words one wants as labels.

Best regards,

A

[1] There used to be a domain on the Internet, for instance, for the Experts' Exchange.  It was a place where people who had technical expertise could go and help each other.  It seems to have shut down, perhaps because of the ubiquity of this kind of service.  But I remember having problems with it in the porn-filtering software in use in the library where I worked.  It looked like this: "expertsexchange.com".  The porn filters apparently thought it was a site about expert sex change, and wouldn't allow the filtered computers to go there.

--
Andrew Sullivan
ajs at anvilwalrusden.com



