[Cyrillic-vip] On U+02BC

Andrew Sullivan ajs at anvilwalrusden.com
Thu Jul 7 20:40:41 UTC 2011


Dear colleagues,

On our call today, the issue of the character U+02BC, MODIFIER LETTER
APOSTROPHE, came up.  The character looks like this: ʼ.  I said I'd
follow up on it; here's what I learned.  The background section below
may be familiar to many of you; if so, feel free to skip it.

1.  Background

    1.1  Policy in the traditional DNS

It is important to recall that RFCs 1034 and 1035 came along with
implicit policies.  There is nothing about the DNS that excludes
apostrophes (') and quotation marks (") from appearing in zones.  On
the contrary, DNS labels may contain any octets at all.  But the DNS
standard says that, for maximum compatibility, it would be better to
stick to the "hostname syntax" -- i.e. to use the "LDH rule" for
domains.  In the interests of security and stability, that rule has
persisted.
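
To make the LDH rule concrete, here is a minimal sketch in Python of a
check on a single label.  The function name is_ldh_label is my own
invention for illustration.  I assume the RFC 1123 relaxation that
lets a label begin with a digit, and the usual 63-octet limit on a
label; none of this checking is done by the DNS itself.

```python
import re

# Letters, digits, and hyphens only; no leading or trailing hyphen;
# between 1 and 63 characters in total (assumed limits, see above).
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label fits the traditional hostname syntax."""
    return _LDH_LABEL.fullmatch(label) is not None


print(is_ldh_label("cant"))   # True
print(is_ldh_label("can't"))  # False: U+0027 is outside LDH
```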

    1.2  What's allowed by RFC 5892

IDNA2008 has a tricky mechanism for deciding whether a Code Point is
included in the protocol.  The rule is in RFC 5892.  Basically, you
calculate a classification for the code point according to some
derived properties.  The properties entail a certain status; anything
not covered is automatically DISALLOWED.  The algorithm is written
this way in Section 3 of RFC 5892:


   If .cp. .in.  Exceptions Then Exceptions(cp);
   Else If .cp. .in.  BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in.  Unassigned Then UNASSIGNED;
   Else If .cp. .in.  LDH Then PVALID;
   Else If .cp. .in.  JoinControl Then CONTEXTJ;
   Else If .cp. .in.  Unstable Then DISALLOWED;
   Else If .cp. .in.  IgnorableProperties Then DISALLOWED;
   Else If .cp. .in.  IgnorableBlocks Then DISALLOWED;
   Else If .cp. .in.  OldHangulJamo Then DISALLOWED;
   Else If .cp. .in.  LetterDigits Then PVALID;
   Else DISALLOWED;

".cp." stands for "code point", i.e. "this code point under
consideration".  As you can see, there are two cases where a class
automatically results in PVALID: LDH and LetterDigits.  The order of
these steps is significant.

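LDH is the set of code points found in the traditional LDH label.  As
RFC 5892 defines it, that means the hyphen (U+002D), the digits
(U+0030..U+0039), and the lower-case ASCII letters (U+0061..U+007A).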

LetterDigits is defined by the General_Category property that Unicode
assigns to the Code Point.  Anything whose General_Category is one of
{Ll, Lu, Lo, Nd, Lm, Mn, Mc} will be PVALID _as long as_ it does not
fall into any of the categories that go before it.  For instance,
capital letters would qualify as LetterDigits, except that they're
captured by the Unstable rule first (because they're not stable under
NFKC and case folding).
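
The effect of the ordering can be sketched in Python.  This is only a
partial illustration and not the full RFC 5892 derivation: it covers
just the LDH, Unstable, and LetterDigits tests, and leaves out
Exceptions, BackwardCompatible, Unassigned, JoinControl, and the
Ignorable and OldHangulJamo rules, which need data that the standard
library does not expose.  The function names are my own, and the
results depend on the Unicode version shipped with your Python.

```python
import unicodedata

LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_ldh(ch: str) -> bool:
    # Hyphen, ASCII digits, lower-case ASCII letters.
    return ch == "-" or "0" <= ch <= "9" or "a" <= ch <= "z"


def is_unstable(ch: str) -> bool:
    # Unstable: NFKC(casefold(NFKC(cp))) differs from cp.
    folded = unicodedata.normalize("NFKC", ch).casefold()
    return unicodedata.normalize("NFKC", folded) != ch


def is_letter_digits(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_DIGITS


def partial_status(ch: str) -> str:
    # Only three of the tests, but in the same relative order as the RFC.
    if is_ldh(ch):
        return "PVALID"
    if is_unstable(ch):
        return "DISALLOWED"
    if is_letter_digits(ch):
        return "PVALID"
    return "DISALLOWED"


print(is_letter_digits("A"), partial_status("A"))  # True DISALLOWED
print(is_letter_digits("a"), partial_status("a"))  # True PVALID
```

"A" has General_Category Lu, so it would pass the LetterDigits test,
but the Unstable test comes first and catches it.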

2.  The status of U+02BC

It turns out, perhaps surprisingly, that U+02BC, MODIFIER LETTER
APOSTROPHE, has property Lm:

02B0..02C1    ; Diacritic # Lm  [18] MODIFIER LETTER SMALL H..MODIFIER LETTER REVERSED GLOTTAL STOP

It does not fall under any of the earlier categories, so it is PVALID.
It is therefore legal under the protocol to include it in an
internationalized domain name.
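
If you want to check the property yourself, Python's unicodedata
module reports the General_Category.  The short sketch below compares
U+02BC with two characters that look much like it.  The helper name
describe is mine, and the check only covers the General_Category; it
says nothing about the other steps of the RFC 5892 algorithm.

```python
import unicodedata


def describe(cp: int) -> tuple[str, str]:
    """Return (General_Category, character name) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch)


for cp in (0x0027, 0x2019, 0x02BC):
    category, name = describe(cp)
    print(f"U+{cp:04X} {category} {name}")

# U+0027 Po APOSTROPHE
# U+2019 Pf RIGHT SINGLE QUOTATION MARK
# U+02BC Lm MODIFIER LETTER APOSTROPHE
```

Only U+02BC has a General_Category in the LetterDigits set; the other
two are punctuation.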

Now, it is important to recognize that the mere fact that a Code Point
is legal under IDNA2008 is not a good reason to accept registrations
of Unicode labels containing that Code Point.  As I noted above,
U+0027 APOSTROPHE (') is not allowed by the LDH rule, even though it's
a legitimate octet under the protocol.  This makes lots of things
impossible to spell in English, and can result in some funny
examples.[1]  There is a big difference between "can't" and "cant".

It is my own, personal view that the Code Point range U+02B9..U+02BD
at least (and maybe all the way through U+02C1) should not be
permitted to be registered (especially anywhere near the root), as a
matter of registration policy.  I recognize that this makes certain
words in certain languages impossible to use as U-labels.  But there
is no promise at all that all the words in a language are going to
make good labels, and to me reliable interoperation is more valuable
than the ability to write any words one wants as labels.

Best regards,

A

[1] There used to be a domain on the Internet, for instance, for the
Experts' Exchange.  It was a place where people who had technical
expertise could go and help each other.  It seems to have shut down,
perhaps because of the ubiquity of this kind of service.  But I
remember having problems with it in the porn-filtering software in use
in the library where I worked.  It looked like this:
"expertsexchange.com".  The porn filters apparently thought it was a
site about expert sex change, and wouldn't allow the filtered
computers to go there.

-- 
Andrew Sullivan
ajs at anvilwalrusden.com


