[UA-discuss] The Open Dot as a label delimiter in Chinese and Japanese

Tue Nov 7 14:14:19 UTC 2017

Hi,

On Mon, Nov 06, 2017 at 04:23:50PM -0800, Jim DeLaHunt wrote:

> So, RFC5895 "Mapping Characters for Internationalized Domain Names in
> Applications (IDNA) 2008" <https://www.rfc-editor.org/rfc/rfc5895.txt>,
> section 2 "The General Procedure", says,
> 
>    4. If an implementation of this mapping is also performing the step
>    of separation of the parts of a domain name into labels by using the
>    FULL STOP character (U+002E), the IDEOGRAPHIC FULL STOP character
>    (U+3002) can be mapped to the FULL STOP before label separation
>    occurs. There are other characters that are used as "full stops"
>    that one could consider mapping as label separators, but their use
>    as such has not been investigated thoroughly.

Yes.

> And UTS #46 "Unicode IDNA Compatibility Processing"
> <http://www.unicode.org/reports/tr46/>, section 2.3 "Notation", says,
> 
>    In this document, a label is a substring of a domain name. That
>    substring is bounded on both sides by either the start or the end of
>    the string, or any of the following characters, called label-separators:
> 
>     1. U+002E ( . ) FULL STOP
>     2. U+FF0E ( ． ) FULLWIDTH FULL STOP
>     3. U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
>     4. U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP

Note that these would be covered by RFC 5895 too, by step 2 (where the
fullwidth and halfwidth characters are decomposed), but it's a more
general mechanism than that outlined by UTS #46.
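
To make the overlap concrete, here is a rough Python sketch of steps 2
and 4 of the RFC 5895 procedure. It is only an illustration, not code
from either specification: the function names are mine, and steps 1
(lowercasing) and 3 (NFC normalization) are left out for brevity.

```python
import unicodedata

def map_width_variants(text):
    """Step 2: replace characters whose decomposition type is <wide> or <narrow>."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 002E", or "" if none
        if decomp.startswith(("<wide>", "<narrow>")):
            # Drop the "<wide>"/"<narrow>" tag and rebuild the mapped characters.
            code_points = decomp.split()[1:]
            out.append("".join(chr(int(cp, 16)) for cp in code_points))
        else:
            out.append(ch)
    return "".join(out)

def split_labels(domain):
    """Apply step 2, then step 4 (U+3002 -> U+002E), then split on FULL STOP."""
    mapped = map_width_variants(domain)
    mapped = mapped.replace("\u3002", ".")
    return mapped.split(".")

if __name__ == "__main__":
    for name in ("example.com", "example\uff0ecom",
                 "example\u3002com", "example\uff61com"):
        print(ascii(name), split_labels(name))
```

The general decomposition rule turns U+FF0E into U+002E and U+FF61 into
U+3002, and step 4 then handles U+3002, so all four UTS #46 separators
end up as FULL STOP without being listed one by one.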

I think it is worth pointing out that UTS #46 is a pretty serious burr
under the saddle in the relationship between the UTC and the IETF.
This is partly because UTS #46 explicitly permits a number of labels
that are clearly not permitted under IDNA2008 (see e.g., "For
transitional use, the Compatibility Processing also allows domain
names containing symbols and punctuation that were valid in IDNA2003,
such as √.com (which has an associated web page). Such domain names
containing symbols will gradually disappear as registries shift to
IDNA2008.")  In the IETF, when we have transition mechanisms we are
generally required to specify how they work, or else they are regarded
as hand-waving.  There is basically no mechanism for such transition
in UTS#46 ("registries shift to IDNA2008" is the very same transition
as "implement IDNA2008", so it's not a mapping at all).  The UTC is
plainly the expert in the relevant character encodings and how that
all functions within applications, but it is also plainly deficient in
expertise in the area of network protocols, and the gap shows.  The
fact that the UTC and the IETF have been so far incapable of
collaborating on this topic is IMO a problem.

Part of the disagreement comes from a different stance: the IETF's
general belief is that, if you're going to fail, declare failure early
and then replace the bad protocol (and break stuff if you have to).
UTC's approach maximises stability, which means that once something is
out in the world you're more or less stuck with it (with a few limited
exceptions).  IDNA2008 was intended to break certain cases early on
the grounds that we could already see they were a problem; the most
obvious ones were nailing the protocol to a version of Unicode and the
expansion of the repertoire beyond LDH analogues.  UTS #46's approach
is, alas, delaying the reckoning with that damage, and may well have
put it off forever (the WHATWG's approach to all of this hasn't
helped).

> From my point of view as a UASG explainer, this is good and sufficient
> grounding for a recommendation that apps treat U+3002 as a label separator.
> I would go further and warn people that this list might grow; that U+FF0E
> and U+FF61 may be on their way.

That's reasonable, yes, but I would not go too far.  It's worth
remembering that domain names are, at bottom, protocol elements.
There's only so much munging one can do to protocol elements without
introducing ambiguities that can be exploited by attackers.
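
A small, hypothetical Python illustration of the kind of ambiguity I
mean: two implementations that disagree about whether U+3002 separates
labels will see different labels in the same input string. The function
names are made up for the example, and a strict IDNA2008 implementation
would presumably reject the first reading rather than accept it.

```python
def labels_full_stop_only(name):
    """Split on U+002E only; U+3002 stays inside the label."""
    return name.split(".")

def labels_with_ideographic_stop(name):
    """Map U+3002 to U+002E before splitting, as RFC 5895 step 4 permits."""
    return name.replace("\u3002", ".").split(".")

if __name__ == "__main__":
    name = "example\u3002com.test"
    # The two readings disagree on both the number and content of labels.
    print(labels_full_stop_only(name))
    print(labels_with_ideographic_stop(name))
```

Every additional separator that only some implementations honour adds
another pair of readings like this.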

> (Interesting, I just noticed that UASG007 also recommends treating the
> Arabic full stop character “۔” (U+06D4) as a label separator. UTS #46 and
> RFC5895 don't mention that.)

Yeah, it hadn't been generally studied at the time, and I'm still not
sure that the recommendation is ideal.  I have heard, but am not sure,
that in some writing systems that use the Arabic script (not the
majority ones) there is some problem in the handling of that code
point.  I'm not clear on the details, but the number of non-Arabic
languages written with Arabic characters is far larger than in the
comparable Han case.

Best regards,

A

-- 
Andrew Sullivan
ajs at anvilwalrusden.com