[UA-discuss] Store domain in Punycode or Unicode?

Wed Apr 4 09:03:04 UTC 2018

I think the Unicode form should be stored. My reasons for recommending this are a little different.

Mostly, I use MySQL and phpMyAdmin for my database work. Storing IDNs and/or EAI addresses in Unicode form has advantages.

① I can search by constructing an SQL query in phpMyAdmin, e.g. all IDNs which contain 食品
② I can learn a lot by visual inspection, e.g. I can readily identify text as being in Korean, Thai, Sinhala, Chinese, Arabic or Cyrillic script

I could not do either of the above if only the punycode form is stored.
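A rough sketch of point ①, using Python's built-in sqlite3 in place of MySQL/phpMyAdmin so that it runs anywhere. The table and column names (domains, u_label, a_label) and the sample domains are made up for illustration, and the A-label column is derived with Python's built-in "idna" codec, which follows IDNA2003.

```python
import sqlite3

# Hypothetical table holding each domain in both forms, for comparison only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domains (u_label TEXT, a_label TEXT)")

sample = ["食品.example", "健康食品.example", "中国.example", "plain.example"]
for name in sample:
    # Derive the Punycode (A-label) form with the built-in IDNA2003 codec.
    a_form = name.encode("idna").decode("ascii")
    conn.execute("INSERT INTO domains VALUES (?, ?)", (name, a_form))

def search(column, needle):
    """Return the Unicode names whose given column contains the needle."""
    # The column name is interpolated only because it is one of two fixed names.
    query = f"SELECT u_label FROM domains WHERE {column} LIKE ?"
    return {row[0] for row in conn.execute(query, (f"%{needle}%",))}

unicode_hits = search("u_label", "食品")   # substring search on the Unicode form
punycode_hits = search("a_label", "食品")  # same search on the Punycode form
print(unicode_hits)
print(punycode_hits)
```

The search on the Unicode column finds both names containing 食品; the same search on the Punycode column finds nothing, because that column holds only ASCII.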

Basically, people can relate to the Unicode form and not the punycode form. So, if it involves people, store in the Unicode form.

Actually, there is one punycode label I always recognise, which is .xn--fiqs8s 😀 xn--fiqs8s = 中国 = China. I recognise it because I have seen it so many times and I remember when it went live, as I posted to IDNforums: http://idnforums.com/forums/26659-china-idn-cctlds-are-live.html That is the only punycode label I recognise.
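For anyone who wants to check that mapping, a minimal sketch using Python's built-in "idna" codec (which follows IDNA2003; for this label the result should be the same under IDNA2008):

```python
u_label = "中国"

# Unicode form -> Punycode (A-label) form.
a_label = u_label.encode("idna").decode("ascii")

# Punycode form -> Unicode form again.
round_trip = a_label.encode("ascii").decode("idna")

print(a_label)     # xn--fiqs8s
print(round_trip)  # 中国
```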

André Schappo

On 3 Apr 2018, at 20:51, Andrew Sullivan <ajs at anvilwalrusden.com> wrote:

Hi,

On Tue, Apr 03, 2018 at 06:36:03PM +0000, Carolyn Liu via UA-discuss wrote:

Today we do not allow customers to enter IDN (in Unicode) in our system (O365),
so customers can only enter domain in ASCII, or an IDN in Punycode
form.

To be clear, this means that domain names with labels of the form
xn--[punycode-goes-here] are allowed, but no non-LDH characters are
allowed in any domain name label; but, after permitting EAI addresses
you will accept UTF-8 in the local-part?

already brings a challenge for us since mail may come in as UTF8
form.

Under EAI, it _will_ come in that form.

allow our customers to enter a Unicode domain in O365, and which form we shall
store the domain – Unicode or punycode?

If you attempt to support IDNA2003 or at least some of the
compatibility modes of UTS#46, you effectively need to store both.
IDNA2003 can lose information in a round trip between Punycode form and
Unicode form, so you basically need to know the whole set.  This
fundamental problem was actually one of the most urgent requirements
for IDNA2008, and it's why some of us remain pretty annoyed with
UTS#46 as a strategy since one of its profiles breaks that plan
without any suggestion of how it'll eventually wean people from it.
(We didn't have a weaning suggestion either in the IDNABIS WG, which
is why we decided to break the backward compatibility in the few
cases, reasoning that pain early in deployment was less bad than pain
later.)
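The round-trip loss can be seen in a small sketch using Python's built-in "idna" codec, which implements IDNA2003. The helper names to_a_label and to_u_label and the sample domains are made up for illustration. IDNA2003 maps input before encoding (case folding, ß to "ss"), so what the user typed cannot be recovered from the ASCII form; ß is one of the characters IDNA2008 treats differently.

```python
def to_a_label(domain: str) -> str:
    """Convert to ASCII form using the built-in IDNA2003 codec."""
    return domain.encode("idna").decode("ascii")

def to_u_label(domain: str) -> str:
    """Convert an ASCII-form name back to Unicode using the same codec."""
    return domain.encode("ascii").decode("idna")

for original in ["BÜCHER.example", "straße.example"]:
    ascii_form = to_a_label(original)
    back = to_u_label(ascii_form)
    # back == original is False in both cases: the mapping step is lossy.
    print(original, "->", ascii_form, "->", back, "| same:", back == original)
```

So a system that accepts IDNA2003-style input and wants to show users what they originally entered has to keep that original alongside the converted form.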

If you're restricting your supported domains to IDNA2008, then you
don't have to care: every actual U-label is also exactly one A-label,
and conversely.  So you can store U-labels or A-labels and get the
same result.  The usual recommendation is that you store U-labels just
because storing A-labels will result in transformation for every user
event, and that might have nasty performance effects.

1. Domains in our system are unique, meaning the domain is a key. One domain shall
   only exist once and belong to one customer only.

This is true regardless of whether it's a U-label or A-label: since
they're DNS names they _must_ be unique globally within the DNS.

3. At the gateway we need to know whether a domain is in our system. The match
   logic will be as follows:
    a. Is domain in system? If so go ahead and accept.
    b. If not, is it UTF8 form? If so convert to Punycode and search again.

This sounds like a round trip plan.  Why not just run it through the
relevant algorithm and check one time?  (LDH-only names will not
undergo any transformation.  You may need a coalesce function or
similar.)
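A minimal sketch of the "run it through the algorithm and check one time" idea. The names canonical_key, accepted and is_accepted are hypothetical, the stored form is assumed to be the A-label, and Python's built-in "idna" codec (IDNA2003) stands in for whatever IDNA2008 implementation a real system would use. The same shape would work with U-labels as the stored form.

```python
def canonical_key(domain: str) -> str:
    """Normalise LDH, A-label or UTF-8 input to a single storage form."""
    # The built-in codec passes pure-ASCII labels through unchanged,
    # so lower-case explicitly to make the key case-insensitive.
    return domain.rstrip(".").encode("idna").decode("ascii").lower()

# Hypothetical store of provisioned domains, keyed by the canonical form.
accepted = {canonical_key(d) for d in ["example.com", "中国.example"]}

def is_accepted(domain: str) -> bool:
    """Single lookup, whatever form the domain arrived in."""
    return canonical_key(domain) in accepted

print(is_accepted("中国.example"))        # arrives as UTF-8
print(is_accepted("xn--fiqs8s.example"))  # arrives as Punycode
print(is_accepted("EXAMPLE.com"))         # plain LDH, different case
print(is_accepted("other.example"))       # not provisioned
```

Because every input goes through the same function before the lookup, there is no "search, convert, search again" step at the gateway.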

4. Every time when we display, we will always convert the domain to Unicode.

This is a reason to prefer U-label forms: no conversion on display,
when the user is waiting.

5. This is how DNS supports IDN. A uniform storage will make implementation a
   lot easier.

This is true.

But if we allow storing domains in Unicode, then we have to understand those in
Punycode and those in Unicode and convert back and forth.  I understand we
always need conversion, but with only one form we know we always need to
convert to the other form, whereas otherwise we might need to convert in both
directions everywhere, which is very costly and very confusing.

It is _certainly_ true that you want to pick one, and if you already
have A-labels in the system then you might have a migration problem.
That might be a reason to use A-labels for storage.

A

--
Andrew Sullivan
ajs at anvilwalrusden.com
