[UA-discuss] Store domain in Punycode or Unicode?

Andrew Sullivan ajs at anvilwalrusden.com
Tue Apr 3 19:51:28 UTC 2018


On Tue, Apr 03, 2018 at 06:36:03PM +0000, Carolyn Liu via UA-discuss wrote:
> Today we do not allow customers to enter IDN (in Unicode) in our system (O365),
> so customers can only enter domain in ASCII, or an IDN in Punycode
> form.

To be clear, this means that domain names with labels of the form
xn--[punycode-goes-here] are allowed, but no non-LDH characters are
allowed in any domain name label; but, after permitting EAI addresses
you will accept UTF-8 in the local-part?

> already brings a challenge for us since mail may come in as UTF8
> form.

Under EAI, it _will_ come in that form.

> allow our customers to enter a Unicode domain in O365, and which form we shall
> store the domain – Unicode or punycode?

If you attempt to support IDNA2003 or at least some of the
compatibility modes of UTS#46, you effectively need to store both.
IDNA2003 can lose information in a round trip between the Punycode
form and the Unicode form, so you basically need to know the whole set.  This
fundamental problem was actually one of the most urgent requirements
for IDNA2008, and it's why some of us remain pretty annoyed with
UTS#46 as a strategy since one of its profiles breaks that plan
without any suggestion of how it'll eventually wean people from it.
(We didn't have a weaning suggestion either in the IDNABIS WG, which
is why we decided to break the backward compatibility in the few
cases, reasoning that pain early in deployment was less bad than pain
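
To make the round-trip loss concrete, here is a minimal Python sketch.
It relies on Python's built-in "idna" codec, which implements IDNA2003
(RFC 3490 with Nameprep).  The name "straße.example" is only an
illustrative input, not one taken from the system under discussion.

```python
# Python's built-in "idna" codec implements IDNA2003, not IDNA2008.
original = "straße.example"          # illustrative name (assumption)

# Nameprep maps "ß" to "ss" before encoding, so the ASCII form no
# longer records what the user actually typed.
ascii_form = original.encode("idna")
round_tripped = ascii_form.decode("idna")

print(ascii_form)                    # b'strasse.example'
print(round_tripped == original)     # False: the original is not recoverable
```

Because the mapped form cannot be turned back into the input, a system
that accepts such names has to keep the user's original alongside the
ASCII form if it ever wants to show it again.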

If you're restricting your supported domains to IDNA2008, then you
don't have to care: every actual U-label is also exactly one A-label,
and conversely.  So you can store U-labels or A-labels and get the
same result.  The usual recommendation is that you store U-labels just
because storing A-labels will result in transformation for every user
event, and that might have nasty performance effects.
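
A rough Python sketch of that one-to-one correspondence follows.  It
uses only the raw "punycode" codec from the standard library and adds
the "xn--" prefix by hand.  The helper names are hypothetical, and the
sketch assumes the input is already a valid IDNA2008 U-label (lower
case, NFC); it does none of the IDNA2008 validation itself.

```python
ACE_PREFIX = "xn--"

def to_a_label(u_label: str) -> str:
    """Hypothetical helper: encode one already-validated U-label."""
    if u_label.isascii():
        return u_label               # LDH-only labels are left alone
    return ACE_PREFIX + u_label.encode("punycode").decode("ascii")

def to_u_label(a_label: str) -> str:
    """Hypothetical helper: decode one A-label back to its U-label."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

u = "bücher"
a = to_a_label(u)
print(a)                             # xn--bcher-kva
print(to_u_label(a) == u)            # True: nothing is lost either way
```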

>  1. Domains in our system are unique, meaning domain is a key. One domain shall
>     only exist once and belong to one customer only.

This is true regardless of whether it's a U-label or A-label: since
they're DNS names they _must_ be unique globally within the DNS.

>  3. At gateway we need to know whether a domain is in our system. The match
>     logic will be as follows:
>      a. Is domain in system? If so go ahead and accept.
>      b. If not, is it UTF8 form? If so convert to Punycode and search again.

This sounds like a round trip plan.  Why not just run it through the
relevant algorithm and check one time?  (LDH-only names will not
undergo any transformation.  You may need a coalesce function or
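
As a sketch of the "check one time" approach: normalize every incoming
domain to a single stored form, then do exactly one lookup.  Everything
named here is an assumption for illustration: the ACCEPTED_DOMAINS set,
the choice of A-label form as the key, and the helper names.  It also
assumes any non-ASCII label is already a valid U-label, and it
lower-cases ASCII labels because DNS matching of ASCII is
case-insensitive.

```python
# Hypothetical store of accepted domains, keyed by A-label form.
ACCEPTED_DOMAINS = {"example.com", "xn--bcher-kva.example"}

def lookup_key(domain: str) -> str:
    """Convert a domain to the single stored form, label by label."""
    labels = []
    for label in domain.rstrip(".").split("."):
        if label.isascii():
            # LDH-only labels (including existing A-labels) only get
            # their case folded.
            labels.append(label.lower())
        else:
            # Assumes a valid U-label; no IDNA2008 validation is done here.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

def is_accepted(domain: str) -> bool:
    """One normalization, one lookup: no 'search again' step."""
    return lookup_key(domain) in ACCEPTED_DOMAINS

print(is_accepted("bücher.example"))         # True
print(is_accepted("xn--bcher-kva.example"))  # True
print(is_accepted("münchen.example"))        # False
```

The same shape works if the stored form is the U-label instead; only
the direction of the per-label conversion changes.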

>  4. Every time when we display, we will always convert the domain to Unicode.

This is a reason to prefer U-label forms: no conversion on display,
when the user is waiting.

>  5. This is how DNS supports IDN. A uniform storage will make implementation a
>     lot easier.

This is true.

> But if we allow to store domain in Unicode, then we have to understand those in
> Punycode and those in Unicode and convert back and forth.  I understand we
> always need conversion, but if in only one form we know we always need to
> convert to the other form, vs we might need to convert both directions
> everywhere, very costly and very confusing.

It is _certainly_ true that you want to pick one, and if you already
have A-labels in the system then you might have a migration problem.
That might be a reason to use A-labels for storage.


Andrew Sullivan
ajs at anvilwalrusden.com
