[UA-discuss] interesting to note about emoji in mailbox name.

Mon Apr 15 16:10:14 UTC 2019

On 4/15/2019 5:24 AM, Andre Schappo wrote:
>
> I have frequently thought that one of the reasons for the complexity of 
> many standards/guidelines is that they encompass the whole of Unicode 
> and hence there are few constraints and those constraints can be 
> difficult to understand and agree upon. 

In some ways, the "all of Unicode" approach looks simple: you don't have 
to worry about where to make a cutoff, or wrangle long lists of 
"acceptable" characters.

However, it runs afoul of the complexity of the many writing systems 
that Unicode supports. These writing systems do not play well with the 
basic assumption behind identifiers: that they are "random strings of 
letters and digits" that can be intermixed freely to form (more or 
less) mnemonic values.

For many writing systems arbitrary strings of code points don't work 
well at all. Both users and rendering engines effectively expect many 
combinations to "never occur". Some combinations may not have a settled 
appearance - how to display certain clusters can be up to the font.
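
As a rough illustration of one combination that is expected to "never 
occur", the Python sketch below flags combining marks that have no base 
character in front of them. The function name orphan_marks and the rule 
itself (a mark must follow a letter, a digit, or another attached mark) 
are assumptions made for this example; real scripts have far more 
detailed sequence rules than this.

```python
import unicodedata


def orphan_marks(label):
    """Return the indexes of combining marks that lack a base character.

    Crude rule, for illustration only: a mark is orphaned when it starts
    the label or follows something other than a letter, a digit, or an
    already attached mark.
    """
    orphans = []
    has_base = False  # True while a letter or digit precedes the current position
    for index, ch in enumerate(label):
        category = unicodedata.category(ch)  # e.g. "Ll", "Mn", "Mc", "Nd"
        if category.startswith("M"):
            if not has_base:
                orphans.append(index)
            # has_base is left unchanged: marks may stack on an attached mark
        else:
            has_base = category[0] in "LN"
    return orphans
```

A label such as "e" followed by U+0301 passes, while a label that begins 
with U+0301, or a Devanagari vowel sign placed before its consonant, is 
flagged.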

That's all true for code points that are on people's keyboards or 
otherwise make up the subset of common, daily use. For ancient, obsolete 
and special purpose forms that Unicode supports for academic and 
archival purposes, all bets are off.

On top of that, users (unless they are specialists) do not recognize 
them and cannot reliably distinguish them from similar-looking 
modern-use characters. They may look like an unexpected font variant, 
but not like a different character.

If you want identifiers that are mnemonic and recognizable (preferably 
well enough not just to identify them, but also to transcribe them) 
you'll need to sharply limit things to some "modern use" subset.

>
> I posit that with mailbox names, they can be categorised such that 
> each category is more constrained and the constraints are more easily 
> understood.
>
> A mail service provider could impose further constraints.
>
> Categories could be based on writing system/orthography. So one could 
> define Japanese, Korean, Thai ...etc... categories for mailbox names.

The first categorization that follows from the design of Unicode is that 
you need separate name spaces for each script. Too many scripts have 
overlapping (visual) repertoires while having distinct code points. 
Disallowing script mixing keeps the shape inventory to what each set of 
users expects.
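
A minimal Python sketch of a script-mixing check follows. Python's 
standard library does not expose the Unicode Script property, so this 
example guesses the script from the first word of each character name; 
that shortcut, the table NAME_PREFIX_TO_SCRIPT, and the function names 
are all assumptions of this sketch. A real implementation would read 
the Script property from Scripts.txt instead.

```python
import unicodedata

# Assumption: the first word of a character name usually names its script.
# Only a handful of scripts are listed; Scripts.txt is the real authority.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "CJK": "Han",
}


def script_of(ch):
    """Approximate the script of one character from its Unicode name."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    if first_word in NAME_PREFIX_TO_SCRIPT:
        return NAME_PREFIX_TO_SCRIPT[first_word]
    if unicodedata.category(ch)[0] in "LM":
        return "Unknown"  # a letter or mark from a script not in the table
    return "Common"  # digits, punctuation, symbols


def scripts_in(local_part):
    """Return the set of scripts used, ignoring Common characters."""
    return {script_of(ch) for ch in local_part} - {"Common"}


def is_single_script(local_part):
    """True when at most one script (besides Common) appears."""
    return len(scripts_in(local_part)) <= 1
```

With this, "paypal" written with a Cyrillic U+0430 in place of the 
second letter reports both Latin and Cyrillic and is rejected, while an 
all-Latin name with dots, plus signs, or digits still passes.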

If you need to support multiple scripts, you can support them 
side-by-side with proper rules that disallow names that are whole-script 
spoofs of each other.
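
One way such a rule could be sketched in Python is to reduce every name 
to a "skeleton" in which look-alike characters are replaced by a single 
representative, and to refuse a new name whose skeleton matches an 
existing one. The table below is a tiny illustrative subset chosen for 
this example, and the function names are hypothetical; the full data is 
published by Unicode in confusables.txt.

```python
# Illustrative subset only: a few Cyrillic and Greek letters mapped to the
# Latin letters they resemble. Real data lives in Unicode's confusables.txt.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(name):
    """Replace each known look-alike with its Latin representative."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in name)


def find_spoof(candidate, registered):
    """Return a registered name that candidate visually collides with, or None."""
    target = skeleton(candidate)
    for existing in registered:
        if existing != candidate and skeleton(existing) == target:
            return existing
    return None
```

Here an all-Cyrillic name spelled U+0441 U+043E U+0440 U+0435 passes a 
single-script check, yet collides with the Latin name "cope" and would 
be refused if "cope" is already taken.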

While we should not lose sight of the difference between the formal 
rules for mailbox names and domain names, these issues are in fact 
fundamentally the same. They derive from the intersection of writing 
systems and Unicode's encoding model, and not so much from the details of 
your identifier syntax or identifier matching protocol.

The statement "A mail service provider could impose further constraints" 
is fundamentally equivalent to "A registry operator could impose 
further constraints".

The problem with both is the same: neither service providers nor 
registry operators truly understand the issues with scripts and writing 
systems other than their own, or how the basic assumptions about 
text-based identifiers just don't hold up well for complex scripts.

>
> Letʼs take category Japanese: A generalised standard could, for 
> example, include some "Common" characters as well as Han, Hiragana and 
> Katakana unicode.org/Public/UCD/latest/ucd/Scripts.txt 
> <http://unicode.org/Public/UCD/latest/ucd/Scripts.txt>. A mail service 
> provider could, for example, impose a further restriction by not 
> allowing "Common" characters.
>
> I give an example of Korean mailbox names at 
> jsfiddle.net/coas/2uLhcfef <http://jsfiddle.net/coas/2uLhcfef> I only 
> allow Korean Hangul mailbox names with the provided Korean Hangul 
> domain names.
>
> ...and... much more controversially one could define a Symbols 
> category for mailbox names. Determining which symbols could/should be 
> included in such a category would require a lot of research and 
> consideration.
>
> If I was a mail service provider I, most likely, would not allow 
> mixing of categories in mailbox names.

All of these examples are relatively trivial, because (other than the 
sheer number of characters in East Asian writing systems) the code 
points can, in fact, be placed without restrictions.

That assumption would fail in South and Central Asian scripts.

However, not allowing a mix of Kana and Hangul, for example (with or 
without Han thrown in the mix) cuts down on presenting users with labels 
that they think they understand but that contain something unexpected 
(from another category) which they will then misidentify as something 
more familiar.

About the only people who benefit from that are users intent on 
malicious use of identifiers.

That's the real danger of understanding UA as "blind acceptance" vs. 
universal support for well-behaved (if non-native) identifiers. 
"Well-behaved" almost has to become more narrowly defined than the 
"anything goes" or "any PVALID goes" from e-mail or domain name standards.

A./

>
> André Schappo
>
>> On 13 Apr 2019, at 11:28, John Levine <john.levine at standcore.com 
>> <mailto:john.levine at standcore.com>> wrote:
>>
>> In article 
>> <BYAPR21MB13171918C3D2AC0E8D177983D12F0 at BYAPR21MB1317.namprd21.prod.outlook.com 
>> <mailto:BYAPR21MB13171918C3D2AC0E8D177983D12F0 at BYAPR21MB1317.namprd21.prod.outlook.com>> 
>> you write:
>>> UASG has not endorsed emojis as part of mailbox names and I doubt 
>>> that we ever would.  But as mentioned below, some mail systems will 
>>> take a more liberal approach.
>>
>> First, I have to say that I am dismayed to see that many in the UASG
>> do not know that mailboxes and domain names are different and always
>> have been.  This is an important difference, and it's discussed at
>> some length in UASG 012.  This would probably be a good time for
>> everyone who hasn't read that document to read it now, so at least we
>> agree on the underlying facts.
>>
>> As several people have pointed out, there are practically no rules for
>> what characters are technically legal in mailbox names, but that doesn't
>> mean that in practice you can put any junk in an address and expect it
>> to work.  For example, this is a valid address:
>>
>>  "); @,?~]"@m.jl.ly
>>
>> but that doesn't mean I would hand it out as an address to anyone from
>> whom I wanted mail.
>>
>> Similarly, you can technically put random combinations of Hindi,
>> Arabic, Japanese, and emojis in a mailbox, but I wouldn't expect many
>> mail systems to deliver it and if they do deliver it I would expect
>> all sorts of warnings.
>>
>> One of the glaring holes in the EAI documents is that there is no
>> practical advice on choosing mailbox names.  We have developed
>> conventions for ASCII names that LDH are fine, dots and plus signs and
>> maybe apostrophes are OK, upper and lower case ASCII are generally
>> interchangeable, and beyond that you take your chances.  We need
>> appropriate guidance for mailbox names.
>>
>> Before anyone suggests it, the rule for mailboxes can NOT be the same
>> as for IDNs, since a dot is not a separator, mailboxes have always
>> allowed characters not allowed in hostnames, and mail systems have
>> always done fuzzy matching to allow misspellings that wouldn't be
>> possible in domain names.
>>
>> The IETF's PRECIS working group has advice on identifiers that would
>> be a good place to continue from.  I don't know if the IETF has the
>> energy to do that, or if people here could usefully contribute.
>>
>> R's,
>> John
>
> 🌏 🌍 🌎
> André Schappo
> 小山@电邮.在线?Subject=你好小山😜 
> <mailto:%E5%B0%8F%E5%B1%B1@%E7%94%B5%E9%82%AE.%E5%9C%A8%E7%BA%BF?Subject=%E4%BD%A0%E5%A5%BD%E5%B0%8F%E5%B1%B1%F0%9F%98%9C>
> schappo.blogspot.co.uk <https://schappo.blogspot.co.uk>
> twitter.com/andreschappo <https://twitter.com/andreschappo>
> weibo.com/andreschappo?is_all=1 <https://weibo.com/andreschappo?is_all=1>
> groups.google.com/forum/#!forum/computer-science-curriculum-internationalization 
> <https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization>
>
>
