<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 4/15/2019 5:24 AM, Andre Schappo

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:65593169-24EE-4766-AEAD-9D2F27F11B36@lboro.ac.uk">


      <div class=""><br class="">

      </div>

      I have frequently thought that one of the reasons for the complexity

      of many standards/guidelines is that they encompass the whole of

      Unicode and hence there are few constraints and those constraints

      can be difficult to understand and agree upon.

    </blockquote>

    <p>In some ways, the "all of Unicode" approach looks simple: you

      don't have to worry about where to make a cutoff, or wrangle long

      lists of "acceptable" characters.</p>

    <p>However, it runs afoul of the complexity of the many

      writing systems that Unicode supports. These writing systems do

      not play well with the basic assumption that underlies identifiers

      as "random strings of letters and digits" that can be intermixed

      freely to form (more or less) mnemonic values.</p>

    <p>For many writing systems arbitrary strings of code points don't

      work well at all. Both users and rendering engines effectively

      expect many combinations to "never occur". Some combinations may

      not have a settled appearance - how to display certain clusters

      can be up to the font.</p>
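    <p>As a rough illustration of a combination that is expected to
      "never occur", here is a minimal Python sketch that flags a
      combining mark with no base letter in front of it. The function
      name <code>has_orphan_mark</code> is made up for this example, and
      the test is far weaker than real per-script cluster rules: it
      relies only on the General_Category values exposed by Python's
      <code>unicodedata</code> module.</p>
<pre>
```python
import unicodedata


def has_orphan_mark(name):
    """Return True if a combining mark has no base letter before it.

    Assumption: "base" means any character in a letter category (L*),
    and a mark may also follow another mark. Real scripts have much
    stricter cluster rules than this.
    """
    previous_can_carry_mark = False
    for ch in name:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        if is_mark and not previous_can_carry_mark:
            return True
        previous_can_carry_mark = category.startswith("L") or is_mark
    return False


# DEVANAGARI LETTER KA followed by VOWEL SIGN AA: an ordinary syllable.
print(has_orphan_mark("\u0915\u093e"))  # False
# The same two code points reversed: the vowel sign has nothing to attach to.
print(has_orphan_mark("\u093e\u0915"))  # True
```
</pre>
    <p>Both strings use the same two code points; only the order
      differs, which is exactly what a "random string of letters" model
      does not account for.</p>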

    <p>That's all true for code points that are on people's keyboards or

      otherwise make up the subset of common, daily use. For ancient,

      obsolete and special purpose forms that Unicode supports for

      academic and archival purposes, all bets are off.</p>

    <p>On top of that, users (unless they are specialists) do not

      recognize such characters and cannot reliably distinguish them from

      similar-looking modern-use characters. They may look like an

      unexpected font variant, but not like a different character.</p>

    <p>If you want identifiers that are mnemonic and recognizable

      (preferably well enough not just to identify them, but also to

      transcribe them), you'll need to sharply limit things to

      some "modern use" subset.</p>

    <blockquote type="cite"

      cite="mid:65593169-24EE-4766-AEAD-9D2F27F11B36@lboro.ac.uk">

      <div class=""><br class="">

      </div>

      <div class="">I posit that with mailbox names, they can be

        categorised such that each category is more constrained and the

        constraints are more easily understood.</div>

      <div class=""><br class="">

      </div>

      <div class="">A mail service provider could impose further

        constraints.</div>

      <div class=""><br class="">

      </div>

      <div class="">Categories could be based on writing

        system/orthography. So one could define Japanese, Korean, Thai

        ...etc... categories for mailbox names.</div>

    </blockquote>

    <p>The first categorization that follows from the design of Unicode

      is that you need separate name spaces for each script. Too many

      scripts have overlapping (visual) repertoires while having

      distinct code points. Disallowing script mixing keeps the shape

      inventory to what each set of users expects.<br>

    </p>
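    <p>A minimal sketch of a script-mixing check in Python, under a
      loudly stated assumption: the standard <code>unicodedata</code>
      module does not expose the Script property, so the first word of
      each character name is used as a stand-in. The helper names
      <code>scripts_used</code> and <code>mixes_scripts</code> are
      invented for this example; a real implementation would work from
      Scripts.txt.</p>
<pre>
```python
import unicodedata


def scripts_used(label):
    """Approximate the set of scripts used by the letters in label.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "GREEK", ...) approximates the Script
    property. Non-letters such as digits are ignored.
    """
    found = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def mixes_scripts(label):
    """Return True if the letters of label come from more than one script."""
    return len(scripts_used(label)) > 1


print(mixes_scripts("paypal"))       # False
# Second letter replaced by CYRILLIC SMALL LETTER A (U+0430).
print(mixes_scripts("p\u0430ypal"))  # True
```
</pre>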

    <p>If you need to support multiple scripts, you can support them

      side-by-side with proper rules that disallow names that are

      whole-script spoofs of each other.</p>
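    <p>One way to picture such a rule is a "skeleton" comparison: map
      each character to a representative look-alike and refuse a new
      name whose skeleton matches an existing one. The sketch below is
      only that, a sketch. The table <code>LOOKALIKES</code> is a tiny
      hand-picked excerpt and the function names are hypothetical; the
      confusables data published with UTS #39 is far larger and its
      skeleton algorithm also involves normalization.</p>
<pre>
```python
# Hypothetical, hand-picked excerpt for illustration only.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(label):
    """Replace each character by its look-alike, if the table has one."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in label)


def first_collision(new_label, registered):
    """Return the registered label that new_label would spoof, or None."""
    target = skeleton(new_label)
    for existing in registered:
        if existing != new_label and skeleton(existing) == target:
            return existing
    return None


registered = ["pope", "admin"]
# An all-Cyrillic string (U+0440 U+043E U+0440 U+0435) that looks like "pope".
print(first_collision("\u0440\u043e\u0440\u0435", registered))  # pope
print(first_collision("rope", registered))                      # None
```
</pre>
    <p>Note that the spoofing string here is single-script, so a
      script-mixing check alone would not catch it.</p>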

    <p>While we should not lose sight of the difference between the

      formal rules for mailbox names and domain names, these issues are

      in fact fundamentally the same. They derive from the intersection

      of writing systems and Unicode's encoding model, and not so much

      from the details of your identifier syntax or identifier matching

      protocol.<br>

    </p>

    <p>The statement "A mail service provider could impose further

      constraints" is fundamentally equivalent to "A registry operator

      could impose further constraints".</p>

    <p>The problem with both is the same: neither service providers nor

      registry operators truly understand the issues with scripts and

      writing systems other than their own, or how the basic assumptions

      about text-based identifiers just don't hold up well for complex

      scripts.<br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

      cite="mid:65593169-24EE-4766-AEAD-9D2F27F11B36@lboro.ac.uk">

      <div class=""><br class="">

      </div>

      <div class="">Letʼs take category Japanese: A generalised standard

        could, for example, include some "Common" characters as well as

        Han, Hiragana and Katakana <a

          href="http://unicode.org/Public/UCD/latest/ucd/Scripts.txt"

          class="" moz-do-not-send="true">unicode.org/Public/UCD/latest/ucd/Scripts.txt</a>.

        A mail service provider could, for example, impose a further

        restriction by not allowing "Common" characters.</div>

      <div class=""><br class="">

      </div>

      <div class="">I give an example of Korean mailbox names at <a

          href="http://jsfiddle.net/coas/2uLhcfef" class=""

          moz-do-not-send="true">jsfiddle.net/coas/2uLhcfef</a> I only

        allow Korean Hangul mailbox names with the provided Korean

        Hangul domain names.</div>

      <div class=""><br class="">

      </div>

      <div class="">...and... much more controversially one could define

        a Symbols category for mailbox names. Determining which symbols

        could/should be included in such a category would require a lot

        of research and consideration.</div>

      <div class=""><br class="">

      </div>

      <div class="">If I was a mail service provider I, most likely,

        would not allow mixing of categories in mailbox names.</div>

    </blockquote>

    <p>All these are examples that are relatively trivial, because

      (other than the sheer number of characters in East Asian writing

      systems) the code points can, in fact, be placed without

      restrictions.</p>

    <p>Unrestricted placement of that kind would fail in South and Central Asian scripts.<br>

    </p>

    <p>However, not allowing a mix of Kana and Hangul, for example (with

      or without Han thrown in the mix) cuts down on presenting users

      with labels that they think they understand but that contain

      something unexpected (from another category) which they will then

      misidentify as something more familiar.</p>

    <p>About the only people who benefit from that are users intent on

      malicious use of identifiers.</p>
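    <p>For the East Asian categories discussed above, a category check
      can be sketched in a few lines of Python. Everything named here is
      an assumption made for illustration: the ranges in
      <code>RANGES</code> are simplified block ranges rather than the
      full Scripts.txt data (they omit, for example, the Katakana
      prolonged sound mark), and the <code>CATEGORIES</code> table
      treats "Korean" as Hangul only, following the jsfiddle example
      quoted earlier.</p>
<pre>
```python
# Simplified ranges, assumed for illustration; not the full Scripts.txt data.
RANGES = {
    "Hiragana": (0x3041, 0x3096),
    "Katakana": (0x30A1, 0x30FA),
    "Han": (0x4E00, 0x9FFF),
    "Hangul": (0xAC00, 0xD7A3),
}

# Hypothetical category definitions a provider might choose.
CATEGORIES = {
    "Japanese": {"Han", "Hiragana", "Katakana"},
    "Korean": {"Hangul"},
}


def script_of(ch):
    """Return the script name for ch, or None if it is outside RANGES."""
    code_point = ord(ch)
    for script, (first, last) in RANGES.items():
        if code_point in range(first, last + 1):
            return script
    return None


def matching_categories(name):
    """Return the categories whose script set covers every character."""
    scripts = {script_of(ch) for ch in name}
    if not scripts:
        return []
    return [cat for cat, allowed in CATEGORIES.items()
            if scripts.issubset(allowed)]


print(matching_categories("\u5c71\u3084\u307e"))  # Han + Hiragana: ['Japanese']
print(matching_categories("\ud55c\uad6d"))        # Hangul: ['Korean']
print(matching_categories("\ud55c\u3084\u307e"))  # Hangul + Hiragana: []
```
</pre>
    <p>A name that fits no category is simply refused, which is the
      "no mixing of categories" policy expressed as code.</p>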

    <p>That's the real danger of understanding UA as "blind acceptance"

      rather than as universal support for well-behaved (if non-native)

      identifiers. "Well-behaved" almost has to become more narrowly

      defined than the "anything goes" or "any PVALID goes" from e-mail

      or domain name standards.</p>

    <p>A./<br>

    </p>

    <blockquote type="cite"

      cite="mid:65593169-24EE-4766-AEAD-9D2F27F11B36@lboro.ac.uk">

      <div class=""><br class="">

      </div>

      <div class="">André Schappo</div>

      <div class=""><br class="">

      </div>

      <div class="">

        <div>

          <blockquote type="cite" class="">

            <div class="">On 13 Apr 2019, at 11:28, John Levine &lt;<a

                href="mailto:john.levine@standcore.com" class=""

                moz-do-not-send="true">john.levine@standcore.com</a>&gt;

              wrote:</div>

            <br class="Apple-interchange-newline">

            <div class="">

              <div class="">In article &lt;<a

href="mailto:BYAPR21MB13171918C3D2AC0E8D177983D12F0@BYAPR21MB1317.namprd21.prod.outlook.com"

                  class="" moz-do-not-send="true">BYAPR21MB13171918C3D2AC0E8D177983D12F0@BYAPR21MB1317.namprd21.prod.outlook.com</a>&gt;

                you write:<br class="">

                <blockquote type="cite" class="">-=-=-=-=-=-<br class="">

                  UASG has not endorsed emojis as part of mailbox names

                  and I doubt that we ever would.  But as mentioned

                  below, some mail systems will take a more liberal

                  approach.<br class="">

                </blockquote>

                <br class="">

                First, I have to say that I am dismayed to see that many

                in the UASG<br class="">

                do not know that mailboxes and domain names are

                different and always<br class="">

                have been.  This is an important difference, and it's

                discussed at<br class="">

                some length in UASG 012.  This would probably be a good

                time for<br class="">

                everyone who hasn't read that document to read it now,

                so at least we<br class="">

                agree on the underlying facts.<br class="">

                <br class="">

                As several people have pointed out, there are

                practically no rules for<br class="">

                what characters are technically legal in mailbox names,

                but that doesn't<br class="">

                mean that in practice you can put any junk in an address

                and expect it <br class="">

                to work.  For example, this is a valid address:<br

                  class="">

                <br class="">

                 "); @,?~]"@m.jl.ly<br class="">

                <br class="">

                but that doesn't mean I would hand it out as an address

                to anyone from<br class="">

                whom I wanted mail.<br class="">

                <br class="">

                Similarly, you can technically put random combinations

                of Hindi,<br class="">

                Arabic, Japanese, and emojis in a mailbox, but I

                wouldn't expect many<br class="">

                mail systems to deliver it and if they do deliver it I

                would expect<br class="">

                all sorts of warnings.<br class="">

                <br class="">

                One of the glaring holes in the EAI documents is that

                there is no<br class="">

                practical advice on choosing mailbox names.  We have

                developed<br class="">

                conventions for ASCII names that LDH are fine, dots and

                plus signs and<br class="">

                maybe apostrophes are OK, upper and lower case ASCII are

                generally<br class="">

                interchangeable, and beyond that you take your chances.

                 We need<br class="">

                appropriate guidance for mailbox names.  <br class="">

                <br class="">

                Before anyone suggests it, the rule for mailboxes can

                NOT be the same<br class="">

                as for IDNs, since a dot is not a separator, mailboxes

                have always<br class="">

                allowed characters not allowed in hostnames, and mail

                systems have<br class="">

                always done fuzzy matching to allow misspellings that

                wouldn't be<br class="">

                possible in domain names.<br class="">

                <br class="">

                The IETF's PRECIS working group has advice on

                identifiers that would<br class="">

                be a good place to continue from.  I don't know if the

                IETF has the<br class="">

                energy to do that, or if people here could usefully

                contribute.<br class="">

                <br class="">

                R's,<br class="">

                John<br class="">

              </div>

            </div>

          </blockquote>

        </div>

        <br class="">

        <div class="">

          <div dir="auto" style="word-wrap: break-word;

            -webkit-nbsp-mode: space; line-break: after-white-space;"

            class="">

            <div style="color: rgb(0, 0, 0); font-family: &quot;Arial

              Unicode MS&quot;; font-size: 14px; font-style: normal;

              font-variant-caps: normal; font-weight: normal;

              letter-spacing: normal; text-align: start; text-indent:

              0px; text-transform: none; white-space: normal;

              word-spacing: 0px; -webkit-text-stroke-width: 0px;">

              🌏 🌍 🌎<br class="">

              André Schappo<br class="">

              <a

href="mailto:%E5%B0%8F%E5%B1%B1@%E7%94%B5%E9%82%AE.%E5%9C%A8%E7%BA%BF?Subject=%E4%BD%A0%E5%A5%BD%E5%B0%8F%E5%B1%B1%F0%9F%98%9C"

                class="" moz-do-not-send="true">小山@电邮.在线?Subject=你好小山😜</a><br

                class="">

              <a href="https://schappo.blogspot.co.uk" class=""

                moz-do-not-send="true">schappo.blogspot.co.uk</a><br

                class="">

              <a href="https://twitter.com/andreschappo" class=""

                moz-do-not-send="true">twitter.com/andreschappo</a><br

                class="">

              <a href="https://weibo.com/andreschappo?is_all=1" class=""

                moz-do-not-send="true">weibo.com/andreschappo?is_all=1</a><br

                class="">

              <a

href="https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization"

                class="" moz-do-not-send="true">groups.google.com/forum/#!forum/computer-science-curriculum-internationalization</a><br

                class="">

              <br class="">

            </div>

          </div>

        </div>

        <br class="">

      </div>

    </blockquote>

    <p><br>

    </p>

  </body>

</html>