[UA-discuss] Progress on HTML and email...

Shawn Steele Shawn.Steele at microsoft.com
Tue Nov 14 22:06:07 UTC 2017


That's the discussion: to fix the email input type to allow Unicode.  Clearly humans can't be expected to enter A-labels, so the current email address validation spec is a non-starter.

The EAI RFCs are quite clear that Punycode is to be avoided and apps are supposed to use Unicode for email addresses.  The expectation is that Punycoding only needs to occur during actual mail delivery when the server needs to do the DNS resolution step. And then (obviously) only for the domain name part.
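
As a rough illustration of that split, here is a minimal Python sketch that converts only the domain part at delivery time. The helper name to_delivery_form is hypothetical, and it assumes Python's built-in "idna" codec, which implements the older IDNA2003 rules rather than IDNA2008, so treat it as an approximation of what a mail stack would do:

```python
def to_delivery_form(address: str) -> str:
    """Convert only the domain part to A-labels; the local part stays Unicode."""
    # Split on the last "@" so the local part is left untouched.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Assumption: the stdlib "idna" codec (IDNA2003) stands in for whatever
    # IDNA library the delivery layer really uses.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


print(to_delivery_form("jörg@bücher.example"))  # jörg@xn--bcher-kva.example
```

Everything above the DNS lookup can keep working with the Unicode form; only the lookup needs the converted domain.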

I expect that without other guidance, browsers wanting to support EAI would extend the validation to allow Unicode characters > U+007F.  Clearly it would be best to formalize that in an updated spec.
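
One way such an extension could look, sketched in Python. The ASCII pattern is modelled on the one in the HTML "valid e-mail address" definition; the extended variant simply admits every code point from U+0080 upward, which is an assumption about what browsers might do and not anything taken from a spec. The names build_pattern, ASCII_EMAIL and EAI_EMAIL are made up for this example:

```python
import re


def build_pattern(extra: str = "") -> re.Pattern:
    """Build an email pattern; `extra` is added to every character class."""
    atext = r"a-zA-Z0-9.!#$%&'*+/=?^_`{|}~" + extra + "-"
    alnum = "a-zA-Z0-9" + extra
    # A label starts and ends with an alphanumeric, with hyphens allowed inside.
    label = "[" + alnum + "](?:[" + alnum + "-]{0,61}[" + alnum + "])?"
    return re.compile("[" + atext + "]+@" + label + r"(?:\." + label + ")*")


# ASCII-only validation, close to what the current spec describes.
ASCII_EMAIL = build_pattern()
# Hypothetical extension: also accept any code point above U+007F.
EAI_EMAIL = build_pattern(r"\u0080-\U0010FFFF")

for candidate in ("user@example.com", "用户@例子.测试"):
    print(candidate,
          bool(ASCII_EMAIL.fullmatch(candidate)),
          bool(EAI_EMAIL.fullmatch(candidate)))
```

This is deliberately permissive: it does not check which non-ASCII code points are actually valid in a domain label, and the 63-character label limit is counted in code points here, not in the octets of the eventual A-label.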

-Shawn

-----Original Message-----
From: Mark Svancarek 
Sent: Tuesday, November 14, 2017 1:41 PM
To: Shawn Steele <Shawn.Steele at microsoft.com>; Andrew Sullivan <ajs at anvilwalrusden.com>; ua-discuss at icann.org
Subject: RE: [UA-discuss] Progress on HTML and email...

Here's the definition of the Email Input Type.
https://w3c.github.io/html/sec-forms.html#valid-e-mail-address  

My assertion was that the Email Input Type was mostly applicable to text entered by the user from a keyboard/IME.  I acknowledge that it could be piped in from other sources, but it seems to me that human entry from keyboard/IME is the primary use case.  Based on that assertion, I believe that the intention of the spec is to support only ALABELs as form inputs, regardless of what additional processing may occur once the input is submitted.  Since it is a non-goal for ALABELs to be human-friendly, it will be unacceptably hard for any nontechnical human to use a Form based on the Email Input Type if they want to Submit an address with a Unicode domain name part, in spite of the availability of Punycode converters.

Am I confused?

-----Original Message-----
From: Shawn Steele 
Sent: Tuesday, November 14, 2017 10:28 AM
To: Mark Svancarek <marksv at microsoft.com>; Andrew Sullivan <ajs at anvilwalrusden.com>; ua-discuss at icann.org
Subject: RE: [UA-discuss] Progress on HTML and email...

Um. Something's confused about that statement.  "After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all."

That's not how input works on any browser.  People type things on their keyboard (or soft keyboard or whatever) and those get translated into whatever characters the browser's using for their input boxes (hopefully Unicode).  On Windows basically the "input" from the user to the browser is UTF-16.  All of that's irrelevant as far as the HTML spec is concerned.

When the user submits the form, then it's up to the browser to send it to the server in the correctly negotiated encoding - hopefully that's UTF-8 for most sites and all browsers, though some negotiations could've stuck it in some really stupid limited codepage.
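
To make that concrete, a small Python sketch (the sample address is made up for illustration): the same string of code points can sit in memory as UTF-16 and go over the wire as UTF-8 without the characters themselves changing.

```python
# Hypothetical address used only for illustration.
address = "用户@例子.测试"

# Roughly what a Windows input box holds internally (UTF-16, little-endian).
utf16 = address.encode("utf-16-le")
# What the browser would send if UTF-8 is the negotiated form encoding.
utf8 = address.encode("utf-8")

# The byte sequences differ...
assert utf16 != utf8
# ...but both decode back to exactly the same characters.
assert utf16.decode("utf-16-le") == utf8.decode("utf-8") == address

print(len(address), "code points;", len(utf16), "UTF-16 bytes;", len(utf8), "UTF-8 bytes")
```

So the internal encoding of the input box says nothing about whether the user can "input correct UTF-8"; the conversion happens when the form is serialized.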

Hopefully the user's entering Unicode for Unicode email addresses, the browser sees that and sends it to the server in UTF-8 (assuming that's the negotiated encoding).  The server then sticks it in their database, hopefully in Unicode.  When some process actually sends the mail, then something low level's going to have to use Punycode encoding on the domain in order to resolve the name so the mail can be sent to the right server, but hopefully most of the stack is oblivious to that hack.  On a Windows box I don't think the application would actually need to deal with the Punycode at all (unless they wanted to do some sort of manual validation of the domain themselves).

I'd expect the email input type to be pretty much the same as the text input type - except for the extra validation that a browser might do for sanity checking.  (Which, I suppose could even include pinging the DNS to find out if it's a real mail server).

-Shawn

-----Original Message-----
From: Mark Svancarek 
Sent: Tuesday, November 14, 2017 7:49 AM
To: Andrew Sullivan <ajs at anvilwalrusden.com>; ua-discuss at icann.org; Shawn Steele <Shawn.Steele at microsoft.com>
Subject: RE: [UA-discuss] Progress on HTML and email...

Shawn is on the DL, but adding him explicitly for clarification.

-----Original Message-----
From: UA-discuss [mailto:ua-discuss-bounces at icann.org] On Behalf Of Andrew Sullivan
Sent: Monday, November 13, 2017 4:29 PM
To: ua-discuss at icann.org
Subject: Re: [UA-discuss] Progress on HTML and email...

On Mon, Nov 13, 2017 at 04:23:29PM +0000, Mark Svancarek wrote:
> My interpretation is that the user is a human who must enter a string of text into a web form, where it is cast as type Email (which can subsequently be converted into ULABELs if the typed-in string includes ALABELs).
> 

No, I don't think that's it.  This is the specification for HTML, not for the UI.  The user agent can do transformations.  So I _think_ it means that an email type of an input element, when it is _sent_ as input, has to be in this form; but that the input method could be different and the user agent could do a transformation on it so that Unicode user input (which could be in any form, recall) is transformed into a valid U-label/A-label pair before it becomes HTML form input.
(One feels the need for another word for "stuff that comes from the user in the UI" vs "stuff that ends up in the form as 'input' formally so defined".  There may be a term of art already in the specifications for this, but I'm not going to dig it out just now.)  For the purposes of wire transmission and storage the server-part is A-labels, but for the purposes of display they're U-labels.  Presumably, for the purposes of input they're whatever the user might input.  After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all.  We probably need someone who is working directly on browser code to say more about how this works in practice.  Maybe Shawn Steele knows?

I suspect this is slightly more obscure in the specification than I at least would like because of some of the WHATWG/W3C politics around HTML5.  (Some of the principals in WHATWG don't believe that IDNA2008 is a thing.  I will leave divining the consequences of using an IDNA specification that does not have a perfect 1:1 A-label/U-label mapping as an exercise for the reader, but I note that IDNA2008 doesn't solve the need for mappings: upper case characters aren't allowed in IDNA2008 U-labels.)
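
The mapping issue can be seen in a short Python sketch. It assumes the stdlib "idna" codec, which implements IDNA2003 and therefore case-folds during conversion; an IDNA2008 implementation would instead expect the input to be mapped before it is treated as a U-label. The helper names to_a_label and to_u_label are made up for this example:

```python
def to_a_label(domain: str) -> str:
    # Wire/storage form. Assumption: stdlib IDNA2003 codec, which maps case.
    return domain.encode("idna").decode("ascii")


def to_u_label(domain: str) -> str:
    # Display form, recovered from the A-label.
    return domain.encode("ascii").decode("idna")


typed = "BÜCHER.example"
wire = to_a_label(typed)
shown = to_u_label(wire)

print(wire)   # xn--bcher-kva.example
print(shown)  # bücher.example
# What the user typed does not survive the round trip unchanged.
print(shown == typed)  # False
```

What the user types is therefore not necessarily a U-label; some mapping step (here, lower-casing) sits between UI input and the A-label/U-label pair.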

Best regards,

A

--
Andrew Sullivan
ajs at anvilwalrusden.com

