[UA-discuss] Progress on HTML and email...

Shawn Steele Shawn.Steele at microsoft.com
Tue Nov 14 18:28:28 UTC 2017


Um. Something's confused about that statement.  "After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all."

That's not how input works on any browser.  People type things on their keyboard (or soft keyboard or whatever) and those get translated into whatever characters the browser's using for its input boxes (hopefully Unicode).  On Windows, the "input" from the user to the browser is basically UTF-16.  All of that's irrelevant as far as the HTML spec is concerned.
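
To make that concrete, here's a rough Python sketch (the sample address is made up, and this says nothing about any particular browser's internals): the same typed text can sit in memory as UTF-16 and go out on the wire as UTF-8 without losing anything.

```python
# Hypothetical address typed by a user; the characters are what matter,
# not the byte encoding the platform uses internally.
typed = "用户@例子.测试"

# What a UTF-16 based platform might hold in memory.
in_memory = typed.encode("utf-16-le")

# What might be sent to the server if UTF-8 is the negotiated encoding.
on_the_wire = in_memory.decode("utf-16-le").encode("utf-8")

# Different bytes, identical text.
assert in_memory != on_the_wire
assert on_the_wire.decode("utf-8") == typed
```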

When the user submits the form, then it's up to the browser to send it to the server in the correctly negotiated encoding - hopefully that's UTF-8 for most sites and all browsers, though some negotiations could've stuck it in some really stupid limited codepage.
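
A minimal sketch of why the negotiated encoding matters, using Python's urllib.parse as a stand-in for what a browser does on submit.  The address is made up, and real browsers handle unencodable characters differently (typically with numeric character references rather than "?"), so treat the lossy case as illustrative only.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field holding a Unicode address.
form = {"email": "jos\u00e9@b\u00fccher.example"}

# UTF-8 submission: every character survives the round trip.
utf8_body = urlencode(form, encoding="utf-8")
utf8_back = parse_qs(utf8_body, encoding="utf-8")["email"][0]

# A limited codepage (ASCII here, with replacement) silently loses data.
lossy_body = urlencode(form, encoding="ascii", errors="replace")
lossy_back = parse_qs(lossy_body, encoding="ascii")["email"][0]

print(utf8_body)
print(utf8_back == form["email"], lossy_back)
```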

Hopefully the user's entering Unicode for Unicode email addresses, the browser sees that and sends it to the server in UTF-8 (assuming that's the negotiated encoding).  The server then sticks it in their database, hopefully in Unicode.  When some process actually sends the mail, then something low level's going to have to use Punycode encoding on the domain in order to resolve the name so the mail can be sent to the right server, but hopefully most of the stack is oblivious to that hack.  On a Windows box I don't think the application would actually need to deal with the Punycode at all (unless they wanted to do some sort of manual validation of the domain themselves).
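
For what that low-level step might look like, here's a sketch in Python.  The helper name domain_for_dns is hypothetical, and Python's built-in "idna" codec implements the older IDNA2003 rules, so this only approximates what a real mail stack would do.

```python
def domain_for_dns(address: str) -> str:
    """Return the ASCII (Punycode) form of the domain part of an address.

    Only the domain is converted; the local part is left alone, since
    that is the mail server's business, not the DNS's.
    """
    local, sep, domain = address.rpartition("@")
    if not (sep and local and domain):
        raise ValueError("expected something like local@domain")
    # Built-in codec: IDNA2003 ToASCII applied label by label.
    return domain.encode("idna").decode("ascii")


print(domain_for_dns("user@b\u00fccher.example"))
```

The rest of the stack can keep the address in Unicode; only the code that actually resolves the name needs the ASCII form.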

I'd expect the email input type to be pretty much the same as the text input type - except for the extra validation that a browser might do for sanity checking.  (Which, I suppose, could even include querying the DNS to find out whether the domain has a real mail server.)
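
A loose sketch of the kind of sanity check I mean, in Python.  The function name looks_like_email and the specific rules are my own assumptions, not what the HTML spec or any browser actually does, and it skips the DNS lookup entirely.

```python
def looks_like_email(address: str) -> bool:
    """Very rough sanity check that tolerates Unicode on both sides of '@'."""
    local, sep, domain = address.rpartition("@")
    if not (sep and local and domain):
        return False  # need something on both sides of an '@'
    if any(ch.isspace() for ch in address):
        return False  # whitespace is almost certainly a typo
    try:
        # Assumption: if the domain can't be converted by the built-in
        # (IDNA2003) codec, it couldn't be resolved anyway.
        domain.encode("idna")
    except UnicodeError:
        return False
    return True


print(looks_like_email("用户@b\u00fccher.example"))
print(looks_like_email("user name@example.com"))
```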

-Shawn

-----Original Message-----
From: Mark Svancarek 
Sent: Tuesday, November 14, 2017 7:49 AM
To: Andrew Sullivan <ajs at anvilwalrusden.com>; ua-discuss at icann.org; Shawn Steele <Shawn.Steele at microsoft.com>
Subject: RE: [UA-discuss] Progress on HTML and email...

Shawn is on the DL, but adding him explicitly for clarification.

-----Original Message-----
From: UA-discuss [mailto:ua-discuss-bounces at icann.org] On Behalf Of Andrew Sullivan
Sent: Monday, November 13, 2017 4:29 PM
To: ua-discuss at icann.org
Subject: Re: [UA-discuss] Progress on HTML and email...

On Mon, Nov 13, 2017 at 04:23:29PM +0000, Mark Svancarek wrote:
> My interpretation is that the user is a human who must enter a string of text into a web form, where it is cast as type Email (which can subsequently be converted into ULABELs if the typed-in string includes ALABELs).
> 

No, I don't think that's it.  This is the specification for HTML, not for the UI.  The user agent can do transformations.  So I _think_ it means that an email type of an input element, when it is _sent_ as input, has to be in this form; but that the input method could be different and the user agent could do a transformation on it so that Unicode user input (which could be in any form, recall) is transformed into a valid U-label/A-label pair before it becomes HTML form input.
(One feels the need for another word for "stuff that comes from the user in the UI" vs "stuff that ends up in the form as 'input' formally so defined".  There may be a term of art already in the specifications for this, but I'm not going to dig it out just now.)  For the purposes of wire transmission and storage the server-part is A-labels, but for the purposes of display they're U-labels.  Presumably, for the purposes of input they're whatever the user might input.  After all, Windows doesn't even use UTF-8 input natively, so it would literally be impossible for a Windows user to input correct UTF-8 at all.  We probably need someone who is working directly on browser code to say more about how this works in practice.  Maybe Shawn Steele knows?
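
As a small illustration of the wire-versus-display split (a Python sketch with a made-up domain; the built-in "idna" codec follows IDNA2003, so it only approximates the IDNA2008 behaviour discussed here):

```python
# A-label form, as it might be stored or sent on the wire.
wire_domain = "xn--bcher-kva.example"

# U-label form, as it might be shown to the user.
display_domain = wire_domain.encode("ascii").decode("idna")

# Converting the displayed form back yields the same A-labels.
round_trip = display_domain.encode("idna").decode("ascii")

print(display_domain, round_trip == wire_domain)
```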

I suspect this is slightly more obscure in the specification than I at least would like because of some of the WHATWG/W3C politics around HTML5.  (Some of the principals in WHATWG don't believe that IDNA2008 is a thing.  I will leave divining the consequences of using an IDNA specification that does not have a perfect 1:1 A-label/U-label mapping as an exercise for the reader, but I note that IDNA2008 doesn't solve the need for mappings: upper case characters aren't allowed in IDNA2008 U-labels.)
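
To illustrate the mapping point with a Python sketch (made-up domain; Python's built-in "idna" codec implements IDNA2003, which folds case itself, whereas under IDNA2008 a separate mapping step such as the one described in RFC 5895 would have to lower-case the input first):

```python
# What a user might type, and its lower-case equivalent.
as_typed = "B\u00dcCHER.example"
lowered = "b\u00fccher.example"

# With a mapping step applied, both end up as the same A-labels...
a_typed = as_typed.encode("idna")
a_lowered = lowered.encode("idna")

# ...but converting back only ever gives the lower-case form, so the
# user's original input and the U-label are not the same string.
back = a_typed.decode("idna")

print(a_typed == a_lowered, back == as_typed, back == lowered)
```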

Best regards,

A

--
Andrew Sullivan
ajs at anvilwalrusden.com

