[UA-discuss] Regular Expression

Thu Sep 14 17:27:01 UTC 2017

The BiDi issue suggests to me that even enforcing the rule against dotless domains is too much for a simple regex: shabaka.example at don is a valid Arabic EAI address, while the same ASCII combination is not valid even if a .don TLD gets delegated.
[non-empty]@[non-empty] looks better to me. 
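
To illustrate the difference, here is a minimal Javascript sketch comparing the regex quoted below with a plain [non-empty]@[non-empty] check. The names dottedDomainRegex and hasNonEmptyParts are made up for this example, and splitting at the last '@' is an assumption on my part rather than something agreed in this thread.

```javascript
// The regex proposed earlier in the thread: the domain must contain at
// least one '.' and the final label must be 2 or more characters long.
const dottedDomainRegex = /^.+@(?:[^.]+\.)+(?:[^.]{2,})$/;

// Hypothetical helper for [non-empty]@[non-empty]: split at the last '@'
// (an assumption) and require at least one character on each side.
function hasNonEmptyParts(address) {
  const at = address.lastIndexOf('@');
  return at > 0 && at < address.length - 1;
}

// A dotless right-hand side passes the simple check but not the regex.
console.log(hasNonEmptyParts('mailbox@tld'));        // true
console.log(dottedDomainRegex.test('mailbox@tld'));  // false

// Both checks accept the usual mailbox@domainname.tld shape.
console.log(hasNonEmptyParts('mailbox@domainname.tld'));        // true
console.log(dottedDomainRegex.test('mailbox@domainname.tld'));  // true
```

Neither check decomposes the address or converts the domain to Punycode, so both are only a first client-side filter.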

Rubens

> On 14 Sep 2017, at 13:58:000, Don Hollander <don.hollander at icann.org> wrote:
> 
> Thanks Jim.
> 
> The BiDi issue, with raw data input, is which side the domain part is on.
> 
> Usually you’ll encounter mailbox at domainname.tld
> 
> But in Arabic or Hebrew you’ll encounter tld.domainname at mailbox
> 
> Don
> 
> 
>> On 15/09/2017, at 3:44 AM, Jim Hague <jim at sinodun.com> wrote:
>> 
>> On 12/09/2017 19:44, Don Hollander wrote:
>>> One RegEx has stood out as being simple and correct.   I’d like the UASG
>>> to consider recommending this in our documentation.   Toward that end,
>>> this thread is for discussion.
>>> 
>>> /^.+@(?:[^.]+\.)+(?:[^.]{2,})$/
>>> 
>>> Regular expression check in Javascript. This accepts any Unicode
>>> characters, only insisting that the domain must have more than one label
>>> and the TLD is 2 characters or longer. 
>> 
>> Note that this is in the context of an in-browser check. I only examined a
>> small random subset of the sites surveyed in the main evaluation, and
>> obviously without access to server code could only examine client-side
>> operations. In all the sites I examined, the only check performed was
>> against one (or in one case two) regular expression(s). No decomposition
>> of the email address was attempted, and certainly no translation of the
>> domain to Punycode.
>> 
>> It was in that context that I highlighted the above regex, on the basis
>> that it's probably the only sensible option to suggest to organisations
>> as a low-impact UA improvement (I won't say fix) at the moment. If a
>> future evaluation exercise verifies that an existing Javascript module
>> does the right thing, that would be a better alternative, but that would
>> involve more substantial modifications to site code.
>> 
>> I agree that modifying it to allow 1 character TLDs would be sensible.
>> 
>> I also agree with the page referenced at the start of the thread (which
>> I read before working on the report) that just checking for '@' is about
>> all one should attempt, certainly client-side.
>> 
>> Turning again to the above regex, of course, being a proposed regex for
>> validating email addresses, it's got an obvious deficiency. It needs to
>> add support for other label separators (e.g. open dot).
>> 
>> Mark Svancarek raised the excellent point of bidi in the domain.
>> Personally I'm not confident I understand the bidi rules. But if the
>> regex requires at least one label separator character in the domain and
>> non-empty labels, will that work, given that if the regex allows 1
>> character TLDs then a valid TLD is simply a non-empty label?
>> -- 
>> Jim Hague - jim at sinodun.com          Never trust a computer you can't lift.
> 
> Don Hollander
> Universal Acceptance Steering Group
> Skype: don_hollander